White Paper
In-depth technical papers exploring the design, implementation, and evaluation of TabbyDB’s engine-level enhancements to Apache Spark.
Each paper focuses on a specific class of performance or scalability challenges observed in complex analytical workloads.
Constraint Propagation Optimization
Eliminating Constraints Explosion
This paper introduces a redesigned constraint propagation algorithm that addresses the permutational constraints explosion observed in stock Apache Spark. By tracking aliases, canonicalizing expressions, and avoiding redundant inference, the proposed approach significantly reduces query compilation time and memory usage while guaranteeing an identical or better optimized plan.
- Key topics covered:
- Limitations of EqualsNullSafe–based inference
- Alias tracking and canonicalization strategy
- Impact on compile time and memory consumption
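The permutational growth described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not TabbyDB's or Spark's actual algorithm: constraints are plain strings, and the function names `naive_constraints` and `canonical_constraints` are hypothetical. It contrasts enumerating one constraint per combination of alias substitutions with storing a single constraint over canonical names plus an alias lookup table.

```python
from itertools import product


def naive_constraints(attrs, aliases):
    """Emit one constraint per combination of alias substitutions (toy model)."""
    # Each attribute can appear under its own name or any of its aliases.
    choices = [[attr] + aliases.get(attr, []) for attr in attrs]
    return [" + ".join(combo) + " > 10" for combo in product(*choices)]


def canonical_constraints(attrs, aliases):
    """Keep one constraint over canonical names, plus an alias -> canonical map."""
    canonical_of = {
        alias: attr for attr, names in aliases.items() for alias in names
    }
    constraint = " + ".join(attrs) + " > 10"
    return [constraint], canonical_of


# Hypothetical projection: column a is aliased twice, column b once.
aliases = {"a": ["a1", "a2"], "b": ["b1"]}

naive = naive_constraints(["a", "b"], aliases)
stored, canonical_of = canonical_constraints(["a", "b"], aliases)

print(len(naive), len(stored))
```

In this toy model the naive set grows as the product of the number of names per attribute, while the canonical form stays at one constraint and resolves aliases on lookup.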
Capping Query Plan Size
Preventing Query Plan Explosion During Analysis
This paper describes an approach to collapsing projection nodes during the analysis phase, preventing unbounded query plan growth in workloads built using iterative DataFrame APIs or deeply layered views. The technique preserves correctness and cache compatibility and improves cache lookup efficiency, while dramatically reducing compilation overhead.
- Key topics covered:
- Why query plans grow unbounded in Spark
- Limitations of late-stage project collapsing
- Early collapse strategy during analysis
- Cache-safe plan transformation
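The idea of collapsing stacked projections can be sketched on a toy plan tree. This is an illustrative sketch only, assuming a simplified plan model: `Relation`, `Project`, `substitute`, and `collapse` are hypothetical names, expressions are nested tuples, and concerns a real engine must handle (non-deterministic expressions, cache lookup, duplicated expensive expressions) are ignored.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    name: str


@dataclass
class Project:
    exprs: dict   # output column name -> expression tuple
    child: object  # Project or Relation


def substitute(expr, bindings):
    """Replace column references in expr with the child's defining expressions."""
    kind = expr[0]
    if kind == "col":
        return bindings[expr[1]]
    if kind == "lit":
        return expr
    return (kind,) + tuple(substitute(arg, bindings) for arg in expr[1:])


def collapse(plan):
    """Merge every Project-over-Project pair into a single Project."""
    if isinstance(plan, Relation):
        return plan
    child = collapse(plan.child)
    if isinstance(child, Project):
        merged = {n: substitute(e, child.exprs) for n, e in plan.exprs.items()}
        return Project(merged, child.child)
    return Project(plan.exprs, child)


def depth(plan):
    return 0 if isinstance(plan, Relation) else 1 + depth(plan.child)


# Simulate an iterative "add one column per step" loop: each step stacks a Project.
plan = Project({"x": ("col", "x")}, Relation("t"))
for i in range(1, 4):
    exprs = {name: ("col", name) for name in plan.exprs}
    exprs[f"c{i}"] = ("add", ("col", "x"), ("lit", i))
    plan = Project(exprs, plan)

collapsed = collapse(plan)
print(depth(plan), depth(collapsed))
```

Each loop iteration adds one projection layer, so the uncollapsed plan depth grows with the number of steps, while the collapsed plan keeps a single projection directly over the relation.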
Runtime File Pruning Using Broadcasted Keys
Improving Join Performance on Non-Partitioned Columns
This paper explores a runtime performance enhancement that leverages broadcast hash join keys as dynamic runtime filters. The approach enables more effective file-level and row-group pruning for joins on non-partitioned columns, extending the benefits of dynamic pruning beyond traditional partition-based strategies.
- Key topics covered:
- Limitations of Dynamic Partition Pruning (DPP)
- Using broadcast variables as runtime filters
- Range-based pruning for non-partitioned joins
- Impact on nested and complex join workloads
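Range-based pruning with broadcast keys can be sketched as follows. This is a minimal sketch under assumptions, not the paper's implementation: it assumes each file (or row group) exposes min/max statistics for the join column, and `FileStats` and `prune_files` are hypothetical names. A file is kept only if at least one broadcast key falls inside its min/max range.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_key: int  # minimum join-column value recorded for the file
    max_key: int  # maximum join-column value recorded for the file


def prune_files(files, broadcast_keys):
    """Keep only files whose [min_key, max_key] range contains a broadcast key."""
    keys = sorted(set(broadcast_keys))
    kept = []
    for f in files:
        # Find the smallest key >= min_key, then check it does not exceed max_key.
        i = bisect_left(keys, f.min_key)
        if i < len(keys) and keys[i] <= f.max_key:
            kept.append(f)
    return kept


files = [
    FileStats("part-0", 0, 99),
    FileStats("part-1", 100, 199),
    FileStats("part-2", 200, 299),
]

# Keys collected from the build side of a broadcast hash join (made-up values).
kept = prune_files(files, [42, 250, 251])
print([f.path for f in kept])
```

Sorting the keys once and using binary search keeps the per-file check logarithmic in the number of broadcast keys, which matters when many files are scanned.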
TPC-DS Benchmark Evaluation
This paper provides a transparent breakdown of TPC-DS benchmark results, including execution timelines, configuration details, and experimental setup. It also discusses the limitations of TPC-DS in representing compile-time behavior and complex real-world Spark workloads.
- Key topics covered:
- Benchmark methodology and environment setup
- Configuration details and execution phases
- Runtime observations and consistency of results
- Limitations of TPC-DS for compile-time analysis