White Papers

In-depth technical papers exploring the design, implementation, and evaluation of TabbyDB’s engine-level enhancements to Apache Spark.
Each paper focuses on a specific class of performance or scalability challenges observed in complex analytical workloads.

📄 Constraint Propagation Optimization

Subtitle: Eliminating Constraint Explosion in Apache Spark

This paper introduces a redesigned constraint propagation algorithm that addresses the permutational constraint explosion observed in stock Apache Spark. By tracking aliases, canonicalizing expressions, and avoiding redundant inference, the proposed approach significantly reduces query compilation time and memory usage—without sacrificing optimization effectiveness.

Key topics covered
->Root cause of constraint blow-up in stock Spark
->Limitations of EqualsNullSafe-based inference
->Alias tracking and canonicalization strategy
->Impact on compile time and memory consumption
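As a rough, standalone illustration of the problematic query shape (not code from the paper), the Scala sketch below joins two DataFrames on several aliased keys. In stock Spark, constraint inference can derive every pairwise equality among the aliases and originals, so the set of propagated predicates grows combinatorially with the number of keys. All table, column, and object names here are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Minimal sketch of a join shape that stresses Spark's constraint inference.
// The data is tiny; the plan shape is what matters.
object ConstraintBlowupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("constraint-blowup-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val left  = Seq((1, 1, 1, 1), (2, 2, 2, 2)).toDF("a1", "a2", "a3", "a4")
    val right = Seq((1, 1, 1, 1), (3, 3, 3, 3)).toDF("b1", "b2", "b3", "b4")

    // Aliasing each key and joining on every pair lets the optimizer infer
    // equalities between all aliases and originals; with many keys the number
    // of derivable constraints grows combinatorially during compilation.
    val aliased = left
      .withColumn("c1", col("a1"))
      .withColumn("c2", col("a2"))
      .withColumn("c3", col("a3"))
      .withColumn("c4", col("a4"))

    val joined = aliased.join(
      right,
      col("c1") === col("b1") && col("c2") === col("b2") &&
      col("c3") === col("b3") && col("c4") === col("b4"))

    // Inspect the analyzed and optimized plans to see the inferred predicates.
    joined.explain(true)

    spark.stop()
  }
}
```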

📄 Capping Query Plan Size

Subtitle: Preventing Query Plan Explosion During Analysis

This paper describes an approach to collapsing projection nodes during the analysis phase, preventing excessive query plan growth in workloads built using iterative DataFrame APIs or deeply layered views. The technique preserves correctness and cache compatibility while dramatically reducing compilation overhead.

Key topics covered
->Why query plans grow unbounded in Spark
->Limitations of late-stage project collapsing
->Early collapse strategy during analysis
->Cache-safe plan transformation
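To illustrate the pattern the paper targets (a sketch under assumed column names, not the paper's implementation), the Scala snippet below builds a DataFrame in a loop of withColumn calls. Each call layers another Project node on top of the previous plan, so the analyzed plan deepens with every iteration unless projections are collapsed early.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Sketch of the iterative-DataFrame pattern that inflates query plans.
// Each withColumn adds a Project node, so the analyzed plan grows with
// every iteration unless projections are collapsed during analysis.
object PlanGrowthSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("plan-growth-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    var df = Seq((1, 2), (3, 4)).toDF("x", "y")

    // A typical feature-engineering loop: hundreds of derived columns.
    for (i <- 1 to 200) {
      df = df.withColumn(s"f_$i", col("x") * lit(i) + col("y"))
    }

    // Without early project collapsing, the analyzed plan carries one
    // Project per iteration; printing its size makes the growth visible.
    println(df.queryExecution.analyzed.treeString.length)

    spark.stop()
  }
}
```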

📄 Runtime File Pruning Using Broadcasted Keys

Subtitle: Improving Join Performance on Non-Partitioned Columns

This paper explores a runtime performance enhancement that leverages broadcast hash join keys as dynamic runtime filters. The approach enables more effective file-level and row-group pruning for joins on non-partitioned columns, extending the benefits of dynamic pruning beyond traditional partition-based strategies.

Key topics covered
->Limitations of Dynamic Partition Pruning (DPP)
->Using broadcast variables as runtime filters
->Range-based pruning for non-partitioned joins
->Impact on nested and complex join workloads
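The engine-level change cannot be reproduced in user code, but a manual approximation in Scala conveys the idea: take the min and max of the build-side join key and push them as a range filter onto the large, non-partitioned probe side, so Parquet file and row-group statistics can prune data before the join runs. Paths, column names, and types below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, min}

// Manual approximation of runtime pruning with broadcast join keys:
// derive a [min, max] range from the small build side and push it onto
// the large, non-partitioned probe side before the join. Inside the
// engine, the same range is derived from the broadcast hash join keys.
object BroadcastRangePruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-range-pruning-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical datasets: a large Parquet fact table (not partitioned
    // by order_id) and a small dimension table that will be broadcast.
    val fact = spark.read.parquet("/data/fact_sales")
    val dim  = spark.read.parquet("/data/dim_recent_orders")

    // Derive the key range from the build side at runtime
    // (order_id is assumed to be a long column).
    val row = dim.agg(min(col("order_id")), max(col("order_id"))).collect().head
    val (lo, hi) = (row.getLong(0), row.getLong(1))

    // The range filter lets Parquet min/max statistics skip whole files
    // and row groups before the join executes.
    val prunedFact = fact.filter(col("order_id").between(lo, hi))

    val joined = prunedFact.join(dim, Seq("order_id"))
    joined.explain()

    spark.stop()
  }
}
```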

📄 TPC-DS Benchmark Evaluation

Subtitle: Methodology, Configuration, and Observations

This paper provides a transparent breakdown of TPC-DS benchmark results, including execution timelines, configuration details, and experimental setup. It also discusses the limitations of TPC-DS in representing compile-time behavior and complex real-world Spark workloads.

Key topics covered
->Benchmark methodology and environment setup
->Configuration details and execution phases
->Runtime observations and consistency of results
->Limitations of TPC-DS for compile-time analysis
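As a purely illustrative sketch of the kind of setup the paper documents (the actual configuration values, hardware, and queries are listed in the paper itself), a Scala driver for a single TPC-DS query might record planning time and execution time separately, along with the session settings in effect. The query path and settings below are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: separating compile (planning) time from execution
// time for one benchmark query, with the session configuration recorded
// so that the run is reproducible.
object TpcdsQueryTimingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tpcds-evaluation-sketch")
      .config("spark.sql.shuffle.partitions", "200")
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.autoBroadcastJoinThreshold", "10485760")
      .getOrCreate()

    // Hypothetical location of a TPC-DS query text.
    val query = scala.io.Source.fromFile("/queries/q1.sql").mkString

    val start = System.nanoTime()
    val df = spark.sql(query)               // parse + analyze
    df.queryExecution.executedPlan          // force optimization and physical planning
    val planned = System.nanoTime()
    df.collect()                            // execute
    val finished = System.nanoTime()

    println(s"compile ms: ${(planned - start) / 1e6}")
    println(s"execute ms: ${(finished - planned) / 1e6}")

    spark.stop()
  }
}
```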

Frequently asked questions

Find clear answers to common questions about TabbyDB’s capabilities, performance, and integration to help you make the most of our advanced query engine.

What is TabbyDB?

TabbyDB is a specialized fork of Apache Spark designed to optimize complex queries. It significantly reduces compilation time and memory usage for queries with nested joins, complex case statements, and large query trees through intelligent compile-time and runtime enhancements.

Can TabbyDB handle large and complex query workloads?

Yes. TabbyDB is built to manage vast and intricate query structures. Its optimizations improve both speed and resource consumption, enabling faster execution of queries that would otherwise take hours to compile and run.

Do I need to change my existing Spark code to use TabbyDB?

TabbyDB maintains full compatibility with Apache Spark’s DataFrame APIs, allowing users to keep their existing programmatic queries while benefiting from enhanced performance, without changing their codebase.
