White Papers

📄 Constraint Propagation Optimization

Subtitle: Eliminating Constraint Explosion in Apache Spark

This paper introduces a redesigned constraint propagation algorithm that addresses the permutational constraint explosion observed in stock Apache Spark. By tracking aliases, canonicalizing expressions, and avoiding redundant inference, the proposed approach significantly reduces query compilation time and memory usage without sacrificing optimization effectiveness.

Key topics covered
->Root cause of constraint blow-up in stock Spark
->Limitations of EqualsNullSafe-based inference
->Alias tracking and canonicalization strategy
->Impact on compile time and memory consumption
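
The blow-up and the alias-tracking idea can be illustrated with a toy model. This is a minimal sketch, not the algorithm from the paper and not Spark's code: a constraint is reduced to the tuple of attribute names it mentions, and the `aliases` map, `naive_constraints`, and `canonicalize` are hypothetical names chosen for the example.

```python
from itertools import product

def naive_constraints(constraint, aliases):
    """Expand one constraint into every alias substitution (the blow-up)."""
    # Each attribute can appear as itself or as any of its aliases.
    choices = [[attr] + aliases.get(attr, []) for attr in constraint]
    return set(product(*choices))

def canonicalize(constraint, aliases):
    """Rewrite every alias back to its base attribute."""
    reverse = {alias: base for base, names in aliases.items() for alias in names}
    return tuple(reverse.get(attr, attr) for attr in constraint)

# Hypothetical projection: a is also exposed as a1 and a2, b as b1 and b2.
aliases = {"a": ["a1", "a2"], "b": ["b1", "b2"]}
constraint = ("a", "b")  # stands for some predicate over a and b

# Naive propagation stores one constraint per combination of names.
expanded = naive_constraints(constraint, aliases)
print(len(expanded))  # 9

# Alias tracking stores the canonical form once and resolves names on lookup.
tracked = {canonicalize(constraint, aliases)}
print(len(tracked))  # 1
print(canonicalize(("a2", "b1"), aliases) in tracked)  # True
```

In this model the naive set grows as the product of the name counts per attribute, while the tracked set stays at one entry per distinct constraint.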

📄 Capping Query Plan Size

Subtitle: Preventing Query Plan Explosion During Analysis

This paper describes an approach to collapsing projection nodes during the analysis phase, preventing excessive query plan growth in workloads built through iterative DataFrame API calls or deeply layered views. The technique preserves correctness and cache compatibility while dramatically reducing compilation overhead.

Key topics covered
->Why query plans grow without bound in Spark
->Limitations of late-stage project collapsing
->Early collapse strategy during analysis
->Cache-safe plan transformation
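
To make the idea of collapsing stacked projections concrete, here is a minimal sketch over a toy plan representation. It is an illustration under stated assumptions, not the paper's implementation: `Scan`, `Project`, `substitute`, and `collapse` are hypothetical, expressions are plain tuples, and the sketch ignores the safeguards a real engine needs (for example around non-deterministic expressions and cached plans).

```python
from dataclasses import dataclass

@dataclass
class Scan:
    columns: list

@dataclass
class Project:
    exprs: dict    # output name -> expression over the child's columns
    child: object

def substitute(expr, bindings):
    """Replace column references in expr with the child's defining expressions."""
    if isinstance(expr, str):            # column reference
        return bindings.get(expr, expr)
    if isinstance(expr, tuple):          # (op, left, right)
        op, left, right = expr
        return (op, substitute(left, bindings), substitute(right, bindings))
    return expr                          # literal

def collapse(plan):
    """Merge every chain of adjacent Project nodes into a single Project."""
    if not isinstance(plan, Project):
        return plan
    child = collapse(plan.child)
    if isinstance(child, Project):
        merged = {name: substitute(e, child.exprs) for name, e in plan.exprs.items()}
        return Project(merged, child.child)
    return Project(plan.exprs, child)

def depth(plan):
    """Count the stacked Project nodes above the scan."""
    return 1 + depth(plan.child) if isinstance(plan, Project) else 0

# Mimic a loop that adds one derived column per iteration.
plan, cols = Scan(["c0"]), ["c0"]
for i in range(1, 6):
    exprs = {c: c for c in cols}               # pass existing columns through
    exprs[f"c{i}"] = ("+", f"c{i - 1}", 1)     # new column built on the previous one
    plan = Project(exprs, plan)
    cols.append(f"c{i}")

collapsed = collapse(plan)
print(depth(plan), depth(collapsed))  # 5 1
print(collapsed.exprs["c2"])          # ('+', ('+', 'c0', 1), 1)
```

The loop produces one projection node per iteration; after collapsing, a single node carries the fully substituted expressions.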

📄 Runtime File Pruning Using Broadcasted Keys

Subtitle: Improving Join Performance on Non-Partitioned Columns

This paper explores a runtime enhancement that reuses the keys of a broadcast hash join as dynamic filters. The approach enables more effective file-level and row-group pruning for joins on non-partitioned columns, extending the benefits of dynamic pruning beyond traditional partition-based strategies.

Key topics covered
->Limitations of Dynamic Partition Pruning (DPP)
->Using broadcast variables as runtime filters
->Range-based pruning for non-partitioned joins
->Impact on nested and complex join workloads
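
The range-based pruning idea can be sketched as follows. This is a simplified model rather than TabbyDB's implementation: it assumes each file (or row group) exposes minimum and maximum values for the join column, and `FileStats`, `prune_files`, the file names, and the key values are all hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_key: int   # smallest join-column value recorded for the file
    max_key: int   # largest join-column value recorded for the file

def prune_files(files, broadcast_keys):
    """Keep only files whose [min_key, max_key] range contains a broadcast key."""
    keys = sorted(set(broadcast_keys))
    kept = []
    for f in files:
        i = bisect_left(keys, f.min_key)            # first key >= file minimum
        if i < len(keys) and keys[i] <= f.max_key:  # that key also fits under the maximum
            kept.append(f)
    return kept

files = [
    FileStats("part-0", 0, 99),
    FileStats("part-1", 100, 199),
    FileStats("part-2", 200, 299),
    FileStats("part-3", 300, 399),
]
broadcast_keys = [17, 42, 310]   # join keys collected from the small (broadcast) side

kept = prune_files(files, broadcast_keys)
print([f.path for f in kept])  # ['part-0', 'part-3']
```

Because the check relies only on per-file value ranges, it does not require the table to be partitioned on the join column.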

📄 TPC-DS Benchmark Evaluation

Subtitle: Methodology, Configuration, and Observations

This paper provides a transparent breakdown of TPC-DS benchmark results, including execution timelines, configuration details, and experimental setup. It also discusses the limitations of TPC-DS in representing compile-time behavior and complex real-world Spark workloads.

Key topics covered
->Benchmark methodology and environment setup
->Configuration details and execution phases
->Runtime observations and consistency of results
->Limitations of TPC-DS for compile-time analysis

Frequently asked questions

Find clear answers to common questions about TabbyDB’s capabilities, performance, and integration to help you make the most of our advanced query engine.

What makes TabbyDB different from standard Apache Spark?

TabbyDB is a specialized fork of Apache Spark designed to optimize complex queries. Through compile-time and runtime enhancements, it significantly reduces compilation time and memory usage for queries with nested joins, complex CASE statements, and large query trees.

Can TabbyDB handle extremely large and complex query trees efficiently?

Yes, TabbyDB is built to manage vast and intricate query structures. Its optimizations improve both speed and resource consumption, so queries that would typically take hours to compile and run finish faster.

Is TabbyDB compatible with existing Spark DataFrame APIs?

TabbyDB maintains full compatibility with Apache Spark’s DataFrame APIs, so corporate users can keep their existing programmatic queries and benefit from the enhanced performance without changing their codebase.