Spark queries taking minutes, or hours, to run?
Hitting out-of-memory errors before execution even starts?
TabbyDB by KwikQuery is a high-performance fork of Apache Spark that eliminates query planning bottlenecks and dramatically accelerates complex SQL workloads.
The Real Bottleneck in Apache Spark!
Root Cause
• Compile time can take minutes to hours
• Happens before execution begins
• Often misdiagnosed as runtime
Why Tuning Fails
• Bottleneck is in planning, not execution
• Runtime tuning doesn’t reduce compile time
• More compute ≠ faster planning
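A quick way to check whether planning, rather than execution, dominates a slow query is to time the two phases separately. The sketch below is a minimal, engine-agnostic illustration: `time_phases`, `fake_plan`, and `fake_run` are hypothetical names introduced here, and the sleeps only stand in for real work. With PySpark, one plausible wiring (an assumption, not a prescribed method) is to use a call such as `df.explain()` as the planning step, since it builds the plan without running the job, and an action such as `df.count()` as the execution step.

```python
import time


def time_phases(plan_fn, run_fn, clock=time.perf_counter):
    """Time planning and execution separately and report each share."""
    start = clock()
    plan = plan_fn()        # build the plan without running the job
    planned = clock()
    result = run_fn(plan)   # trigger the work that actually runs the job
    done = clock()

    planning = planned - start
    execution = done - planned
    total = planning + execution
    return {
        "planning_s": planning,
        "execution_s": execution,
        "planning_share": planning / total if total > 0 else 0.0,
        "result": result,
    }


# Hypothetical stand-ins for a real workload (not Spark calls).
def fake_plan():
    time.sleep(0.05)   # pretend the optimizer is busy
    return "plan"


def fake_run(plan):
    time.sleep(0.01)   # pretend the job runs quickly once planned
    return 42


report = time_phases(fake_plan, fake_run)
print(f"planning share of wall-clock time: {report['planning_share']:.0%}")
```

If the planning share is large, adding executors or tuning shuffle settings is unlikely to help, because none of that work has started yet.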
Workarounds Backfire
• Disabling rules may reduce compile time
• But can hurt runtime performance
• Forces a tradeoff: speed vs efficiency
Impact
• Minutes to hours of delay
• Wasted compute
• Risk of missed SLOs
The Solution: KwikQuery's TabbyDB

Intelligent Compile-Time Optimizations
Introduces fundamental improvements to critical optimizer rules, including optimized constraint propagation, early project collapsing during analysis, reduced metadata calls, and other targeted enhancements.
Dramatically reduced query compile time for complex workloads
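To see why constraint propagation can become a compile-time problem, consider a toy model in which every constraint is rewritten for every combination of column aliases. This is only a simplified illustration of the growth pattern, not Spark's or TabbyDB's code; `naive_propagate` and `alias_map` are hypothetical helpers introduced for the example.

```python
from itertools import product


def naive_propagate(constraint_attrs, aliases):
    """Enumerate every alias-substituted variant of one constraint.

    constraint_attrs: attributes the constraint references, e.g. ("a", "b")
    aliases: maps an attribute to the alias names projected from it
    """
    choices = [[attr] + aliases.get(attr, []) for attr in constraint_attrs]
    return list(product(*choices))


def alias_map(attrs, n):
    """Give each attribute n aliases, e.g. a -> a_0, a_1, ..."""
    return {attr: [f"{attr}_{i}" for i in range(n)] for attr in attrs}


attrs = ("a", "b", "c")  # one constraint such as a + b + c > 10
for n in (1, 3, 9):
    variants = naive_propagate(attrs, alias_map(attrs, n))
    print(f"{n} aliases per column -> {len(variants)} constraint variants")
```

In this model a single three-column constraint grows from 8 variants to 1,000 as aliases are added, while keeping the original constraint plus the alias map would grow only linearly. That alternative is mentioned as one possible direction, not as a description of TabbyDB internals.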

Advanced Broadcast Hash Join Handling
Extends broadcast hash join execution to enable dynamic file pruning for non-partitioned joins, reducing unnecessary data scans during runtime.
Improved runtime performance for nested and complex join queries
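The general idea behind file pruning on a non-partition join column can be sketched as follows, assuming per-file min/max statistics exist for that column and the join is an inner join. Everything here (`FileStats`, `prune_files`, the file names and key ranges) is hypothetical and only illustrates the concept; it is not TabbyDB's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_key: int  # smallest join-key value stored in the file
    max_key: int  # largest join-key value stored in the file


def prune_files(files, build_keys):
    """Keep files whose [min_key, max_key] range can hold a build-side key.

    For an inner join, a file with no possible match contributes no rows,
    so it does not need to be scanned at all.
    """
    keys = set(build_keys)
    return [
        f for f in files
        if any(f.min_key <= k <= f.max_key for k in keys)
    ]


files = [
    FileStats("part-0", 0, 99),
    FileStats("part-1", 100, 199),
    FileStats("part-2", 200, 299),
    FileStats("part-3", 300, 399),
]
build_keys = [105, 110, 310]  # keys collected from the small (broadcast) side

kept = prune_files(files, build_keys)
print([f.path for f in kept])
```

Here only two of the four files overlap the broadcast side's keys, so the other two can be skipped before any rows are read.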

Improved Cache Lookup Efficiency
Enhances how cached in-memory query plans are matched and reused, increasing the likelihood of successful cache hits in layered and iterative workloads.
Higher cache reuse and lower execution overhead
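Cache lookups depend on recognizing that two plans are equivalent even when they were built separately and carry different internal identifiers. The toy below shows that idea with a simple canonical form; the tuple-based plan shape and the `canonicalize` and `build_plan` helpers are assumptions made for illustration and do not reproduce Spark's or TabbyDB's matching logic.

```python
def canonicalize(plan):
    """Renumber attribute ids as 0, 1, 2... in order of first appearance."""
    mapping = {}

    def walk(node):
        if isinstance(node, tuple) and node and node[0] == "attr":
            _, name, attr_id = node
            new_id = mapping.setdefault(attr_id, len(mapping))
            return ("attr", name, new_id)
        if isinstance(node, tuple):
            return tuple(walk(child) for child in node)
        return node

    return walk(plan)


def build_plan(first_id, threshold=100):
    """Build 'SELECT * FROM sales WHERE amount > threshold' with fresh ids."""
    id_col = ("attr", "id", first_id)
    amount_col = ("attr", "amount", first_id + 1)
    scan = ("scan", "sales", (id_col, amount_col))
    return ("filter", ("gt", amount_col, threshold), scan)


plan_a = build_plan(first_id=10)   # plan that was cached earlier
plan_b = build_plan(first_id=57)   # same query, built again with new ids

cache = {canonicalize(plan_a): "cached result for sales filter"}

print(plan_a == plan_b)                    # raw plans differ by id
print(cache.get(canonicalize(plan_b)))     # canonical form still matches
```

A lookup that compared raw plans would miss here, while the canonical form produces a hit. A plan with a different filter threshold still misses, as it should.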

Scalable Query Tree Management
Safely collapses redundant project nodes earlier in the query lifecycle, preventing unbounded query plan growth and reducing memory pressure during compilation.
Faster compilation and reduced risk of out-of-memory failures
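Iterative transformations that add one column at a time tend to stack one project node per step. The toy plan model below, with hypothetical helpers `with_column`, `collapse`, `depth`, and `run`, illustrates how merging adjacent project nodes keeps the tree shallow without changing results. A real optimizer must also check when merging is safe, which this sketch skips, and nothing here is TabbyDB's actual code.

```python
def evaluate(expr, row):
    """Evaluate a tiny expression tree against one row (a dict)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    if kind == "add":
        return evaluate(expr[1], row) + evaluate(expr[2], row)
    raise ValueError(f"unknown expression: {kind}")


def substitute(expr, defs):
    """Replace column references with the expressions that define them."""
    kind = expr[0]
    if kind == "col":
        return defs[expr[1]]
    if kind == "add":
        return ("add", substitute(expr[1], defs), substitute(expr[2], defs))
    return expr


def with_column(plan, columns, name, expr):
    """Stack a project that keeps existing columns and adds one more."""
    projection = {c: ("col", c) for c in columns}
    projection[name] = expr
    return ("project", projection, plan), columns + [name]


def collapse(plan):
    """Merge every project node into the project directly beneath it."""
    if plan[0] != "project":
        return plan
    _, outer, child = plan
    child = collapse(child)
    if child[0] == "project":
        _, inner, grandchild = child
        merged = {n: substitute(e, inner) for n, e in outer.items()}
        return ("project", merged, grandchild)
    return ("project", outer, child)


def depth(plan):
    return 0 if plan[0] == "relation" else 1 + depth(plan[2])


def run(plan, row):
    if plan[0] == "relation":
        return row
    child_row = run(plan[2], row)
    return {n: evaluate(e, child_row) for n, e in plan[1].items()}


plan, columns, prev = ("relation",), ["x"], "x"
for i in range(1, 21):  # 20 chained "add a column" steps
    expr = ("add", ("col", prev), ("lit", 1))
    plan, columns = with_column(plan, columns, f"c{i}", expr)
    prev = f"c{i}"

collapsed = collapse(plan)
print(depth(plan), depth(collapsed))
```

The stacked plan has one project node per step, while the collapsed plan has a single project over the base relation and returns the same rows.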

Seamless Spark Compatibility
Maintains full compatibility with Apache Spark APIs, features, and tooling while delivering performance improvements.
Drop-in adoption with no code changes or workflow disruption
Evaluate TabbyDB on Real Workloads
Explore how TabbyDB behaves on real, complex queries using interactive Zeppelin notebooks.
Each notebook runs the same SQL and data on both stock Apache Spark and KwikQuery’s TabbyDB, allowing you to directly observe differences in planning behavior and execution flow.
What to Expect
->Identical queries executed in both environments
->Visibility into query planning and execution stages
->Representative workloads that reflect complex, real-world usage
Benchmark Results: Execution-Time Focus
TPC-DS Context
TPC-DS is a widely used analytical benchmark, but it does not fully represent the complexity of many production Spark workloads. Query structure is constrained, nesting depth is limited, and compilation overhead is typically not a dominant factor.
In limited TPC-DS testing on 1 TB and 2 TB datasets, using Spark with Hive external, non-partitioned tables, TabbyDB demonstrated:
->A ~13% reduction in query execution time compared to stock Spark
->Improvements observed consistently across queries, rather than driven by isolated outliers
These results primarily reflect runtime execution behavior, as TPC-DS queries do not meaningfully stress query compilation.
Observed Behavior: Complex Analytic Queries
In production analytics environments—particularly those involving:
->Programmatically generated SQL
->Iterative DataFrame transformations
->Deeply layered views or very large logical plans
query behavior differs significantly from benchmark workloads.
In such scenarios, stock Spark has been observed to:
->Spend extended periods in query compilation, ranging from tens of minutes to multiple hours
->Encounter planner memory pressure or compilation failures before execution begins
When evaluated under similar workload characteristics, TabbyDB has been observed to:
->Reduce compilation time to practical ranges (minutes or seconds), depending on query structure and environment
->Allow queries to reliably reach execution where compilation previously dominated overall latency
How to Interpret These Results
->Benchmarks illustrate execution-time improvements under controlled conditions
->Real-world workloads highlight differences in compilation behavior that benchmarks do not capture
->Actual results will vary based on query complexity, data layout, and environment
TPC-DS Run Benchmark Comparison


Resolved Issues and Enhancements in TabbyDB
TabbyDB incorporates targeted fixes and improvements addressing well-documented Apache Spark issues that impact query compilation, runtime performance, and correctness—particularly in complex analytical workloads.
These enhancements improve stability, predictability, and performance while remaining compatible with existing Spark APIs and behavior.
Performance Issues
The following issues are associated with excessive compilation time, inefficient optimization behavior, or suboptimal runtime execution, especially in large or complex query plans:
-> SPARK-33152: Constraint Propagation rule causing query compilation times to run into hours.
-> SPARK-36786: Inefficiency in the PushDownPredicates rule affecting complex expressions.
-> SPARK-44662: Dynamic file pruning for joins on non-partition columns.
-> SPARK-45373: Minimizing calls to the Hive Metastore (HMS) layer. The issue affects Hive Metastore-based tables when a query references the same tables repeatedly.
-> SPARK-45866: Reuse of Exchange broken in AQE when runtime filters are pushed down to scan.
-> SPARK-45959: Uncapped tree size in the analysis phase, causing compilation to run into hours.
-> SPARK-46671: Redundant filter creation due to buggy Constraint Propagation rule.
-> SPARK-47609: Cached plan lookup may fail to pick up a valid plan.
-> SPARK-49618: Canonicalization differences in Union may cause failure in re-use of exchange or cached plans.
-> SPARK-49881: Minimizing the cost of DeduplicateRelations in the analyzer.
Functional Issues
The following issues relate to query correctness, stability, and deterministic behavior in advanced Spark usage patterns:
-> SPARK-47320: Self-join inconsistencies and exceptions.
-> SPARK-49727: Data loss when a POJO Dataset is converted into a DataFrame and back.
-> SPARK-49789: Exception when encoding POJOs with generic type fields.
-> SPARK-51016: Incorrect results during retry when the join column is non-deterministic.
-> SPARK-45658: Canonicalization of DynamicPruningSubquery is broken.
-> SPARK-53264: Incorrect nullability when correlated subquery gets converted to Left Outer Join.
-> SPARK-47217: DeduplicateRelations may cause failure in plan resolution.
-> SPARK-51016: Join on intermediate column may give wrong results on retry.
Why Choose TabbyDB?
TabbyDB is built for teams running complex, production-critical Apache Spark workloads—where query compilation time, optimizer behavior, and correctness matter as much as execution speed.
Beyond performance improvements, TabbyDB reflects a deep focus on stability, predictability, and real-world Spark usage, informed by addressing long-standing issues in Spark’s SQL and optimizer layers.
More Than a Faster Engine
->Engine-level enhancements targeting complex query behavior
->Improved reliability for large and deeply nested workloads
->Compatibility with existing Spark APIs, tooling, and workflows
Partner With the Team Behind TabbyDB
If you are encountering functional or performance issues in Apache Spark—particularly within the SQL or optimizer layers—we’re open to collaborating on solutions tailored to your workload or codebase.
Whether it’s diagnosing a bottleneck, validating a fix, or contributing targeted improvements, we’re happy to engage.
Contact Specialist
Speak with our experts to get a solution tailored to your business goals and data needs. We help you plan the right strategy for faster growth and better results.
Frequently asked questions
Find clear answers to common questions about TabbyDB’s capabilities, performance, and integration to help you make the most of our advanced query engine.
TabbyDB is a specialized fork of Apache Spark designed to optimize complex queries. It significantly reduces compilation time and memory usage for queries with nested joins, complex CASE statements, and large query trees through intelligent compile-time and runtime enhancements.
Yes, TabbyDB is built to manage vast and intricate query structures. Its optimizations improve both speed and resource consumption, enabling faster execution of queries that would typically take hours to compile and run.
TabbyDB maintains full compatibility with Apache Spark’s DataFrame APIs, allowing corporate users to continue using their programmatic query methods while benefiting from enhanced performance without changing their existing codebase.