Spark Queries taking minutes - or hours - to run?

Hitting out-of-memory errors before execution even starts?

TabbyDB by KwikQuery is a high-performance fork of Apache Spark that eliminates query planning bottlenecks and dramatically accelerates complex SQL workloads.

The Real Bottleneck in Apache Spark!

The Solution: KwikQuery's TabbyDB

Evaluate TabbyDB on Real Workloads

Explore how TabbyDB behaves on real, complex queries using interactive Zeppelin notebooks. Each notebook runs the same SQL and data on both stock Apache Spark and KwikQuery’s TabbyDB, allowing you to directly observe differences in planning behavior and execution flow.

What to Expect
->Identical queries executed in both environments
->Visibility into query planning and execution stages
->Representative workloads that reflect complex, real-world usage
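
One way to make the split between planning and execution visible in your own comparison is to time the two phases separately. The sketch below is a minimal, engine-agnostic harness. `time_phases` and the stand-in workload are hypothetical names used for illustration and are not part of TabbyDB or Apache Spark. In a notebook, `plan` would wrap whatever step forces query planning and `execute` whatever action runs the query.

```python
import time
from typing import Callable, Dict, TypeVar

P = TypeVar("P")


def time_phases(plan: Callable[[], P], execute: Callable[[P], object]) -> Dict[str, float]:
    """Time a planning step and an execution step separately, in seconds."""
    t0 = time.perf_counter()
    planned = plan()          # stand-in for building and optimizing a query plan
    t1 = time.perf_counter()
    execute(planned)          # stand-in for running the planned query
    t2 = time.perf_counter()
    return {"planning_s": t1 - t0, "execution_s": t2 - t1}


# Stand-in workload (hypothetical): "planning" builds a list, "execution" sums it.
timings = time_phases(lambda: list(range(100_000)), lambda planned: sum(planned))
print(timings)
```

Running the same harness against both environments with identical inputs gives two pairs of numbers that can be compared phase by phase.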

Benchmark Results: Execution-Time Focus

TPC-DS Context
TPC-DS is a widely used analytical benchmark, but it does not fully represent the complexity of many production Spark workloads. Query structure is constrained, nesting depth is limited, and compilation overhead is typically not a dominant factor.

In limited TPC-DS testing on 1 TB and 2 TB datasets, using Spark with Hive external, non-partitioned tables, TabbyDB demonstrated:

->~13% reduction in query execution time compared to stock Spark
->Improvements observed consistently across queries, rather than driven by isolated outliers

These results primarily reflect runtime execution behavior, as TPC-DS queries do not meaningfully stress query compilation.
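
To check whether an overall reduction is consistent across queries or driven by a few outliers, it can help to compare the aggregate reduction with the per-query median. The sketch below uses made-up timings purely for illustration. They are not measured TabbyDB or Spark results, and `reduction_pct` is a hypothetical helper.

```python
from statistics import median


def reduction_pct(baseline_s: float, candidate_s: float) -> float:
    """Percentage by which candidate_s is lower than baseline_s."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline_s - candidate_s) / baseline_s


# Made-up per-query timings in seconds (illustrative only, not benchmark data).
baseline = [100.0, 100.0, 100.0, 100.0]
consistent = [90.0, 90.0, 90.0, 90.0]      # every query improves a little
outlier_led = [100.0, 100.0, 100.0, 60.0]  # one query carries the whole gain

for label, candidate in (("consistent", consistent), ("outlier-led", outlier_led)):
    total = reduction_pct(sum(baseline), sum(candidate))
    per_query = [reduction_pct(b, c) for b, c in zip(baseline, candidate)]
    print(label, "total:", total, "median per query:", median(per_query))
```

Both made-up runs show the same aggregate reduction, but only the first has a per-query median that matches it. The median is what separates a consistent improvement from an outlier-driven one.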

Observed Behavior: Complex Analytic Queries

In production analytics environments—particularly those involving:
->Programmatically generated SQL
->Iterative DataFrame transformations
->Deeply layered views or very large logical plans
query behavior differs significantly from benchmark workloads.
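
A toy model shows why iterative transformations can produce very large plans. It assumes each step defines a new column that references the previous column twice, and that the steps end up inlined into a single expression. The nested-tuple trees and the `inline_steps` helper are hypothetical and do not reflect how Spark represents plans internally.

```python
def node_count(expr) -> int:
    """Count nodes in a nested-tuple expression tree such as ("+", left, right)."""
    if isinstance(expr, tuple):
        return 1 + sum(node_count(child) for child in expr[1:])
    return 1  # leaf: a column name or a literal


def inline_steps(steps: int):
    """Inline `steps` derived columns, each defined as prev + prev * 2."""
    expr = "x"
    for _ in range(steps):
        expr = ("+", expr, ("*", expr, 2))
    return expr


sizes = [node_count(inline_steps(n)) for n in range(6)]
print(sizes)  # [1, 5, 13, 29, 61, 125]
```

In this model the tree roughly doubles with every step, so a few dozen iterations already yield a tree far too large to traverse repeatedly during analysis and optimization.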

In such scenarios, stock Spark has been observed to:
->Spend extended periods in query compilation, ranging from tens of minutes to multiple hours
->Encounter planner memory pressure or compilation failures before execution begins

When evaluated under similar workload characteristics, TabbyDB has been observed to:
->Reduce compilation time to practical ranges (minutes or seconds), depending on query structure and environment
->Allow queries to reliably reach execution where compilation previously dominated overall latency

How to Interpret These Results

->Benchmarks illustrate execution-time improvements under controlled conditions
->Real-world workloads highlight differences in compilation behavior that benchmarks do not capture
->Actual results will vary based on query complexity, data layout, and environment

TPC-DS Run Benchmark Comparison

Resolved Issues and Enhancements in TabbyDB

TabbyDB incorporates targeted fixes and improvements addressing well-documented Apache Spark issues that impact query compilation, runtime performance, and correctness—particularly in complex analytical workloads.

These enhancements improve stability, predictability, and performance while remaining compatible with existing Spark APIs and behavior.

Performance Issues

The following issues are associated with excessive compilation time, inefficient optimization behavior, or suboptimal runtime execution, especially in large or complex query plans:

-> SPARK-33152: Constraint Propagation rule causing query compilation times to run into hours.

-> SPARK-36786: Inefficiency in PushDownPredicates rule affecting complex expressions.

-> SPARK-44662: Dynamic file pruning for joins on non-partition columns.

-> SPARK-45373: Minimizing calls to the HMS layer. The issue affects Hive metastore based tables when a query references the same tables repeatedly.

-> SPARK-45866: Reuse of Exchange broken in AQE when runtime filters are pushed down to scan.

-> SPARK-45959: Uncapped tree size in the analysis phase, causing compilation to run into hours.

-> SPARK-46671: Redundant filter creation due to buggy Constraint Propagation rule.

-> SPARK-47609: Cached Plan lookup may miss picking valid plan.

-> SPARK-49618: Canonicalization differences in Union may cause failure in re-use of exchange or cached plans.

-> SPARK-49881: Minimizing the cost of DeduplicateRelations in the analyzer.
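
Several of the items above concern reuse of cached plans and exchanges, which depends on recognizing that two plans are equivalent. The sketch below is a toy illustration of that idea, not Spark's canonicalization logic. Plans are hypothetical nested tuples, attributes are strings such as `amount#7`, and `canonicalize` replaces generated IDs with position-based ones so that equivalent plans produce the same cache key.

```python
def canonicalize(plan):
    """Replace generated attribute IDs ("name#id") with IDs based on first appearance."""
    ids = {}

    def walk(node):
        if isinstance(node, tuple):
            return tuple(walk(child) for child in node)
        if isinstance(node, str) and "#" in node:
            return "#" + str(ids.setdefault(node, len(ids)))
        return node  # operator names, table names, literals

    return walk(plan)


# Two hypothetical plans that differ only in their generated attribute IDs.
plan_a = ("Filter", (">", "amount#7", 100), ("Scan", "sales", "amount#7"))
plan_b = ("Filter", (">", "amount#42", 100), ("Scan", "sales", "amount#42"))

raw_cache = {plan_a: "cached result"}
canonical_cache = {canonicalize(plan_a): "cached result"}

print(raw_cache.get(plan_b))                      # None: lookup on raw plans misses
print(canonical_cache.get(canonicalize(plan_b)))  # cached result
```

If canonicalization treats two equivalent plans differently, the lookup misses in the same way the raw lookup does here, and the work is repeated instead of reused.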

Functional Issues

The following issues relate to query correctness, stability, and deterministic behavior in advanced Spark usage patterns:

-> SPARK-47320: Self join inconsistencies and exceptions.

-> SPARK-49727: Data Loss issue when POJO Dataset is converted into DataFrame and back.

-> SPARK-49789: Exception in encoding POJOs with generic type fields.

-> SPARK-51016: Incorrect results during retry when joining column is indeterministic.

-> SPARK-45658: Canonicalization of DynamicPruningSubquery is broken.

-> SPARK-53264: Incorrect nullability when correlated subquery gets converted to Left Outer Join.

-> SPARK-47217: DeduplicateRelations may cause failure in plan resolution.

-> SPARK-51016: Join on intermediate column may give wrong results on retry.

Why Choose TabbyDB?

TabbyDB is built for teams running complex, production-critical Apache Spark workloads—where query compilation time, optimizer behavior, and correctness matter as much as execution speed.

Beyond performance improvements, TabbyDB reflects a deep focus on stability, predictability, and real-world Spark usage, informed by addressing long-standing issues in Spark’s SQL and optimizer layers.

Contact Specialist

Speak with our experts to get a solution tailored to your business goals and data needs. We help you plan the right strategy for faster growth and better results.

Frequently asked questions

Find clear answers to common questions about TabbyDB’s capabilities, performance, and integration to help you make the most of our advanced query engine.

What makes TabbyDB different from standard Apache Spark?

TabbyDB is a specialized fork of Apache Spark designed to optimize complex queries. Through compile-time and runtime enhancements, it significantly reduces compilation time and memory usage for queries with nested joins, complex CASE statements, and large query trees.

Can TabbyDB handle extremely large and complex query trees efficiently?

Yes, TabbyDB is built to manage very large and intricate query structures. Its optimizations reduce both compilation time and resource consumption, so queries that would typically take hours to compile and run complete faster.

Is TabbyDB compatible with existing Spark DataFrame APIs?

TabbyDB maintains full compatibility with Apache Spark’s DataFrame APIs, so teams can keep building queries programmatically and benefit from the improved performance without changing their existing codebase.

Start your free trial

Click the button to access the free download page.

The Real Bottleneck in Apache Spark!

Root Cause

Why Tuning Fails

Workarounds Backfire

Impact

The Solution: KwikQuery's TabbyDB

Intelligent Compile-Time Optimizations

Advanced Broadcast Hash Join Handling

Improved Cache Lookup Efficiency

Scalable Query Tree Management

Seamless Spark Compatibility

Evaluate TabbyDB on Real Workloads

Benchmark Results: Execution-Time Focus

Observed Behavior: Complex Analytic Queries

How to Interpret These Results

TPCDS Run Benchmark Comparison

Resolved Issues and Enhancements in TabbyDB

Performance Issues

Functional Issues

Why Choose TabbyDB?

More Than a Faster Engine

Partner With the Team Behind TabbyDB

Contact Specialist

Frequently asked questions

Start your free trial
