Splice Machine:

Best of Apache Derby, Hadoop and Spark

Splice Machine marries three proven technology stacks: HBase/Hadoop, Spark and Apache Derby.

HBase/Hadoop: Proven, Distributed Computing Infrastructure
HBase and Hadoop have become the leading platforms for distributed computing. HBase uses the Hadoop Distributed File System (HDFS) for reliable and replicated storage. HBase provides auto-sharding and failover technology for scaling database tables across multiple servers.

HBase and Hadoop have been proven to scale to dozens of petabytes on commodity servers and are used by companies such as Facebook, Twitter, Adobe and Salesforce.com. Splice Machine chose HBase and Hadoop because of their proven auto-sharding, replication, and failover technology.
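To make auto-sharding concrete: HBase splits a table into regions, each covering a contiguous range of row keys, and assigns regions to servers. The following is a minimal sketch of that lookup, not HBase's actual implementation; the split points, server names, and function names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical split points: each region owns the keys from its start key
# (inclusive) up to the next region's start key (exclusive).
REGION_START_KEYS = ["", "g", "n", "t"]
# Hypothetical assignment of regions to servers (one server may host several).
REGION_SERVERS = ["server-1", "server-2", "server-3", "server-1"]


def locate_region(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1


def locate_server(row_key: str) -> str:
    """Return the (hypothetical) server currently hosting row_key's region."""
    return REGION_SERVERS[locate_region(row_key)]
```

Because rows are located by key range, a region that grows too large can be split by adding a start key, and failover amounts to reassigning a region to a different server.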

Spark: Powerful, In-Memory Computation Engine
Spark has emerged as a popular in-memory computation engine for Big Data analytics. Spark has very efficient in-memory processing that can spill to disk (instead of dropping the query) if the query processing exceeds available memory.

Spark is also notable for its resilience to node failures, which may occur in a commodity cluster. Many other in-memory technologies drop all queries associated with a failed node, while Spark uses lineage (as opposed to replicating data) to regenerate its in-memory Resilient Distributed Datasets (RDDs) on another node.
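The idea of regenerating data from its derivation history, rather than from a replica, can be sketched in a few lines. This is a toy model and not Spark's API: the `ToyDataset` class and its methods are hypothetical, and it only covers transformations where each output partition depends on exactly one parent partition.

```python
class ToyDataset:
    """A toy stand-in for an RDD: it remembers how it was derived."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # list of partitions, for a root dataset only
        self.parent = parent        # the dataset this one was derived from
        self.transform = transform  # function applied to the parent's partition
        self.cache = {}             # partition index -> rows held "in memory"

    def map(self, fn):
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def compute(self, partition):
        """Return a partition, rebuilding it from its parents if it is missing."""
        if partition in self.cache:
            return self.cache[partition]
        if self.parent is None:
            rows = list(self.source[partition])
        else:
            rows = self.transform(self.parent.compute(partition))
        self.cache[partition] = rows
        return rows

    def lose_partition(self, partition):
        """Simulate a node failure that wipes one cached partition."""
        self.cache.pop(partition, None)


base = ToyDataset(source=[[1, 2, 3], [4, 5, 6]])
result = base.map(lambda x: x * 10).filter(lambda x: x > 20)

before_failure = result.compute(1)
result.lose_partition(1)            # the cached copy is gone
after_failure = result.compute(1)   # rebuilt by replaying map and filter
```

Only the lost partition is recomputed; no second copy of the data had to be kept in memory beforehand.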

Apache Derby: Java-Based, ANSI SQL Database
Splice Machine chose Apache Derby because it is a full-featured ANSI SQL database, lightweight (<3 MB), and Java-based, making it easy to embed into the HBase/Hadoop stack.

Hybrid OLTP/OLAP Architecture

Splice Machine started with Apache Derby, an ANSI SQL Java database, and replaced its storage layer with HBase/Hadoop and its executor with HBase co-processors and Spark Workers. The Splice Machine optimizer automatically evaluates each query and sends it to the right data flow engine:

  • OLTP queries (e.g., small reads and writes, range queries) go to HBase/Hadoop
  • OLAP queries (e.g., large joins or aggregations) go to Spark
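The routing decision described above can be sketched as a simple rule over a query's estimated cost. This is only an illustration under stated assumptions: the threshold value, the `QueryEstimate` fields, and the `choose_engine` function are hypothetical, and the actual Splice Machine cost model is not described here.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the real optimizer's cost model is not described
# in this document.
OLAP_ROW_THRESHOLD = 100_000


@dataclass
class QueryEstimate:
    """Hypothetical summary of what the optimizer knows about a query."""
    estimated_rows_scanned: int
    has_large_join_or_aggregate: bool


def choose_engine(query: QueryEstimate) -> str:
    """Send heavy analytical work to Spark and everything else to HBase."""
    if (query.has_large_join_or_aggregate
            or query.estimated_rows_scanned > OLAP_ROW_THRESHOLD):
        return "spark"
    return "hbase"


point_lookup = QueryEstimate(estimated_rows_scanned=1,
                             has_large_join_or_aggregate=False)
monthly_report = QueryEstimate(estimated_rows_scanned=50_000_000,
                               has_large_join_or_aggregate=True)
```

Here `choose_engine(point_lookup)` selects HBase and `choose_engine(monthly_report)` selects Spark, so the application issues plain SQL and never has to name an engine itself.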

With separate processes and advanced resource management from Hadoop and Spark, Splice Machine can ensure that OLAP queries do not interfere with OLTP queries.

Splice Machine also leverages Spark resource pools to enable custom priority levels for OLAP queries, so that important or urgent queries are not blocked behind massive batch processes that consume all cluster resources.

Distributed, Parallelized Query Execution

Splice Machine embeds HBase, Spark and Apache Derby on each cluster node. Splice Machine uses the Apache Derby parser and has modified the planner, optimizer, and executor to leverage the distributed HBase and Spark computation engines.

The Splice Machine optimizer automatically evaluates each query and routes OLTP queries to the distributed HBase regions and OLAP queries to the distributed Spark workers.

HBase co-processors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each distributed data shard.
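Pushing computation down to each shard can be illustrated with a distributed average: every region reduces its own rows to a small partial result, and only those partials travel to the coordinator. This is a minimal single-process sketch, not the HBase coprocessor API; the function names and sample data are hypothetical.

```python
def region_partial_avg(rows):
    """Runs next to the data: reduce one shard to a (sum, count) pair."""
    total = 0.0
    count = 0
    for value in rows:
        total += value
        count += 1
    return total, count


def merge_partials(partials):
    """Runs on the coordinator: combine per-shard pairs into one average."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None


# Hypothetical contents of three regions of one table column.
regions = [[10.0, 20.0], [30.0], []]

# In a real cluster each call would execute in parallel on a different server.
partials = [region_partial_avg(rows) for rows in regions]
average = merge_partials(partials)
```

The coordinator receives one pair of numbers per region regardless of how many rows each region holds, which is what makes the approach scale with the number of shards.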

Splice Machine accelerates generation of Spark RDDs by reading HBase HFiles in HDFS and augmenting them with any changes in the MemStore that have not yet been flushed to HFiles. Splice Machine then uses the RDDs and Spark operators to distribute processing across Spark Workers.
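Combining flushed files with unflushed changes amounts to overlaying the newer in-memory edits on top of the older on-disk rows. The sketch below shows that overlay with plain dictionaries; it is an assumption-laden illustration rather than Splice Machine's code, and the `TOMBSTONE` marker, the `merged_view` function, and the sample rows are hypothetical.

```python
# Hypothetical marker for a row that was deleted after the last flush.
TOMBSTONE = object()


def merged_view(hfile_rows: dict, memstore_changes: dict) -> dict:
    """Overlay unflushed changes on flushed rows; the newer change wins."""
    view = dict(hfile_rows)  # copy so the flushed data stays untouched
    for key, value in memstore_changes.items():
        if value is TOMBSTONE:
            view.pop(key, None)   # deleted since the last flush
        else:
            view[key] = value     # inserted or updated since the last flush
    return view


flushed = {"r1": "a", "r2": "b", "r3": "c"}
unflushed = {"r2": "b2", "r3": TOMBSTONE, "r4": "d"}
current = merged_view(flushed, unflushed)
```

The merged result reflects the update to `r2`, the deletion of `r3`, and the new row `r4`, so an analytical read sees current data without waiting for a flush.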

Compatible with Standard Hadoop Distributions

Splice Machine can be used with any standard Hadoop distribution.
Supported Hadoop distributions include Cloudera, MapR and Hortonworks.