Many organizations use Hadoop and MapReduce to do batch analysis of large data sets. Splice Machine, on the other hand, uses Apache HBase®, which accesses HDFS directly without the overhead of MapReduce. HBase is what enables real-time updates on top of the Hadoop Distributed File System (HDFS).
By powering real-time operational applications, Splice Machine demonstrates that Hadoop is no longer just for batch processing and ad-hoc analytics. Splice Machine delivers the best of both worlds: the robust functionality of an RDBMS with the proven scale-out of Hadoop.
Apache HBase is an open-source distributed database modeled after Google’s BigTable and is one of the key building blocks of Splice Machine. It delivers scalability up to dozens of petabytes and supports automatic sharding, data replication, and real-time updates. HBase provides very fast reads and writes, but it does not provide a SQL interface or key RDBMS functionality, such as joins, secondary indexes, or ACID transactions.
Splice Machine started with Apache Derby™, an ANSI SQL relational database, and replaced its storage layer with HBase. Splice Machine retains the Apache Derby parser, but it redesigned the planner, optimizer, and executor to leverage the distributed HBase computation engine.
Apache Hive, a key component of the Hadoop ecosystem focused on analytics, enables querying and managing large datasets in Hadoop. It can be thought of as a data warehouse that cannot manage real-time operational workloads.
Splice Machine, built on HBase, is designed for real-time updates (in milliseconds) and supports real-time applications that require thousands of concurrent readers and writers.
According to Hortonworks, Hive does not support highly concurrent workloads: “Hive is not meant for low latency updates and deletes […] Multiple writers to same partition of the same [Hive] table will be serialized and wait behind each other.” Splice Machine, on the other hand, does not make readers wait for writers or vice versa, because it uses a Multi-Version Concurrency Control (MVCC) model.
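To illustrate why an MVCC model lets readers and writers proceed without blocking each other, here is a minimal Python sketch. It is a toy model under simplifying assumptions (a single process, every write commits immediately), not Splice Machine's implementation; the class `MVCCStore` and its methods are hypothetical names introduced for this example.

```python
import itertools


class MVCCStore:
    """Toy multi-version store: every write adds a new timestamped version."""

    def __init__(self):
        self._clock = itertools.count(1)  # monotonically increasing timestamps
        self._versions = {}               # key -> list of (commit_ts, value)

    def begin(self):
        """Start a reader and return the timestamp that defines its snapshot."""
        return next(self._clock)

    def write(self, key, value):
        """Commit a new version; older versions remain for older snapshots."""
        commit_ts = next(self._clock)
        self._versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("balance", 100)
snapshot = store.begin()                 # a reader takes its snapshot here
store.write("balance", 250)              # a later write does not disturb it
old_view = store.read("balance", snapshot)
new_view = store.read("balance", store.begin())
print(old_view, new_view)
```

The reader that started before the second write still sees its own consistent version, while a reader that starts afterwards sees the new value. Neither has to wait on a lock held by the writer.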
Cloudera Impala is an open-source massively parallel processing (MPP) SQL query engine for Hadoop. It is similar to Apache Hive in that it is focused on fast analytics and does not support real-time applications, which require highly concurrent workloads, real-time updates, and ACID transactions.
Like Splice Machine, NoSQL databases have highly concurrent, scale-out architectures. However, by design, NoSQL databases lack SQL, so moving an application that currently runs on a SQL database often requires a fundamental rewrite.
Neither MongoDB, a document-oriented data store, nor Cassandra, a key-value data store, supports ACID transactions, joins, or other key database functionality. Unlike Splice Machine, they also lack native Hadoop integration, so businesses are forced to import data from the Hadoop Distributed File System (HDFS) and store it in two different clusters.
For businesses that already have data stored in Hadoop/HBase, there are a variety of ways to access that data in Splice Machine. If the data will be accessed or updated frequently by Splice Machine, it is best to import it directly into Splice Machine. If it is required only infrequently, Splice Machine’s new Virtual Table Interface (VTI) provides direct access to data stored in a variety of external sources, including RDBMSs and Hadoop.
One can stream data into Splice Machine or use a stored procedure for bulk imports. Splice Machine’s bulk import capability is distributed for performance and maintains indexes, constraints and triggers upon import with transactional integrity. With Splice Machine’s compact byte-encoded storage format, customers have seen storage requirements shrink by up to 10x as compared to native HBase implementations of the same data model.
Spark is a cluster processing framework that does not durably store its own data. In a Hadoop cluster, data for Spark will often be stored as HDFS files, which will likely be bulk imported into Splice Machine or streamed in.
Customers will need to install HBase and Apache ZooKeeper™, a distributed coordination tool for Hadoop, as part of the installation process for Splice Machine. Splice Machine is distribution-agnostic, and users can use the streamlined installation processes from Cloudera, Hortonworks, or MapR. Once an HBase cluster is installed, installing Splice Machine is as simple as deploying the Splice Machine jar files to each HBase region server.
As a full-featured Hadoop RDBMS, Splice Machine supports CRUD operations for Creating, Reading, Updating, and Deleting data. Splice Machine expedites deletes by marking the records as deleted immediately, without actually deleting the data. Then, on a periodic basis, we delete the actual records during the compaction process, which is scheduled depending on workload and performance requirements.
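The mark-then-compact approach described above can be sketched in a few lines of Python. This is an illustrative model only, assuming an in-memory table; `TombstoneTable` and its methods are hypothetical names, and real compaction in HBase operates on store files rather than on a dictionary.

```python
class TombstoneTable:
    """Toy table where deletes are cheap markers and compaction purges rows."""

    def __init__(self):
        self._rows = {}            # key -> value, including not-yet-purged rows
        self._tombstones = set()   # keys marked as deleted

    def put(self, key, value):
        self._rows[key] = value
        self._tombstones.discard(key)  # a new write revives the key

    def delete(self, key):
        """Mark the row as deleted without touching the stored data."""
        if key in self._rows:
            self._tombstones.add(key)

    def get(self, key):
        """Marked rows are invisible to readers immediately."""
        if key in self._tombstones:
            return None
        return self._rows.get(key)

    def compact(self):
        """Physically remove marked rows; return how many were purged."""
        purged = len(self._tombstones)
        for key in self._tombstones:
            del self._rows[key]
        self._tombstones.clear()
        return purged

    def stored_row_count(self):
        return len(self._rows)


table = TombstoneTable()
table.put("a", 1)
table.put("b", 2)
table.delete("a")
visible_after_delete = table.get("a")          # hidden right away
stored_before = table.stored_row_count()       # data is still on "disk"
purged = table.compact()
stored_after = table.stored_row_count()
print(visible_after_delete, stored_before, purged, stored_after)
```

The delete itself only adds a marker, which is why it returns quickly; the cost of reclaiming space is deferred to `compact()`, which can be scheduled around the workload.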
Splice Machine provides support for user authentication and supports FIPS-compliant password encryption algorithms, including SHA-512 (default). Splice Machine also supports integration with the LDAP v3 standard, allowing users to be validated against an LDAP-supported directory service.
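As a rough illustration of what SHA-512 password hashing involves, the following Python sketch salts and hashes a password using the standard library. It is an assumption-laden example, not Splice Machine's actual storage format: the functions `hash_password` and `verify_password`, the 16-byte salt, and the single hashing round are all choices made for this sketch.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None):
    """Return (salt, hex digest) for a salted SHA-512 hash of the password."""
    if salt is None:
        salt = os.urandom(16)  # random per-user salt (size is an assumption)
    digest = hashlib.sha512(salt + password.encode("utf-8")).hexdigest()
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the hash and compare it in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, digest = hash_password("s3cret")
print(len(digest))                              # SHA-512 yields 128 hex chars
print(verify_password("s3cret", salt, digest))
print(verify_password("wrong", salt, digest))
```

Only the salt and digest would be stored, never the password itself, and the constant-time comparison avoids leaking how many leading characters matched.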
Splice Machine also supports roles and privileges: users can be assigned roles, and roles can be granted read/update privileges on database objects at both the table and column level.
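The relationship between users, roles, and table-level or column-level privileges can be modeled as below. This is a conceptual Python sketch only; in the database itself these relationships are managed with SQL statements, and the class `AccessControl`, the privilege names, and the sample table and column names are hypothetical.

```python
from collections import defaultdict


class AccessControl:
    """Toy role-based model with table-level and column-level grants."""

    def __init__(self):
        self._user_roles = defaultdict(set)  # user -> role names
        self._grants = defaultdict(set)      # role -> {(privilege, table, column)}

    def assign_role(self, user, role):
        self._user_roles[user].add(role)

    def grant(self, role, privilege, table, column=None):
        """column=None means the privilege covers the whole table."""
        self._grants[role].add((privilege, table, column))

    def is_allowed(self, user, privilege, table, column):
        for role in self._user_roles.get(user, ()):
            grants = self._grants.get(role, ())
            if (privilege, table, None) in grants:
                return True   # table-level grant covers every column
            if (privilege, table, column) in grants:
                return True   # column-level grant covers just this column
        return False


acl = AccessControl()
acl.grant("analyst", "read", "orders")             # table-level
acl.grant("clerk", "update", "orders", "status")   # column-level
acl.assign_role("alice", "analyst")
acl.assign_role("bob", "clerk")

print(acl.is_allowed("alice", "read", "orders", "total"))
print(acl.is_allowed("bob", "update", "orders", "status"))
print(acl.is_allowed("bob", "update", "orders", "total"))
```

Privileges attach to roles rather than directly to users, so changing what a group of users may do means editing one role instead of every account.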
Splice Machine supports ODBC and JDBC connectivity. Our customers have successfully tested several BI tools, such as Informatica, MicroStrategy, and Tableau, with Splice Machine. All ETL tools that support either JDBC or ODBC connectivity can also interoperate with Splice Machine. Our customer Harte Hanks uses ODBC to connect the ETL tool Ab Initio with Splice Machine.
Splice Machine can co-exist with existing HBase and Spark installations.
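Compared to Splice Machine, HBase and Spark are low-level storage and processing engines. HBase has notable shortcomings compared to Splice Machine: it does not provide a SQL interface or key RDBMS functionality such as joins, secondary indexes, or ACID transactions.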
Spark is a cluster processing framework that does not durably store its data. It is not designed to power applications.
By contrast, Splice Machine is a full-featured, transactional RDBMS that leverages HBase and Spark as its storage and processing engines. Consider this analogy: if Splice Machine were a car that you could buy and drive after getting the keys, Spark and HBase would be the engine and transmission that make it work.