Splice Machine Releases Native Spark Integration For AI and IoT Applications

New Native Apache Spark DataSource Now Available

SAN FRANCISCO, Mar. 6, 2018 – Splice Machine, provider of the leading data platform to power intelligent applications, today announced the availability of its native Apache Spark DataSource to simplify and speed up IoT and machine learning applications.

This connector provides a fast, native, ACID-compliant datastore for Spark and also opens up Splice Machine’s underlying Apache Spark engine to advanced capabilities such as Spark SQL, Spark Streaming, and MLlib or R for machine learning. The connector lets data engineers, data scientists, and developers use Spark directly, without excessive data transfers in and out of Splice Machine. Unlike other Spark DataSources, which require data serialization and transfer across JDBC/ODBC connections, Splice Machine’s DataSource is native to Spark.

The connector is now part of Splice Machine’s community edition, with a simple query example and a streaming example also available on GitHub. Apache Zeppelin notebooks with streaming and machine learning examples of the native Spark DataSource are also available on Splice Machine’s Cloud Service.

The Native Spark DataSource enables the following functions:

  • Create Table – create a Splice Machine table from the schema of a Spark DataFrame
  • Insert – insert the rows of a DataFrame into a Splice Machine table
  • Update – update the rows of a Splice Machine table specified by a DataFrame
  • Upsert – update or insert the rows of a Splice Machine table specified by a DataFrame
  • Delete – delete the rows of a Splice Machine table specified by a DataFrame
  • Query – issue a SQL query and return the result set as a DataFrame
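The semantics of these operations can be illustrated with a small, self-contained sketch. The code below is a conceptual model only, not the Splice Machine API: it uses a plain Python dict keyed by primary key as a stand-in for a Splice Machine table, and plain dicts as a stand-in for DataFrame rows, to show how insert, upsert, and delete behave.

```python
# Conceptual sketch (stdlib only, NOT the Splice Machine API): models the
# insert/upsert/delete semantics the Native Spark DataSource applies to a
# table with a primary key.

def upsert(table, rows, key):
    """Update rows whose key already exists in the table; insert the rest."""
    for row in rows:
        table[row[key]] = row

def delete(table, rows, key):
    """Delete the table rows identified by the keys present in `rows`."""
    for row in rows:
        table.pop(row[key], None)

# Stand-in for a Splice Machine table, keyed by primary key `id`
table = {1: {"id": 1, "city": "SF"}, 2: {"id": 2, "city": "NY"}}

# Stand-in for a Spark DataFrame's rows
df_rows = [{"id": 2, "city": "Boston"}, {"id": 3, "city": "Austin"}]

upsert(table, df_rows, key="id")
assert table[2]["city"] == "Boston"   # existing row updated
assert table[3]["city"] == "Austin"   # new row inserted

delete(table, [{"id": 1}], key="id")
assert 1 not in table                 # row removed
```

In the real connector these operations run against the database with full transactional guarantees; the sketch only captures the key-matching behavior.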

The Splice Machine Native Spark DataSource provides other advantages such as:

  • ACID transactions on all CRUD operations – Create-table operations, inserts, updates, upserts, and deletes all commit atomically: either all of the record changes happen or none do. Operations reflect snapshot isolation semantics, and if there is any problem, the entire transaction can be rolled back.
  • CRUD operations automatically preserve all ACID properties on secondary indexes
  • Updates can modify any number of columns simultaneously
  • Queries return lazily evaluated Spark DataFrames, with instructions pipelined through Spark’s RDD structures
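The all-or-nothing commit behavior can be sketched in a few lines. This is a stdlib-only illustration of the idea, not the connector’s implementation: a writer stages a batch on its own snapshot and the shared table is replaced only if every row succeeds, so a failure mid-batch leaves the table untouched.

```python
# Conceptual sketch (stdlib only, NOT the Splice Machine implementation):
# models the all-or-nothing commit that ACID transactions provide for a
# batch of inserts.

def transactional_insert(table, rows, validate):
    """Stage inserts on a private snapshot; commit only if every row is
    valid, otherwise raise and leave `table` unchanged (rollback)."""
    snapshot = dict(table)              # writer works on its own snapshot
    for row in rows:
        if not validate(row):
            raise ValueError("invalid row; transaction rolled back")
        snapshot[row["id"]] = row
    table.clear()
    table.update(snapshot)              # "commit" the whole batch at once

table = {1: {"id": 1, "amount": 10}}
bad_batch = [{"id": 2, "amount": 20}, {"id": 3, "amount": -5}]

try:
    transactional_insert(table, bad_batch, validate=lambda r: r["amount"] >= 0)
except ValueError:
    pass

# Rollback: nothing from the failed batch is visible, not even row 2
assert table == {1: {"id": 1, "amount": 10}}
```

A successful batch, by contrast, makes all of its rows visible in one step.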

Use cases include:

  • High-speed streaming ingestion – Apache Spark Streaming is a popular, easy-to-use library that ingests data from many sources, including files, message queues such as Apache Kafka and Amazon Kinesis, and other inputs such as RDDs and Apache Flume. It processes data as micro-batches of RDDs, and Splice Machine can efficiently insert each micro-batch with a single insert operation. Streaming IoT applications can therefore write data reliably to Splice Machine at high velocity and throughput, and use the data for both operational workloads and analytics. See this blog for an example of streaming weather data into an application.
  • ETL – Spark is a powerful tool for manipulating large data sets to extract, transform, and load data from one application to another. The Splice Machine Native Spark DataSource provides transactional integrity when materializing stages of an ETL pipeline and easy rollback in case of failure. It also enables in-place updates for ETL use cases.
  • Machine Learning and AI applications – The Native Spark DataSource makes Spark’s MLlib, R, and pandas immediately available, as these libraries operate on Spark DataFrames. This lets you build ML pipelines whose feature extraction and transformations run directly in the database on DataFrames. As models are trained, they can be materialized in Splice Machine, operationalizing machine learning. For example, mission-critical applications can retrain models frequently because no extraction and loading is necessary, and model scores can be queried immediately by the application. This capability allows companies to inject AI into mission-critical applications easily and efficiently.
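The streaming-ingestion pattern above can be sketched without Spark at all. The code below is an illustrative stdlib-only model, not the Splice Machine or Spark Streaming API: it groups a stream of events into micro-batches and issues one insert per batch, mirroring how each Spark Streaming micro-batch maps to a single DataSource insert.

```python
# Conceptual sketch (stdlib only, NOT the Spark or Splice Machine API):
# micro-batch ingestion amortizes write cost by issuing one insert per
# batch of events instead of one insert per event.
from itertools import islice

def micro_batches(events, batch_size):
    """Yield successive micro-batches of up to `batch_size` events."""
    it = iter(events)
    while batch := list(islice(it, batch_size)):
        yield batch

class FakeStore:
    """Stand-in for a transactional table; counts insert operations."""
    def __init__(self):
        self.rows = []
        self.insert_calls = 0

    def insert(self, batch):
        self.rows.extend(batch)         # one atomic insert per micro-batch
        self.insert_calls += 1

store = FakeStore()
events = range(1000)                    # e.g. sensor readings off a queue
for batch in micro_batches(events, batch_size=100):
    store.insert(batch)

assert store.insert_calls == 10         # 1000 events -> only 10 inserts
assert len(store.rows) == 1000          # no events lost
```

In production, Spark Streaming forms the micro-batches and the connector performs each insert transactionally; the sketch shows only the batching arithmetic.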

One financial services customer uses the Splice Machine Native Spark DataSource to stream 1B credit card authorizations a day into Splice Machine via Apache Kafka.

The performance of the Native Spark DataSource is more than 10X faster than JDBC for data insertion.

Splice Machine provides the Native Spark DataSource API in Java, Scala, and Python. This new capability breaks new ground in blending the capabilities of traditional relational database management systems (RDBMSs) and data warehouses with the capabilities typically associated with Hadoop-based compute engines like Spark. With Splice Machine, you get both capabilities integrated on one platform.

Other connectors to Spark, such as the Teradata Connector for Hadoop (TDCH) or the Oracle Big Data Connectors, all need to serialize data to and from Spark. With Splice Machine, the data stays on Spark, making the platform much faster.

About Splice Machine

Splice Machine is the new data platform for digital transformation. Unlike other Big Data platforms that provide offline, batch analysis, Splice Machine powers intelligent applications that are woven into the operational workflows of companies. It is a scale-out SQL RDBMS, data warehouse, and machine learning platform in one. Splice Machine is open source and is built upon the popular Apache Hadoop, HBase, and Spark distributed platforms. Companies in financial services, healthcare, retail, manufacturing, and logistics deploy Splice Machine to improve their operational efficiency, eliminate unnecessary costs, and deliver superior service. The Splice Machine database can be deployed on-premises or as a fully managed cloud service.

Splice Machine is a trademark of Splice Machine, Inc. All other trademarks are the property of their respective registered owners. Trademark use is for identification only and does not imply sponsorship, affiliation, or endorsement.