Getting Started with Splice Machine: A Newbie’s Perspective
Guest Blog by Sergio Ferragut, Senior Sales Engineer, Splice Machine
I recently joined Splice Machine as a Senior Sales Engineer and, as part of my getting started with Splice Machine data platform, I needed to perform basic tasks such as creating the cluster and loading data into it. This blog outlines the perspective of a user who is using the Splice Machine platform for the first time; we’ll also walk through the process of:
- Creating a cluster from scratch
- Setting up an S3 bucket to load data
- Loading the S3 bucket into the database
- Querying the data and creating visualizations
First, you will need to refer to the AWS S3 access key and secret key that we’re given to you when you created your AWS account.
Creating a Cluster using the Splice Machine Cloud Manager
Log in at: https://cloud.splicemachine.io
Click on “+ Create New Cluster”
Give it a name, select the size of internal storage, external storage, OLTP units (a.k.a HBASE regions), OLAP units (a.k.a. Spark executors).
For a learning system, 2 TB of internal data and the minimum OLTP/ OLAP settings are a good start.
In the second screen setup the initial database user and password. It’s advisable to not use special characters (@,!,#,$,%, etc.) in the password.
Check “S3 connectivity” under “IAM S3 Access” and paste the access and secret keys in the corresponding fields. If you don’t have these credentials, you can get them now by creating an S3 bucket and IAM policy following the instructions here.
Be sure to change the <bucket_name> and <prefix> in the policy JSON provided in the instructions to your S3 bucket name and base access folder respectively.
If you already have a cluster and the S3 credentials were not configured, you can click on ‘Reconfigure‘ option of the cluster home screen or when initially creating your cluster by specifying the S3 access parameter in the setup page. Setting up S3 access will allow imports of any file into the S3 bucket without having to add the credentials to each import command. This is highly recommended.
When you are ready, Check the “Confirmation” checkbox and then click “Launch“.
About 15 minutes later, you’ll receive an email that provides you with links to your cluster along with JDBC URL. You can connect to the cluster directly and use a Notebook to interact with the cluster.
Back in the Splice Machine dashboard, click on the cluster you just created. If you’ve received the email, the cluster will have an active status. Click on the name of the cluster you want to open:
This will open the cluster dashboard page showing the overall system utilization charts. At the top, you’ll see the buttons giving you access to the DB Console, Notebooks and Spark Console pages.
At the bottom of the page, you’ll find useful links to download JDBC and ODBC drivers, the Sqlshell client and the JDBC URL which you can use to connect to this cluster from any JDBC compliant query tool (like DBVisualizer).
Now you are ready to load some data. Click on the Notebook button at the top to bring up the Zeppelin notebooks view:
Click the Login button in the top right and use the database credentials you specified when you created this cluster. Under Notebook, you’ll see a list of folders which contain some great tutorials to help you get familiar with Splice Machine functionality.
Now you are ready to create a new note by clicking on “Create new note“. If you want to create it in a folder use “/” in the name of the note as in “My Tests/First Data Load“.
A blank notebook is presented with a first paragraph. Each paragraph’s first line indicates its interpreter, let’s use “%splicemachine” to start writing some SQL code:
Now let’s create a table to store the data that we are importing from S3. You’ll need to have at least the columns that are in the imported data file. If you haven’t placed any files in your S3 bucket, now is a good time to do that.
Hit the “Play” button to execute the paragraph. The status of the text next to the play button will change to “RUNNING” and finally to “FINISHED” when it is executed.
Now let’s import the data.
With credentials configured on the cluster, an import from a %splicemachine paragraph looks like:
CALL SYSCS_UTIL.IMPORT_DATA ( ‘<target_schema>’, ‘<target_table>’,
null, — column list
‘s3a://<bucket_name>/<path to file>’,– file to import
‘,’, — field delimiter
‘”‘, — string delimiter
null, — timestamp format – default yyyy-MM-dd HH:ss:ss
null, — date format – default yyyy-MM-dd
null, — time format – default HH:mm:ss
-1, — # bad records allowed before failing job
‘s3a://<bucket_name>/<bad data folder path>’, –path to store rejected rows
true, — one line records
‘UTF-8’ — encoding
All the null parameters imply that the defaults are being used. In order to use explicit credentials, the S3 URIs use the format: ‘s3a://(access key):(secret key)@<bucket_name>/<path to data file>‘
So here we go, to add another paragraph into the Notebook, just hover over the bottom edge of a paragraph and you’ll see:
Click on “+Add Paragraph” and you will get another paragraph to work on. I’ve separated the CREATE from the IMPORT paragraphs mainly so I can rerun the import as many times as I want. The first paragraph will not run the second time because it will fail on CREATE SCHEMA, if the schema already exists.
Our second paragraph:
The output of each command is rendered at the bottom of the paragraph, in this case:
As you can see, we have imported 150,000 rows into the table and no failed rows. If there were any failed rows, the failedLog column would contain the name of the file in the bad records path (as specified on the IMPORT_DATA call) that contains the failed rows and the reason for those failures.
Finally, let’s have a look at the data and use some visualization features.
A bar chart by nation, stacking by market segment:
As you can see, with Splice Machine data platform, it’s easy to create a cluster, load data and do in-depth data analysis and data science – even if you are a newbie.
Fast-Track Your Journey to Real-Time AI:
Try Splice Machine
Learn more, access our training notebooks, and download a Splice Machine Docker instance.