10 Tips for Importing Data with Splice Machine
February 13, 2017
By Erin Driggers
I’ve imported my share of data and have had some frustrating experiences importing data into Splice Machine and other databases. To help you avoid some of the frustration I’ve felt, I’d like to share several data importing tips I’ve acquired over time.
Tip 1 – File location
First, you must be aware of the difference between the fileOrDirectoryName parameter in the import command, depending on whether you are running an import in a local stand alone environment or a clustered environment. If you are running a stand alone environment, the name or path will be to a file or directory on the filesystem. For example, /users/erin/mydata/mytable.csv. However, if you are running this on a cluster, the path is to a file on HDFS (or the MapR Filesystem).
Tip 2 – Files & directory structure matter
When you are importing data into Splice Machine, you can choose to specify a single file or a directory that contains many files. If you have a lot of data (100s of millions / billions of records), you will be tempted to create one massive file with all the records and import that file. Don’t do it. This is what I recommend:
- Break the file into smaller files. You want the files to be about 40 Gigs or so and have about ~50 million records (depending on how wide your table is).
- If you have a lot of files, create multiple directories and load each directory one at a time. For example, here is a structure I create:
Tip 3 – Don’t compress your file
Up until Splice Machine 2.0, we always recommended that you compress (gzip) your files because it performed best that way. Now that we’ve moved to the Apache Spark/HBase architecture, this is no longer the case. If your file is not compressed, Splice Machine takes the contents of that file and distributes it across all the nodes in the cluster and processes the portions in parallel.
Tip 4 – Choose special characters as your delimiters
Column and Character delimiters will drive you crazy. You will spend hours searching for some ‘bad’ data only to find that you have a column like a product description that has your column or character delimiter in it. The standard column delimiter is a comma (,) but we all know that if you have string data, chances are very high that you will have a comma in it. Some people get clever and use a pipe (|) as their column delimiter – and invariably that character is in the description as well. Instead use a control character like Control-A (or SOH character or 0x01 in hex) as your column delimiter. The trick here is that in Splice Machine, you cannot use the splice command prompt to type in that character as your delimiter, you will need to create a script file and type that character using an editor like vi or vim
You can specify control characters as delimiters as follows:
- Open your script file in vi or vim
- Get into insert mode
- Type CTRL-V, CTRL-A
Next is the character delimiter. By default in Splice Machine, the column delimiter is a double quote. But in a lot of cases that does not work well – perhaps you have a product description that has the symbol for inches (a double quote), or perhaps a double quote is part of the data. Instead of trying to figure out how to escape those values, use a special character for your delimiter. I like to use Control-G. You will need to use vi or vim like described above to insert it into your script.
Tip 5 – Date / Timestamp Format
Dates, times and timestamps will be the data types that will more than likely cause problems.
- Splice Machine follows the java SimpleDateFormat syntax (https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html) for any date, time or timestamp format. In Splice Machine, case matters. For example, a lower case ‘h’ is an hour represented between 0-12, an upper case ‘H’ is an hour represented in a 24 hour range 0-23.
- Splice Machine can handle one format each for the date, time and timestamp columns in your table. If you have a table that has 3 columns that have a datatype of DATE and 3 columns that are a datatype of TIME, then the 3 DATE columns need to be in the same format and the 3 TIME columns need to be in the same format.
- The Timestamp data type has a range of 1678-01-01 to 2261-12-31. I’ve seen where people put dummy timestamps in the source system like 9999-01-01 – timestamps in that format will fail. This is not an issue with the DATE data type.
- If you have any date or timestamp that is not in the format yyyy-MM-dd HH:mm:ss, I suggest you create a simple table that has just one or two columns and test importing the format – confirm that the data imported is what you expected
Tip 6 – Change the Bad directory for each table / group
When you are loading a lot of data, take advantage of the bad record directory parameter. Change it for each table or group, for example, /BAD/mytable1 or /BAD/mytable1/group1 – It will be easier to locate the files.
Tip 7 – Multi-line characters in your data
If your data contains line feed characters like control-m, you will need to set the property oneLineRecords to false. The downside to this is that the import will take longer because the import cannot take advantage of breaking up the file and distributing it across the cluster. Ideally, avoid having line feed characters in your data.
Tip 8 – Clobs / Blobs
When you are importing Clobs, please see Tip 4 and Tip 7. Make sure to use the special characters as both the column and character delimiters. If the clob data can span multiple lines, make sure to set the oneLineRecords to false.
At the present times, columns with a data type of ‘Blob’ cannot be imported using the import tool. One option I’ve seen is to create a VTI that reads the Blob and inserts the data.
Tip 9 – Scripting
Make your life easier and create scripts for importing the data – don’t sit and babysit the imports. There are a couple of pieces to scripting the import. First, create some .sql files that have your import statements. For example, you may have a file called load-mytable1.sql with entries like the following:
call SYSCS_UTIL.IMPORT_DATA ('SPLICE','mytable1',null,'/data/mytable1/data.csv',null,null,null,null,null,0,'/BAD/mytable1',null,null); call SYSCS_UTIL.IMPORT_DATA ('SPLICE','mytable2',null,'/data/mytable2/data.csv',null,null,null,null,null,0,'/BAD/mytable2',null,null); call SYSCS_UTIL.IMPORT_DATA ('SPLICE','mytable3',null,'/data/mytable3/data.csv',null,null,null,null,null,0,'/BAD/mytable3',null,null);
Don’t get overzealous and put a bunch of statements in one file or have one file that is going to load billions of records. Why? Because when you use a script file like load-mytable1.sql, it uses one connection to Splice Machine. If one of the statements fails and causes Splice Machine to disconnect, none of the subsequent statements will be executed because there is not a current connection – you will be sad when you kick off the script at night and expect it to be completed in the morning and it failed.
Create a script file to run multiple imports, something like the following (loadMyTable1.sh):
sqlshell.sh -f /home/splice/importMytable1Group1.sql > runMyTable1Group1.out sqlshell.sh -f /home/splice/importMytable1Group2.sql > runMyTable1Group2.out sqlshell.sh -f /home/splice/importMytable1Group3.sql > runMyTable1Group3.out
When you run that script, run it in the background:
nohup ./loadMyTable1.sh > loadMyTable1.out &
Finally, don’t run a large import directly on the Splice command prompt or through a SQL client. Use a script instead and run it with nohup in the background like the example above. At some point, you will lose your connection to the server through some network issue and you won’t see the file status of the export.
Tip 10 – Start Small
The last thing I can suggest, is to start small and run several tests before you kick off a large import. Before I start a large import, I normally take the first 5 records from the file and try to import it into the table. Once, I confirm those file records import okay, I drop the table and recreate it. You are probably wondering why I say drop it and recreate it instead of deleting the records or truncating the table. Well, if you run an import multiple times and it is failing, you technically have data in those tables that are hidden and marked with a specific transaction. It will eventually get cleaned up through compaction but I’ve seen it affect the load performance. Save yourself the headache – drop it and recreate it. I would recommend you do that for large imports that failed as well.
I hope these tips help get you moving in the right direction and make importing data into Splice Machine as smooth as possible. If you have any questions, join our community Slack channel to discuss them with me and the rest of the Splice Machine community.