What’s New in Splice Machine 3.0
January 15, 2020
Splice Machine is part of a relatively new segment of the database management systems (DBMS) market where transactional and analytical functionality converge. In July 2013, Gartner used the term HTAP, or Hybrid Transaction Analytical Processing, to describe this emerging market segment. Forrester calls it Transalytics. Today, Splice Machine is announcing the release of the latest version of its platform, Splice Machine 3.0. This release is a testament to how far we have come since the analysts coined those phrases. Enterprises can use Splice Machine to power their mission-critical operational applications, as a decision support system for management reporting, and to build machine learning models, all on a single integrated platform.
The benefits of using an integrated platform are:
- Elastic scalability
- Significantly lower development and licensing costs
- Minimal ETL latency
- Faster time to market for AI/ML
- Ability to lengthen the life of legacy applications by modernizing them
In version 3.0 we have taken major steps toward adding the features and functionality that enterprises need to modernize their mission-critical applications. With exhaustive SQL support for both transactional and analytical workloads, built-in machine learning and artificial intelligence capabilities, and a unified deployment experience on-premises and in the cloud, Splice Machine 3.0 is uniquely positioned as the database for application modernization.
Splice Machine 3.0 offers enterprise-grade functionality for application developers, DevOps engineers and DBAs, analysts, data engineers, and data scientists. In version 3.0, DBAs can allocate analytical resources by role using analytical resource queues so that specific users' workloads receive sufficient resources. This helps ensure performance for high-priority workloads and makes it possible to divide computing resources across different business organizations. With active-passive failover, Splice Machine 3.0 safeguards business data and supports business continuity and high availability in the face of natural disasters, infrastructure failure, and user error. Splice Machine 3.0 also extends its security options with the ability to redact sensitive business data.
Splice Machine 3.0’s standard ODBC/JDBC interfaces enable analysts to integrate with business intelligence and data visualization tools of their choice and provide access to the notebooks created by data scientists to perform what-if analysis. Version 3.0 offers our ML Manager 2.0 which enables an end-to-end data science workflow with unprecedented ease of operationalizing ML, new native JupyterLab notebooks, and new features to track ML experiments.
Splice Machine 3.0 significantly lowers the burden on IT organizations of managing operational, analytical, and ML/AI workloads, because a single platform supports both analytical queries for historical reporting and operational, system-of-record use cases. This greatly reduces the implementation time and the number of data engineers needed to maintain the architecture. For an organization that plans to migrate from a proprietary database, Splice Machine 3.0 offers a seamless replacement for legacy databases through an exhaustive SQL implementation, including support for some proprietary SQL extensions. Splice Machine 3.0 also offers infrastructure-agnostic deployment, which gives the flexibility to deploy on-premises or in the cloud.
Splice Machine 3.0 includes major improvements in numerous functional areas including:
- Workload Management
- SQL coverage
- Replication and HA
- Data science productivity
- Kubernetes support
Let us explore each of these in detail.
Workload Management
Application Server Queues: This feature supports the use of multiple OLAP (online analytical processing) queues that allow users to reserve cluster capacity for specific queries, track resources consumed by each server/role, and manage resource capacity for specific kinds of queries and organizations. This functionality allows users to isolate workloads from each other to ensure adequate resources are available even when multiple resource intensive queries are running simultaneously.
Figure: Splice Machine 3.0 has introduced Application Server Queues (ASQs) to provide workload management and isolation.
Figure: Fair Scheduler gives all applications on the cluster an equal share of cluster resources. Capacity Scheduler specifies a minimum or maximum amount of capacity for a user.
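The difference between the two policies in the caption can be sketched in a few lines of Python. This is only an illustration of the idea, not Splice Machine's or any scheduler's actual API: the function names, queue names, and core counts are all hypothetical, and a real scheduler also handles preemption, oversubscription, and changing demand.

```python
def fair_shares(total_cores, apps):
    """Fair policy: every running application gets an equal share."""
    share = total_cores / len(apps)
    return {app: share for app in apps}


def capacity_shares(demands, limits):
    """Capacity policy: each queue gets its demand, clamped to its limits.

    demands: queue name -> cores requested
    limits:  queue name -> (min_cores, max_cores), hypothetical per-queue settings
    """
    shares = {}
    for queue, demand in demands.items():
        min_cores, max_cores = limits[queue]
        # Reserve at least the minimum, never hand out more than the maximum.
        shares[queue] = min(max(demand, min_cores), max_cores)
    return shares


# Hypothetical example: a 100-core cluster.
fair = fair_shares(100, ["etl", "reporting", "adhoc", "ml"])
capped = capacity_shares(
    demands={"etl": 90, "reporting": 10},
    limits={"etl": (20, 50), "reporting": (30, 80)},
)
```

Under the fair policy each of the four applications is offered a quarter of the cluster. Under the capacity policy the greedy `etl` queue is capped at its maximum, while the `reporting` queue keeps its reserved minimum even though it currently asks for less.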
SQL Coverage
Support for DB2-specific SQL syntax: Splice Machine 3.0 now supports many DB2-specific extensions that make it easy to migrate from DB2 with minimal SQL rewriting. Examples include support for DB2 trigger syntax, error codes, and text manipulation syntax.
Full Outer Join support: FULL OUTER is a join option present in some SQL dialects. In previous versions of Splice Machine it could only be achieved through query rewrites; Splice Machine 3.0 supports it directly. This eliminates the need to rewrite queries written against a legacy database that uses this syntax.
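To illustrate the kind of rewrite that direct FULL OUTER JOIN support makes unnecessary, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in database. The `customers` and `orders` tables are hypothetical, and the rewrite shown (a LEFT JOIN plus the unmatched right-hand rows) is one common approach, not necessarily the one any particular migration used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (2, 'disk'), (3, 'tape');
""")

# The direct form a legacy query would typically use
# (shown for reference only, not executed in this sketch).
FULL_OUTER = """
    SELECT c.id, c.name, o.customer_id, o.item
    FROM customers c FULL OUTER JOIN orders o ON c.id = o.customer_id
"""

# Equivalent rewrite: the LEFT JOIN keeps every customer, and the second
# branch adds the orders that have no matching customer.
REWRITE = """
    SELECT c.id, c.name, o.customer_id, o.item
    FROM customers c LEFT JOIN orders o ON c.id = o.customer_id
    UNION ALL
    SELECT NULL, NULL, o.customer_id, o.item
    FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)
"""

# Sort so that rows without a customer come last.
rows = sorted(
    conn.execute(REWRITE).fetchall(),
    key=lambda r: (r[0] is None, r[0] or 0, r[2] or 0),
)
```

The result contains the customer with no orders, the matched pair, and the order with no customer, which is exactly what the single FULL OUTER JOIN statement expresses without the extra branch.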
Time Travel – Point in time queries: Splice Machine 3.0 supports a powerful new SQL syntax extension that allows database users to query the database as it existed at some time in the past. This functionality is very useful in a wide variety of scenarios. It can be used to support flexibility when working with slowly changing dimensions, support various data auditing scenarios, understand changes made by users that may need to be unwound, reproduce historical reports, and analyze trends over time.
Figure: Point in time queries or time travel queries allow database users to query data as it existed at some point in the past.
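The idea behind a point-in-time query can be sketched with a small in-memory model. This is a conceptual illustration only: the `VersionedTable` class, its methods, and the integer timestamps are hypothetical and say nothing about Splice Machine's SQL syntax or storage internals.

```python
import bisect


class VersionedTable:
    """Keeps every committed version of a row so reads can target a past time."""

    def __init__(self):
        # key -> list of (commit_ts, value), kept in ascending commit_ts order
        self._versions = {}

    def put(self, key, value, commit_ts):
        history = self._versions.setdefault(key, [])
        if history and commit_ts <= history[-1][0]:
            raise ValueError("commit timestamps must increase per key")
        history.append((commit_ts, value))

    def get(self, key, as_of=None):
        history = self._versions.get(key, [])
        if as_of is None:
            # Ordinary read: return the latest committed version.
            return history[-1][1] if history else None
        # Count the versions committed at or before as_of; the last of
        # those is the one that was visible at that time.
        idx = bisect.bisect_right([ts for ts, _ in history], as_of)
        return history[idx - 1][1] if idx else None


# Hypothetical example: a price that changed at timestamp 200.
prices = VersionedTable()
prices.put("widget", 10, commit_ts=100)
prices.put("widget", 12, commit_ts=200)

current = prices.get("widget")
earlier = prices.get("widget", as_of=150)
before_any = prices.get("widget", as_of=50)
```

Because old versions are retained rather than overwritten, reproducing a historical report or auditing a change amounts to reading with an earlier `as_of` value.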
Enhanced trigger support: Splice Machine 3.0 adds new trigger options that give more flexibility in both the events that can fire a trigger and the actions a trigger can take.
Replication and HA
Active-Passive replication: In this release, Splice Machine supports the ability to stand up multiple DB clusters that are automatically kept in sync via active-passive replication to achieve rigorous recovery point objectives (RPOs) and recovery time objectives (RTOs).
Figure: Splice Machine DB 3.0 includes asynchronous, active/passive replication
Security
Schema Access Restrictions: This feature restricts access to objects belonging to a specified schema so that users without the appropriate administrative privileges cannot view or access the objects, or even discover that they exist.
Figure: A user must be specifically granted access to the sys schema or the schema restrict configuration must be disabled.
Customized pattern matching for log redaction: This feature allows users to use patterns defined with regular expressions to redact sensitive information from system logs.
Figure: Logs can be configured to automatically recognize and redact sensitive information through the definition of “masks”
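As a rough illustration of regex-based masking, the following Python sketch applies a list of patterns to a log line before it is written. The patterns, replacement strings, and the `redact` function are hypothetical examples, not Splice Machine's configuration format.

```python
import re

# Hypothetical masks: each entry pairs a compiled pattern with its replacement.
MASKS = [
    # US Social Security number, e.g. 123-45-6789
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
    # password=<value>; keep the key, hide the value
    (re.compile(r"(password\s*=\s*)\S+", re.IGNORECASE), r"\1[REDACTED]"),
]


def redact(line):
    """Apply every mask in order and return the sanitized log line."""
    for pattern, replacement in MASKS:
        line = pattern.sub(replacement, line)
    return line


masked_sql = redact("INSERT INTO patients VALUES ('123-45-6789')")
masked_conn = redact("connect user=bob password=s3cret host=db1")
```

Keeping the masks in a single list means a new category of sensitive data can be covered by adding one pattern, without touching the code that writes the logs.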
Data Science Productivity
Support for Jupyter notebooks: Jupyter notebooks are by far the most popular open-source notebook implementation. In Splice Machine 3.0 Jupyter notebooks are the standard. Splice’s native Jupyter support comes with JupyterHub as well as BeakerX.
- JupyterHub is the best way to serve Jupyter notebooks to multiple users. Each user has their own dedicated server for hosting, storing and running their Jupyter notebooks, and users are guaranteed isolation from others. With the added ability to link a GitHub account, notebooks can be easily shared between users for faster collaboration and development.
- BeakerX is an added layer that sits on top of Jupyter, providing a number of powerful features:
  - Polyglot programming. With independent kernels for each language, Splice Machine allows programming in multiple languages within a single Jupyter notebook, from SQL to R to Python, and even Java and Scala. This greatly speeds up development by allowing feature engineering and experimentation to occur in the same place.
  - Cross-kernel variable availability. BeakerX's global namespace creates the opportunity to build cross-language models. You can store variables in the global beakerx object in your Python kernel and access that data in your R kernel. You can even SELECT INTO a variable in SQL and access it from any other kernel. This saves considerable time for data scientists who want to quickly get in and analyze subsets of their data.
Model Workflow Management: With our new MLManager platform, based on MLflow, we've closed the loop on the machine learning lifecycle. Our improved API makes it quicker and easier to manage your ML development, from bulk logging of model parameters and metrics to full visibility into pipeline stages and feature transformations. With just a few added lines of code, you can recreate any ML pipeline in seconds.
Because everything is stored within Splice Machine, it’s especially easy to maintain governance of your models. Direct access to the training and testing tables allows you to guarantee new models are evaluated on the same data as currently deployed ones.
- New in-database, transactional machine learning model deployment: a single function enables real-time scoring of data, driven by database triggers
- Automatic model instrumentation
- MLflow 1.1.0
- Platform agnostic: Models, artifacts and metadata persisted in Splice Machine
- Splice support for SQLAlchemy
Major Platform Upgrades
- Support for Cloudera 6.3 and HWX 3.2.3
- Latest versions of Apache Spark and Apache HBase: Splice Machine 3.0 enables users to leverage the underlying functionality offered by HDFS 3.0, HBase 2.0 and Spark 2.4.1
Kubernetes Support – Native Spark Data Source (NSDS)
NSDS 2.0 streams DataFrames across the container/network boundary to Splice, offering a high-throughput solution that is implemented behind the scenes with Kafka.
Figure: NSDS enables data engineers and data scientists to operate Splice on Spark DataFrames, avoiding JDBC/ODBC protocol serialization/deserialization (serde) and network overhead.