Data Science Challenges
|Before Splice Machine|After Splice Machine|
|Siloed without access to production data until it’s stale|RDBMS and ML platform co-located for immediate access, bringing the data to the data scientist|
|Spending too much time managing infrastructure, not doing data science|Infrastructure is abstracted away. The full database interface is accessible through cloud-managed JupyterHub|
|Subpar models built on small subsets of data without the ability to scale|Native Spark-database integration for API access to massive datasets for analytics and modeling|
|Spending time finding, updating, and managing the dependency web of all of the open source libraries and tools needed to do the job|Out-of-the-box integration of Jupyter, scikit-learn, Keras, TensorFlow, H2O, SparkML, and Conda, with easy expansion|
|Unmanaged and isolated modeling environments, leading to non-reproducible experiments|Database-embedded MLflow provides simple packaging methods for reproducible models|
|Difficulty with the handoff between data engineer, data scientist, and application developer|One integrated platform for application developers, data engineers, and data scientists|
|Non-repeatable or nonexistent model deployment pipelines|One-click deployment options through API and UI interfaces, with scale built in|
Unprecedented Ease of Putting Models Into Production
A new approach: In-Database Model Deployment
Data scientists are not DevOps engineers, and they shouldn’t have to be. Put your models into production with one line of code, fully managed and tracked. Our new approach deploys models directly into tables in your database, creating “intelligent” tables. Interact with a model as you would with any other table, just by inserting data. Model inference happens automatically and scales with your database. Stop worrying about REST endpoints, network latency, and access governance. We put the models right next to the data. Get back to modeling and experimenting; we’ll handle the rest.
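To make the “intelligent table” idea concrete, here is a minimal sketch that uses Python’s built-in sqlite3 module rather than Splice Machine itself. The table name, the `predict` function, and the hard-coded linear scorer are all hypothetical stand-ins. The sketch only illustrates the pattern of scoring rows as they are inserted; it is not Splice Machine’s deployment API.

```python
import sqlite3

# Hypothetical stand-in for a trained model: a fixed linear scorer.
def predict(age, income):
    score = 0.02 * age + 0.00001 * income
    return 1 if score >= 1.0 else 0

conn = sqlite3.connect(":memory:")
# Register the model so SQL statements (including triggers) can call it.
conn.create_function("predict", 2, predict)

conn.execute("""
    CREATE TABLE loan_applications (
        id INTEGER PRIMARY KEY,
        age REAL,
        income REAL,
        prediction INTEGER
    )
""")

# The trigger scores every new row, so callers only ever INSERT.
conn.execute("""
    CREATE TRIGGER score_on_insert
    AFTER INSERT ON loan_applications
    BEGIN
        UPDATE loan_applications
        SET prediction = predict(NEW.age, NEW.income)
        WHERE id = NEW.id;
    END
""")

conn.execute("INSERT INTO loan_applications (age, income) VALUES (?, ?)", (40, 50000))
conn.execute("INSERT INTO loan_applications (age, income) VALUES (?, ?)", (20, 10000))

rows = conn.execute(
    "SELECT id, prediction FROM loan_applications ORDER BY id"
).fetchall()
print(rows)
```

The caller never invokes the model directly: inserting a row is the whole interface, and the prediction column is filled in by the database.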
Polyglot Programming and Powerful Visualization
Accelerate your data science flow by seamlessly sharing variables between SQL, Python, Scala, R and more. Use all of your favorite languages in the same notebook and stop bouncing from notebook to notebook depending on language. Utilize powerful libraries like D3.js and BeakerX’s TableDisplay for real-time interactive demos and tools. Let visualizations and demos be your accelerators, not your bottlenecks.
Out-of-the-Box Model Libraries
Scale Data Analysis with the Native Spark Datasource
Stop limiting your models to the size of your pandas DataFrames. Get the most out of your data with our native Spark integration. The Native Spark Datasource provides Python, Scala, and Java APIs to all of your data via Spark DataFrames, without serialization or JDBC/ODBC protocols.
Track all of your experimentation efforts, from parameters to metrics to artifacts to models, with a simple and intuitive API. Keep everything in one place with one source of truth. Seamlessly share results through the industry-standard MLflow UI, assured that it’s all stored in a durable, scalable, persistent database. Store your experiments and models right next to the data that created them.
Store Artifacts in Database and Git
Store your trained models, confusion matrices, environment config files, and anything else you need directly in the database with one line of code. Keep your artifacts close and easily shareable. Open source more your style? Easily integrate with GitHub without leaving your workspace.
Enterprise-Scale Feature Store
It’s crucial for data scientists to avoid repeating the work of their teammates. Spending hours building predictive features should only have to happen once, not every time an experiment is run. Simple, shareable feature stores are key to building highly productive data science teams. Create useful features, share them with your team, and keep them up to date. It’s as simple as that.
Event-Driven RFM Aggregation
Utilize database triggers to execute arbitrary SQL, Java or Python on an event-driven basis. Keep all of your real-time features up to date without human intervention.
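As an illustration of trigger-maintained RFM (recency, frequency, monetary) features, the following sketch uses Python’s built-in sqlite3 module, not Splice Machine. The `orders` and `customer_rfm` tables and their columns are hypothetical; the point is only that an insert event keeps the feature row current with no separate batch job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        customer_id INTEGER,
        order_date TEXT,      -- ISO-8601 date string
        amount REAL
    );

    -- One row of RFM features per customer, maintained by the trigger below.
    CREATE TABLE customer_rfm (
        customer_id INTEGER PRIMARY KEY,
        last_order_date TEXT,   -- recency
        order_count INTEGER,    -- frequency
        total_spend REAL        -- monetary value
    );

    CREATE TRIGGER update_rfm
    AFTER INSERT ON orders
    BEGIN
        -- Create the feature row on a customer's first order.
        INSERT OR IGNORE INTO customer_rfm
        VALUES (NEW.customer_id, NEW.order_date, 0, 0.0);

        UPDATE customer_rfm
        SET last_order_date = MAX(last_order_date, NEW.order_date),
            order_count = order_count + 1,
            total_spend = total_spend + NEW.amount
        WHERE customer_id = NEW.customer_id;
    END;
""")

conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2021-01-05", 20.0), (1, "2021-02-10", 30.0), (2, "2021-01-20", 15.0)],
)

features = conn.execute(
    "SELECT * FROM customer_rfm ORDER BY customer_id"
).fetchall()
print(features)
```

Because the aggregation runs inside the same statement that records the order, the feature table is never older than the last insert.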
Lineage with Time Travel
Splice Machine keeps prior versions of rows and lets you query the database as it was at some point in the past, when the data was in a different state. INSERT, UPDATE, and DELETE operations all create new versions of rows. Time travel allows tables to be queried at any point within a configurable time horizon, regardless of how many times a row has been inserted, deleted, or updated. Trace models back to their roots by recreating the training data as it was at the time of training. Query tables at past timestamps to gain insight into model drift and deterioration. Use what-if capabilities to analyze how models would perform at various points in time.
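The row-versioning idea behind time travel can be sketched in a few lines. The example below uses Python’s built-in sqlite3 module and an explicit version table with hypothetical `valid_from` and `valid_to` columns and integer logical timestamps. It does not show Splice Machine’s own time travel syntax, which this page does not specify; it only illustrates how keeping old row versions makes “as of” queries possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_versions (
        customer_id INTEGER,
        balance REAL,
        valid_from INTEGER,   -- logical timestamp when this version was written
        valid_to INTEGER      -- NULL while the version is still current
    )
""")

def write_balance(customer_id, balance, ts):
    """Close the current version (if any) and append a new one."""
    conn.execute(
        "UPDATE customer_versions SET valid_to = ? "
        "WHERE customer_id = ? AND valid_to IS NULL",
        (ts, customer_id),
    )
    conn.execute(
        "INSERT INTO customer_versions VALUES (?, ?, ?, NULL)",
        (customer_id, balance, ts),
    )

def balance_as_of(customer_id, ts):
    """Return the balance that was visible at logical time ts, or None."""
    row = conn.execute(
        "SELECT balance FROM customer_versions "
        "WHERE customer_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (customer_id, ts, ts),
    ).fetchone()
    return row[0] if row else None

write_balance(1, 100.0, ts=10)
write_balance(1, 250.0, ts=20)
write_balance(1, 75.0, ts=30)

print(balance_as_of(1, 15))   # version written at ts=10
print(balance_as_of(1, 25))   # version written at ts=20
print(balance_as_of(1, 5))    # before the first write
```

Recreating training data “as it was” then amounts to running the training query with the timestamp at which the model was originally trained.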
Simple Sandbox Environments
Easily Build Sandboxes
Splice Machine Ops Center allows for a comprehensive data platform to be deployed to virtually any infrastructure in a handful of minutes. This capability eliminates the friction between data platform administrators and data scientists, guaranteeing that no production workloads are impacted while maintaining identical development environments.
JupyterLab and JupyterHub Workspace
Create dedicated workspaces for each data scientist with straightforward collaboration through MLflow. Integrate open source tools like GitHub through Jupyter Notebook and JupyterLab extensions. Customize each environment to the unique preferences of each teammate, and maintain those environments using the integrated Conda management system.