
Building ML models and moving them into production is still too hard. Creating, sharing, and managing features is a goal every data science team shares. Maintaining data and model lineage over time is now a must-have, not a nice-to-have. Splice Machine uniquely streamlines ML development at scale, from data preparation to model training and deployment.

Data Science Challenges

| Before Splice Machine | After Splice Machine |
| --- | --- |
| Siloed without access to production data until it’s stale | RDBMS and ML platform are co-located for immediate access, bringing the data to the data scientist |
| Spending too much time managing infrastructure, not doing data science | Infrastructure is abstracted away; the full database interface is accessible through cloud-managed JupyterHub |
| Subpar models built on small subsets of data, without the ability to scale | Native Spark-database integration gives API access to massive datasets for analytics and modeling |
| Spending time finding, updating, and managing the dependency web of open-source libraries and tools | Out-of-the-box integration of Jupyter, scikit-learn, Keras, TensorFlow, H2O, Spark MLlib, and Conda, with easy expansion |
| Unmanaged and isolated modeling environments, leading to non-reproducible experiments | Database-embedded MLflow provides simple packaging methods for reproducible models |
| Difficult handoffs between data engineers, data scientists, and application developers | One integrated platform for application developers, data engineers, and data scientists |
| Non-repeatable or nonexistent model deployment pipelines | One-click deployment options through API and UI interfaces, with scale built in |

Unprecedented Ease of Putting Models Into Production

The last mile of data science is now easy

A new approach: In-Database Model Deployment

Data scientists are not DevOps engineers, and they shouldn’t have to be. Put your models into production with one line of code, fully managed and tracked. Our new approach deploys models directly into tables in your database, creating “intelligent” tables. Interact with a model as you would with any other table, simply by inserting data; model inference happens automatically and scales with your database. Stop worrying about REST endpoints, network latency, and governance access: we put the models right next to the data. Get back to modeling and experimenting, and we’ll handle the rest.
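
A minimal sketch of what this looks like in a notebook. The `deploy_db` call and the table and column names are illustrative; consult the MLManager documentation for the exact signature in your release.

```python
# Illustrative sketch of in-database deployment with Splice's MLflow bindings.
from splicemachine.mlflow_support import *  # attaches Splice helpers to mlflow

run_id = "a1b2c3d4"  # placeholder: the ID of a previously tracked MLflow run

# One line: deploy the run's model into a table (parameters are illustrative).
mlflow.deploy_db("RETAIL", "CHURN_PREDICTIONS", run_id)

# From here on, inference is just SQL: inserting a row causes the model to
# run and populate the prediction column automatically.
splice.execute("""
    INSERT INTO RETAIL.CHURN_PREDICTIONS (CUSTOMER_ID, TENURE, MONTHLY_SPEND)
    VALUES (1001, 24, 79.50)
""")  # splice: a PySpliceContext created earlier in the notebook
```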

Endpoint Deployment

Deploy where it matters most. MLManager’s one-click API and UI deployment options enable rapid endpoint deployments to SageMaker and AzureML. Seamlessly integrate with pre-built CI/CD pipelines through GitHub and other open-source tools.
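
MLManager exposes this as a single click; for scripted CI/CD pipelines, the open-source MLflow deployments API offers an equivalent path. A sketch, assuming AWS credentials and a registered model named `churn_model` (both placeholders) are already configured:

```python
from mlflow.deployments import get_deploy_client

# Create a SageMaker endpoint from any model tracked in MLflow.
client = get_deploy_client("sagemaker")
client.create_deployment(
    name="churn-endpoint",                       # endpoint name (placeholder)
    model_uri="models:/churn_model/Production",  # any MLflow model URI works
)
```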

Model Development

Best-in-class developer environment lets teams collaborate and use any framework

Polyglot Programming and Powerful Visualization

Accelerate your data science flow by seamlessly sharing variables between SQL, Python, Scala, R and more. Use all of your favorite languages in the same notebook and stop bouncing from notebook to notebook depending on language. Utilize powerful libraries like D3.js and BeakerX’s TableDisplay for real-time interactive demos and tools. Let visualizations and demos be your accelerators, not your bottlenecks.
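
For example, here is a sketch of BeakerX in a Python notebook cell; the autotranslation import path varies slightly across BeakerX versions, so treat it as illustrative:

```python
import pandas as pd
from beakerx import TableDisplay
from beakerx.object import beakerx  # autotranslation object (version-dependent path)

df = pd.DataFrame({"feature": ["tenure", "spend"], "importance": [0.62, 0.38]})

# Publish a variable; %%scala, %%sql, or JavaScript cells in the same notebook
# can read it back as beakerx.importances.
beakerx.importances = df.to_dict("records")

# Render an interactive, sortable table instead of static text output.
TableDisplay(df)
```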

Out-of-the-Box Model Libraries

Don’t be held back by your team’s available libraries or compute resources. Use the right tool for the job, and scale when needed with Spark, H2O, Keras, TensorFlow, Spark MLlib, and scikit-learn built in. Stop worrying about versioning nightmares: with one JupyterHub instance managing all of your Jupyter environments, you can keep all of your libraries in sync. And with our database-embedded MLflow, any logged model gets an immediate snapshot of its library versions and Python version.

Scale Data Analysis with the Native Spark Datasource

Stop limiting your models to the size of your Pandas DataFrames. Get the most out of your data with our native Spark integration. The Native Spark Datasource provides Python, Scala, and Java APIs to all of your data via Spark DataFrames, with no serialization over JDBC/ODBC protocols.
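
A short sketch of the Python API (connection configuration is environment-dependent and omitted here; the table name is a placeholder):

```python
from pyspark.sql import SparkSession
from splicemachine.spark import PySpliceContext

spark = SparkSession.builder.getOrCreate()
splice = PySpliceContext(spark)  # connects to the co-located database

# A distributed Spark DataFrame over the full table: no Pandas memory ceiling,
# no row-at-a-time serialization over JDBC/ODBC.
df = splice.df("SELECT * FROM RETAIL.TRANSACTIONS")
print(df.count())
```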

Experimentation Tracking

Metadata management and experiment tracking keep teams coordinated on an ongoing basis

Integrated MLflow

Track all of your experimentation efforts, from parameters to metrics to artifacts to models, with a simple and intuitive API. Keep everything in one place with one source of truth. Seamlessly share results through the industry-standard MLflow UI, reassured that it’s all stored in a durable, scalable, persistent database. Store your experiments and models right next to the data that created them.
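
The tracking calls are standard open-source MLflow; with the database-embedded deployment, the same calls persist to Splice Machine. A minimal sketch:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy stand-in data
model = RandomForestClassifier(n_estimators=200, max_depth=8).fit(X, y)

# Parameters, metrics, and the model itself all land in one queryable run.
with mlflow.start_run(run_name="churn-rf-baseline"):
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")
```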

One Function Logging

Log your entire Spark pipeline, or all of your feature transformations from end to end, with one line of code. Use MLflow’s autolog functionality to track your Keras models without having to think about it. Focus on development; we’ll handle the tracking.
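
With autologging, the instrumentation collapses to one call; any subsequent `fit()` in a supported framework (Keras among them) is tracked automatically:

```python
import mlflow

# Hyperparameters, per-epoch metrics, and the trained model artifact are all
# captured on every subsequent fit() call, without further tracking code.
mlflow.autolog()
```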

Store Artifacts in Database and Git

Store your trained models, confusion matrices, environment config files, and anything else you need directly in the database with one line of code. Keep your artifacts close and easily shareable. Open source more your style? Easily integrate with GitHub without leaving your workspace.
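
Artifact logging is the same one-line pattern (the file names below are placeholders for files produced earlier in the run):

```python
import mlflow

with mlflow.start_run():
    mlflow.log_artifact("confusion_matrix.png")  # any file becomes a tracked artifact
    mlflow.log_artifact("conda.yaml")            # e.g., the environment config
```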

Enterprise-Scale Feature Store

An enterprise-scale feature store lets teams share and collaborate on features and eliminates duplicate work

Shareable Features

It’s crucially important for data scientists to avoid repeating the work of their teammates. Spending hours building predictive features should only have to happen once, not every time an experiment needs to be tested. Simple, shareable feature stores are key to building highly productive data science teams. Create useful features, share them with your team, and keep them up to date. It’s as simple as that.
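
A hypothetical sketch of that workflow; the class, method, and parameter names below are illustrative rather than Splice’s exact feature store API:

```python
from splicemachine.features import FeatureStore

fs = FeatureStore(splice)  # splice: an existing PySpliceContext (illustrative)

# Define a feature set keyed by customer, then register a feature in it once;
# every teammate can now discover and reuse it.
fs.create_feature_set(schema_name="RETAIL", table_name="CUSTOMER_FEATURES",
                      primary_keys={"CUSTOMER_ID": "INTEGER"})
fs.create_feature(schema_name="RETAIL", table_name="CUSTOMER_FEATURES",
                  name="avg_monthly_spend", feature_data_type="DOUBLE",
                  desc="Trailing 90-day average spend per customer")
```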

RDBMS Enables Real-Time Feature Updating

Keep your models trained on the most up-to-date set of features. Feature stores ensure stale data isn’t being used by tracking when features are updated and what they were updated to.

Models as Features

Deploy models directly to your feature store to add real-time intelligence to your feature sets. Gain unprecedented access to real-time machine learning.

Event-Driven RFM Aggregation

Utilize database triggers to execute arbitrary SQL, Java or Python on an event-driven basis. Keep all of your real-time features up to date without human intervention.
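
A sketch of one such trigger, written in the Derby-style SQL that Splice Machine uses (exact syntax may vary by version; table and column names are placeholders):

```python
# Keep a recency feature current on every new transaction, with no batch job.
splice.execute("""
    CREATE TRIGGER update_last_purchase
    AFTER INSERT ON RETAIL.TRANSACTIONS
    REFERENCING NEW AS NEW_ROW
    FOR EACH ROW
        UPDATE RETAIL.CUSTOMER_FEATURES
        SET LAST_PURCHASE_TS = NEW_ROW.TXN_TS
        WHERE CUSTOMER_ID = NEW_ROW.CUSTOMER_ID
""")  # splice: a PySpliceContext created earlier
```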

Model Governance

Model Monitoring With Prediction Materialization

Keep every prediction input with its output, enabling straightforward model monitoring. Understand with confidence why each decision was made.
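
Because each input row is stored with its prediction, monitoring reduces to a query. A sketch against the illustrative table from the deployment example above (the PREDICTION column name is a placeholder):

```python
# Audit recent predictions alongside the exact inputs that produced them.
recent = splice.df("""
    SELECT CUSTOMER_ID, TENURE, MONTHLY_SPEND, PREDICTION
    FROM RETAIL.CHURN_PREDICTIONS
""")
recent.show(20)
```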

Lineage with Time Travel

Splice Machine keeps prior versions of rows and allows you to query the database as it existed at some point in the past, when the data was in a different state. INSERT, UPDATE, and DELETE operations all create new versions of rows, and time-travel functionality allows tables to be queried at any point within a configurable time horizon, regardless of how many times a row has been inserted, deleted, or updated. Trace models back to their roots by recreating the training data as it was at the time of training. Query tables at past timestamps to gain insight into model drift and deterioration. Use what-if capabilities to analyze how models would have performed at various points in time.
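
A sketch of a time-travel query; the `AS OF` clause below is illustrative of SQL temporal syntax, so check the Splice Machine documentation for the exact form in your version:

```python
# Reconstruct a feature table exactly as it looked when the model was trained.
historic = splice.df("""
    SELECT * FROM RETAIL.CUSTOMER_FEATURES
    AS OF TIMESTAMP('2021-03-01 00:00:00')
""")
```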

Simple Sandbox Environments

Easily Build Sandboxes

Splice Machine Ops Center allows a comprehensive data platform to be deployed to virtually any infrastructure in minutes. This eliminates the friction between data platform administrators and data scientists, guaranteeing that production workloads are never impacted while development environments remain identical.

JupyterLab and JupyterHub Workspace

Create dedicated workspaces for each data scientist, with straightforward collaboration through MLflow. Integrate open-source tools like GitHub through Jupyter Notebook and JupyterLab extensions. Customize each environment to the unique preferences of each teammate, and maintain those environments with the integrated Conda management system.

Watch Our ML Manager Demo

Check out our webinar hosted by Splice Machine’s Ben Epstein