Building Production-Ready Machine Learning Models
March 4, 2020
See a new scale-out RDBMS with native MLOps that makes it easy to develop, manage, deploy and govern ML models.
by Monte Zweben & Ben Epstein
With the quality of machine learning packages being developed today, testing and creating models has never been easier. Data scientists can simply import their favorite library and have immediate access to dozens of cutting-edge algorithms. But creating machine learning models that are production-ready requires more than effective algorithms. It takes experimentation, data exploration, and solid organization.
Data scientists need an environment where they can freely explore the data and plot different trends and correlations without being constrained by the size of their dataset. Splice Machine’s ML Manager 2.0 platform makes it seamless to build, test, experiment with, and deploy machine learning models into production.
Here is a summary of data science functionality that is available in the latest release of ML Manager.
As part of ML Manager 2.0, we have built Jupyter notebooks directly into the Splice Machine platform, with BeakerX support. Jupyter notebooks are the most popular tool for data scientists to work with data and share it along with code. They are easy to use and allow the workflow to be modularized by separating parts of the data science process into different cells.
In ML Manager 2.0, we went beyond providing access to generic Jupyter notebooks: data scientists get custom notebooks with added functionality from BeakerX. We’ve worked with the open source BeakerX project and custom-built it for Splice Machine. One of the most important features of this customization is polyglot programming, a paradigm that allows data scientists to write code in a number of different languages within the same notebook. Each cell can be defined using a “magic” that tells Jupyter how to interpret the code.
Data scientists have the flexibility to define the language for an entire cell using %% or just a single line using %. This allows you to write SQL code in the same notebook as your Python, Scala, Java or R code.
In addition to polyglot programming, we also allow cross-kernel variable sharing: you can define a variable in one language and share it with another language through the underlying BeakerX object. This gives data scientists unprecedented access to their SQL data. For example:
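As a sketch of what cross-kernel sharing looks like in practice (the `beakerx` autotranslation object is standard open source BeakerX; the cell magics and exact setup in the Splice Machine notebooks may differ, and the variable name is hypothetical):

```python
# Cell 1 (Python): publish a value to the shared BeakerX object
from beakerx.object import beakerx
beakerx.churn_threshold = 0.75  # hypothetical variable

# Cell 2, written in another language via its cell magic, e.g.:
# %%scala
# val threshold = beakerx.churn_threshold  // same value, read from Scala
```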
There are a number of other great features from BeakerX that are built directly into Splice Machine.
Jupyter notebooks also work seamlessly with great visualization libraries such as matplotlib and Plotly for 2D and 3D plots:
Data scientists also have access to interactive Pandas tables, where you can filter and sort results as well as add visualizations directly into the tables:
Native Spark DataSource — PySpliceContext
A serious problem in data science is overfitting. Overfitting can result from a number of different factors, including models that are too finely tuned to the training data, biased splits of the dataset, and even choosing the wrong algorithm for the problem. All of these issues can be overcome, however.
Another problem that has been challenging data science teams is data size. The hardest form of bias to overcome is a skewed dataset. Many data science teams run into trouble by prototyping their model on a small subset of the data and then deploying it against the entire dataset. This has the potential to introduce massive bias into the models.
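To make the subsetting risk concrete, here is a small self-contained illustration (plain Python, no Splice APIs) of how a naive “first N rows” sample can badly misrepresent class balance when the stored data happens to be ordered:

```python
from collections import Counter

# Toy dataset of 10,000 labeled rows stored sorted by label, as often
# happens when data was exported from a database with an ORDER BY.
labels = ["fraud"] * 500 + ["ok"] * 9500

def class_ratio(rows):
    """Fraction of rows labeled 'fraud'."""
    return Counter(rows)["fraud"] / len(rows)

full_ratio = class_ratio(labels)           # 500 / 10,000 = 5% fraud overall
subset_ratio = class_ratio(labels[:1000])  # naive 'first 1,000 rows' sample

print(f"full dataset fraud rate: {full_ratio:.1%}")    # 5.0%
print(f"naive subset fraud rate: {subset_ratio:.1%}")  # 50.0%
```

A model prototyped on that subset would see a 50% fraud rate and learn a wildly different decision boundary than the real 5% population warrants.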
It is therefore crucial to train the model on the entirety of the available data, and with Splice Machine this is no longer an issue. Because our Jupyter notebooks are deployed co-resident with our database, data scientists have direct access to the scale-out capabilities needed to work with their entire datasets instantly. Just as importantly, data engineers and data scientists use a different programming pattern than application developers: they manipulate Spark and Pandas DataFrames. Splice now offers this pattern as well, with a full CRUD API. Using our Native Spark DataSource, you can query an entire table, no matter its size, and immediately work with it as a Spark DataFrame. In milliseconds, and with a single API call, you can move all of the data from a table into a Spark DataFrame and display it on screen. Even more powerful is the ability to insert or even upsert a DataFrame into Splice with full ACID transactionality. This level of data access and manipulation makes model building and data exploration easier than ever before.
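A minimal sketch of this pattern, assuming the `splicemachine` Python package and a running Splice cluster (the table names are hypothetical):

```python
from pyspark.sql import SparkSession
from splicemachine.spark import PySpliceContext  # Native Spark DataSource

spark = SparkSession.builder.getOrCreate()
splice = PySpliceContext(spark)

# Pull an entire table, whatever its size, straight into a Spark DataFrame
df = splice.df("SELECT * FROM RETAIL.TRANSACTIONS")

# ...explore, join, and engineer features on df...

# Write results back with full ACID transactionality
splice.insert(df, "RETAIL.TRANSACTION_FEATURES")
# or insert-or-update against the table's primary keys:
# splice.upsert(df, "RETAIL.TRANSACTION_FEATURES")
```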
Imagine being able to update feature stores transactionally, in real time, for multiple models to use. These feature stores can have tens of thousands of columns tracking customer behavioral patterns that can be instantly turned into a feature vector in a DataFrame for a model pipeline to test. Because the feature store is updated the moment events occur, models can test features in the moment. What was the transaction that occurred seconds before a call center call? What was the last line item added to an order before an abandonment? What was the last click on a website? With a real-time feature store and a Native Spark DataSource, ML becomes far more valuable.
Modeling on the Splice Machine platform is just as easy. With Spark, Scikit-learn, TensorFlow, H2O, and PyTorch available, you can work in your favorite library. For Scikit-learn, we can utilize skdist to train many scikit-learn models simultaneously on Spark (think cross-validation models). Future work will include TensorFlowOnSpark.
All of these great features bring with them lots of complexity. Now that you have total access to your data and modeling efforts, how do you build a team around this? How do you organize your work? How do you maintain governance? How do you deploy your models? These questions and many others have brought the term MLOps into view. MLOps is the process of operationalizing the data science workflow for production-ready teams. Data science differs from other engineering efforts in that it is an experimental process, and it requires a different approach to structure and organization.
A First Pass
If you’ve ever seen a spreadsheet like this, you know the horror of its creation. This is a typical data science run book spreadsheet: different experiments, runs, parameters, datasets, metrics, and everything else that may be necessary for a data science experiment, all stored in a single Excel spreadsheet.
This is unusable for any serious data science team. It takes a long time to create, requires a redesign for every new project, and is prone to user error as different versions fly around people’s inboxes as attachments. The data scientist in you may look at this sheet and ask, “What if I want to tune another parameter, or try a Random Forest? How would that fit into this spreadsheet?” Your fears are well founded, and the answers are “you can’t” and “it doesn’t.” This is where MLManager comes in.
MLManager (and MLFlow)
Splice Machine’s MLManager took the popular open source project MLFlow and added functionality that we think “completes the ML Lifecycle.” With MLFlow, you can easily and dynamically track anything and everything that pertains to your model attempts: parameters, metrics, training time, run names, model types, artifacts like images (think AUC charts), even serialized models. With Splice Machine’s MLManager, all of those metrics, parameters, and artifacts are stored directly in the database, with no need for an external storage mechanism.
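Because MLManager builds on the standard MLFlow tracking API, logging a run looks like ordinary MLFlow code. The names and values below are purely illustrative:

```python
import mlflow

with mlflow.start_run(run_name="rf_100_trees"):
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("num_trees", 100)
    mlflow.log_metric("f1", 0.87)
    # Artifacts can be any file: AUC charts, serialized models, notebooks...
    mlflow.log_artifact("auc_chart.png")  # assumes this file exists locally
```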
You can easily build a multitude of models, track everything you need, and compare them numerically or visually in the MLFlow UI. For example, above we can see the metrics of a number of different runs created within an experiment, and below we dig into three of those runs and see how the number of trees in our Random Forest relates to our F1:
Data scientists can even post the notebooks they used to build their models to GitHub gists in order to share code, notes, and snippets for peer review and transparency. Given an MLFlow run_id, you can trace a model back to its origins. For a production-ready data science team, this is crucial to maintaining control over which models get deployed, and when.
Our straightforward API makes this incredibly easy and reusable: just a few function calls give you full traceability, no matter what modeling process you use. Four key functions we offer are:
1. log_feature_transformations — Logs all of the individual transformations each feature in your feature vector undergoes to arrive at its final state (i.e. one-hot encoding, normalization, standardization, etc.)
2. log_pipeline_stages — Logs all of the stages of your Spark pipeline (i.e. oversampling, encoding, string indexing, vector assembling, modeling, etc.)
3. log_model_params — Logs all of the parameters and hyperparameters of your model
4. log_metrics — Logs all of the metrics of your model as determined by our built-in Evaluator class
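Pulled together, a training run using these helpers might look like the sketch below. The import path and exact signatures reflect our reading of the `splicemachine` package and may vary by version; `pipeline` and `train_df` are assumed to be defined earlier in the notebook:

```python
from splicemachine.mlflow_support import *  # patches mlflow with the Splice helpers

with mlflow.start_run(run_name="churn_rf"):
    fitted = pipeline.fit(train_df)             # a Spark ML Pipeline
    mlflow.log_pipeline_stages(fitted)          # indexing, assembling, model, ...
    mlflow.log_feature_transformations(fitted)  # per-feature lineage to the vector
    mlflow.log_model_params(fitted)             # parameters and hyperparameters
```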
Once models have been compared and tested and your team is ready for deployment, the next major hurdle appears. Model deployment is a widely discussed topic in the data science world because it is complicated, hard to generalize, and difficult to oversee. Popular deployment mechanisms like AzureML and SageMaker are difficult to build into your pipeline and make governance hard to maintain.
Who deployed the model? Who is allowed to call the model? Who has made calls to the model in the last 24 hours? How is the model performing? How do you integrate new models into an application? What if we don’t want to deploy to the cloud for security concerns?
These questions are important ones, and ones without straightforward answers. With Splice Machine’s MLManager, we remove the complexity through in-database model deployment. Our platform allows you to deploy models directly into the database, so every time a new row is inserted into a table, your model is automatically and immediately run, its prediction stored, and the result traced back to the model in use. On top of all of that, it’s blazing fast, both to deploy and to use. Deployment takes less than 10 seconds (compared to nearly 30 minutes for SageMaker/AzureML deployment), and all model triggers are fully ACID compliant.
With in-database model deployment, security and governance are also easy for your DBAs to manage, because your model predictions live in a table like any other table in the database. Want to revoke access to the model? Revoke access to the table. Want to gather statistics on the model? Just a few SQL queries. Want to substitute one model for another in an application? Change the table name in the query. There is no learning curve because there is no new technology in use. Plus, since the predictions are just persistent data, you can architect microservices around the model as well.
How is this done? MLManager determines the type and structure of your model’s pipeline, creates tables and triggers specific to your dataset and model, and deploys them to the table of your choosing. The trigger is generated on the fly and fires immediately when new records are inserted into the chosen table. It calls a generated stored procedure that deserializes the model, applies it to the new records, and writes the predictions to a prediction table. Data scientists and data engineers don’t have to learn RDBMS triggers and stored procedures because they are generated automatically. For example, here is the generated trigger and function for a model deployment:
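The generated SQL has roughly the following shape. All names, types, and the external class here are illustrative stand-ins, not the literal generated code:

```sql
-- Fires on every insert into the watched table
CREATE TRIGGER MLMANAGER.CHURN_MODEL_TRIGGER
  AFTER INSERT ON RETAIL.CUSTOMERS
  REFERENCING NEW AS NEWROW
  FOR EACH ROW
  CALL MLMANAGER.PREDICT_CHURN(NEWROW.CUSTOMER_ID);

-- The stored procedure deserializes the model, scores the new row,
-- and writes the result to a prediction table
CREATE PROCEDURE MLMANAGER.PREDICT_CHURN(IN CUSTOMER_ID BIGINT)
  LANGUAGE JAVA PARAMETER STYLE JAVA MODIFIES SQL DATA
  EXTERNAL NAME 'com.splicemachine.example.ModelRunner.predict';
```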
If, however, your team still wants to deploy to SageMaker or AzureML, MLManager supports that too, both with a deployment UI and with API calls.
AI is becoming ever more critical to business success. Without it, your company risks being left behind by more agile, market-ready competitors. Yet injecting AI has been a major roadblock to success because of the serious risks and implications of poor model development environments and long time-to-deploy. You cannot risk putting a bad model into production to make mission-critical decisions. With Splice Machine, however, governance is fully integrated, experimentation is easier than ever, and behind-the-scenes plumbing is a thing of the past.
For example, consider an insurance company. How does Splice Machine ML Manager help you comply when a regulator starts questioning the model you use in underwriting? You time-travel back to a “virtual snapshot” of the database as it existed when you trained the model, and you show the features that went into the model and the statistical distributions of the data on those features at that time. Splice enables the data scientist to log the Snapshot Isolation timestamp of our MVCC transaction engine as metadata in MLFlow at the time of training. With time-travel, you can prove your compliance in minutes.
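In code, that logging step could look something like the sketch below. The timestamp query is a hypothetical stand-in for retrieving the Snapshot Isolation timestamp, and `splice` is a PySpliceContext as shown earlier:

```python
# Capture the database's current transaction timestamp at training time...
txn_ts = splice.df("VALUES CURRENT_TIMESTAMP").collect()[0][0]  # stand-in query

# ...and attach it to the MLFlow run so the training snapshot can be revisited
mlflow.set_tag("splice.training_snapshot_ts", str(txn_ts))
```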
Build models you can trust with our pinpoint governance architecture and access to full, unbiased data; deploy models with ease and in seconds; and use them in your mission-critical applications without added latency and with the same ACID compliance as all of your other tables. Integrate your models seamlessly without needing to learn any new architectures or platforms: your model is just a (superpowered) table. Monitor your model with the same techniques you use to monitor any other table, and when a new model is ready to replace it, redeploy in no time. Stop reinventing the wheel for every application you want to modernize.
For a hands-on look at Splice Machine and ML Manager 2.0, watch a demo video that highlights its powerful functionality and how it enables you to build production-ready ML models.