The Secret to Operational ML: A Feature Store
How have billion-dollar businesses built on machine learning overcome the challenges of development, deployment, and governance at scale? Most share a common thread: they have invested millions of dollars in people and technology to build the linchpin of operational machine learning: a feature store.
A feature store is a centralized repository of continuously updated, raw and transformed data for machine learning. It enables better models to be created faster with reusable features and makes model and feature governance possible for explainability and transparency. But how can other companies benefit from a feature store without going through the engineering effort that Uber, Airbnb, Netflix, Apple, Comcast and others had to?
Machine Learning Is Stuck
Few businesses are deriving the full value of machine learning in their daily operations. It takes an army of data scientists to deploy ML throughout the enterprise because feature engineering remains too time-consuming, bogged down in mundane data tasks.
Each subsequent step of the machine learning lifecycle adds further complexity, hindering an enterprise's ability to get ML into production applications where it can have a tangible impact.
Why is Feature Engineering Hard?
- Too much infrastructure to connect, synchronize, and manage
- Duplicative feature engineering across the enterprise
- Custom coding to make ongoing training sets in production
- Hard to keep features up to date in real time
- Hard to maintain lineage & provide transparency
A Better Feature Store: With One Data Engine
Most feature stores incur extra cost, complexity, and latency because they must maintain both an online feature store and an offline feature store. Splice Machine is the only feature store powered by a single ACID-compliant RDBMS that handles both OLTP and OLAP workloads.
The Benefits of a Single-Engine Feature Store
- Easier to provision and operate
- Less infrastructure cost
- Easier to backup or replicate
- Native triggers enable event-driven pipelines
- No synchronization latency between online and offline data
- True ACID transactionality
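To make the "native triggers enable event-driven pipelines" benefit concrete, here is a minimal sketch of a trigger-driven feature pipeline, using SQLite from Python as a stand-in for the feature store's RDBMS. The table and column names are illustrative, not Splice Machine's actual schema.

```python
import sqlite3

# In-memory database standing in for a single-engine feature store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customer_id INTEGER, amount REAL);
CREATE TABLE customer_features (
    customer_id INTEGER PRIMARY KEY,
    txn_count   INTEGER,
    total_spend REAL
);
-- Each new transaction updates the customer's features inside the same
-- engine, with no separate step to synchronize online and offline copies.
CREATE TRIGGER update_features AFTER INSERT ON transactions
BEGIN
    INSERT INTO customer_features (customer_id, txn_count, total_spend)
    VALUES (NEW.customer_id, 1, NEW.amount)
    ON CONFLICT(customer_id) DO UPDATE SET
        txn_count   = txn_count + 1,
        total_spend = total_spend + NEW.amount;
END;
""")

# Inserting raw events is all a producer has to do; features stay current.
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 20.0)])
row = conn.execute(
    "SELECT txn_count, total_spend FROM customer_features WHERE customer_id = 1"
).fetchone()
print(row)  # → (2, 15.0)
```

Because the trigger runs inside the same transactional engine as the raw data, the feature update commits or rolls back atomically with the event that caused it.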
“We had this sort of a feature store at Airbnb, but it was limited by the fact that we were largely on HDFS. It enabled users to share features, but it didn’t solve the online/offline problem. But the solution can obviously be much more elegant if you start with a more amenable database that can function in realtime. Splice Machine seems to be doing exactly that – MLflow integration, database re-injection, Spark lazy loading, easy deployment, and API-less access.”
– Robert Yi, CDO at Dataframe and former Airbnb data scientist
The Fastest Way to a Feature Store
As the provider of the only scale-out SQL RDBMS with built-in machine learning, Splice Machine has driven advancements that others did not think possible. Unlike other feature stores, the Splice Machine Feature Store is built on a database. This delivers simplicity, scalability, and speed, both in implementation and operation.
By choosing the Splice Machine Feature Store over a single-cloud option, companies can avoid cloud vendor lock-in and retain the option of on-premises hosting.
Key Capabilities of the Splice Machine Feature Store
- Architectural simplicity
- Horizontal scalability
- Low latency lookups
- Point-in-time consistency for training
- Scalable in-database ELT transformations in both SQL and Python for feature pipelines
- Event-driven and batch feature updates
- SQL transforms
- ACID compliance between “online” and “offline” data
- Automatic feature history
- SQL or Python feature retrieval
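Point-in-time consistency for training, listed above, means each training example is joined with the feature values that were current at the example's timestamp, never a later value that would leak future information. A minimal sketch with illustrative data (this is not Splice Machine's API):

```python
from bisect import bisect_right

# Feature history: (timestamp, value) pairs, sorted by timestamp.
feature_history = [(1, 100.0), (5, 120.0), (9, 90.0)]
timestamps = [ts for ts, _ in feature_history]

def feature_as_of(ts):
    """Return the latest feature value recorded at or before ts."""
    i = bisect_right(timestamps, ts) - 1
    return feature_history[i][1] if i >= 0 else None

# Each label observed at time ts gets the feature as it existed then,
# so the training set never sees values from the label's future.
labels = [(4, 0), (6, 1), (10, 0)]          # (timestamp, label)
training_set = [(feature_as_of(ts), y) for ts, y in labels]
print(training_set)  # → [(100.0, 0), (120.0, 1), (90.0, 0)]
```

A feature store with automatic feature history can perform this as-of join in SQL across every feature at once; the sketch above just shows the rule for a single feature.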
Need More Than a Feature Store?
Splice Machine also offers ML Manager, an end-to-end machine learning platform that tackles the most effort-intensive problems in the machine learning workflow.
For example, with Splice Machine ML Manager, you can:
- Run a real-time model that could predict fraudulent transactions moments after they occur
- Deploy that model using the power of the database in one line of code
- Schedule automatic model retraining and champion/challenger systems so the model is always trained on the most recent available data and improves over time
- Track which features are being used in that model and how those features change over time
- Create, reuse, and share new features for the existing model, and automatically backfill those new features across the data's history
- Define a training dataset for the model and have that training set update with new data automatically as it becomes available
- Access the entire history of transactions, wherever it is stored, to train the model
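The backfill capability in the list above can be sketched in a few lines: once a new feature transformation is defined, it is applied across the full history of raw data so training sets immediately include the feature's complete past. The data, threshold, and function names below are illustrative, not part of any Splice Machine API.

```python
# Historical raw records (illustrative).
historical_txns = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 20.0},
]

def large_txn_flag(row):
    """Newly defined feature: flag transactions at or above a threshold."""
    return 1 if row["amount"] >= 25.0 else 0

# Backfill: compute the new feature for the entire history in one pass,
# so models can train on its full past, not just values from today onward.
for row in historical_txns:
    row["large_txn_flag"] = large_txn_flag(row)

print([r["large_txn_flag"] for r in historical_txns])  # → [0, 1, 0]
```

In a database-backed feature store, this backfill would typically be a single SQL `UPDATE` or materialized transformation rather than a Python loop, but the principle is the same.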