Point-in-Time Correctness in Machine Learning
Why is preventing data leakage so hard?
June 1, 2021
Data scientists know that when they build training sets, they need to watch out for data leakage so that a model is trained only on data it would actually have at prediction time. Data leakage occurs when a model is trained on information that would not have been available in the real world when the prediction was made. In time-series models, data leakage is typically caused by adding feature values to your training set that were generated after a given prediction would have been made. While all data scientists know data leakage is something they need to watch out for, actually building training sets without data leakage is rarely as straightforward as it seems, especially when machine learning models make predictions in real time.
Building training sets that don’t have data leakage is in fact so challenging that tech leaders such as Airbnb, Netflix, and Uber needed to build feature stores in order to consistently build accurate training sets. But not all feature stores actually solve the point-in-time correctness problem for data scientists.
The Point-in-Time Correctness Problem
When feature generation, predictions, and label generation occur at different points in time, data leakage can easily be introduced into your training sets. This is often called the point-in-time correctness problem. How might this problem pop up in a real-time machine learning application?
Imagine you have an e-commerce website that makes product recommendations. The features for this model might include:
- RFM metrics, such as the sum of products purchased by a user over the last week or month or year, calculated every week
- A summary of the items currently in a user’s cart, updated in real time
The label for this model might be whether the recommended product was actually purchased in the same web session.
A subtle, but important, complication in training a model like this is wrangling the many different timestamps that are present.
For this machine learning problem, we have four timestamps to keep track of:
- When the weekly batch RFM feature aggregation occurs
- When products are added to the cart
- When a product recommendation is generated
- When the purchase is actually made
This is a classic example of the point-in-time correctness problem, which pops up whenever you have a real-time machine learning model. When building the training set, the data scientist should join to each training example the most up-to-date feature values that would have been available at prediction time, and no feature values that were generated after prediction time. Any feature value in a row of training data that was generated after the prediction would have been made constitutes data leakage, because the real-world model wouldn’t have had access to that data.
To make this example more concrete, imagine that the training data we have looks something like this:
Once implemented, your product recommendation algorithm might have made a prediction at 2021-01-01 9:43:25. If so, the most up-to-date features would be those observed prior to 2021-01-01 9:43:25: row 2 in the first feature set and row 1 in the second feature set.
If the prediction had instead occurred at 2021-01-01 9:37:25, the most up-to-date feature would be the first row in the first feature set, not the second row. A data scientist needs to ensure they haven’t included any data that occurred after prediction time in their training set.
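The row-picking logic in this example can be sketched with an as-of join. The snippet below is only an illustration using pandas `merge_asof`; the column names, feature values, and feature timestamps are hypothetical stand-ins, and only the two prediction times come from the example above.

```python
import pandas as pd

# Hypothetical feature sets: values and timestamps are made up for illustration.
rfm = pd.DataFrame({
    "user_id": [1, 1],
    "rfm_ts": pd.to_datetime(["2021-01-01 09:30:00", "2021-01-01 09:40:00"]),
    "purchases_last_week": [3, 4],
})
cart = pd.DataFrame({
    "user_id": [1],
    "cart_ts": pd.to_datetime(["2021-01-01 09:35:00"]),
    "cart_item_count": [2],
})

# The two prediction times discussed in the text.
predictions = pd.DataFrame({
    "user_id": [1, 1],
    "prediction_ts": pd.to_datetime(["2021-01-01 09:37:25", "2021-01-01 09:43:25"]),
})

# merge_asof needs both sides sorted on the time column. direction="backward"
# picks, per user, the latest feature row at or before each prediction time.
training_set = pd.merge_asof(
    predictions.sort_values("prediction_ts"),
    rfm.sort_values("rfm_ts"),
    left_on="prediction_ts", right_on="rfm_ts",
    by="user_id", direction="backward",
)
training_set = pd.merge_asof(
    training_set,
    cart.sort_values("cart_ts"),
    left_on="prediction_ts", right_on="cart_ts",
    by="user_id", direction="backward",
)
print(training_set)
```

With this toy data, the 9:37:25 prediction picks up the first RFM row and the 9:43:25 prediction picks up the second, while both see the single cart row. No feature row with a timestamp later than its prediction time is ever joined.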
While it is certainly possible for a data scientist to hand-write the point-in-time SQL joins that pick the most up-to-date features for every prediction time, it is a time-consuming and error-prone task. Organizations that have successfully implemented real-time machine learning at scale have had to come up with alternative solutions.
The easiest way to ensure that a model is being trained on point-in-time correct data is to simply log the feature values that were available at prediction time. Whenever a prediction is generated, log the feature values available at that time, and your next training set is built for you automatically! While log-based training is undeniably elegant in its simplicity (you know your features would be available at prediction time because you logged them at prediction time), it comes with two important drawbacks.
First, adding new features takes time. If you would like to add a new feature to your model, you have to start logging values for that feature now, collect enough data to train your model, and then train. Features are the most important part of any machine learning model, and log-based training makes it impossible to iterate and improve your models quickly.
Second, feature values can’t be shared across different models. If you want to train a new model that executes predictions at a different point in time, you can’t use the feature values you have already logged. Instead, you have to start logging values all over again.
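To see why log-based training is simple but rigid, here is a minimal sketch of the idea. The function names, the JSON log format, and the toy model are all assumptions made for illustration; a production system would write to a durable log rather than a Python list.

```python
import itertools
import json
from datetime import datetime, timezone

_prediction_ids = itertools.count(1)  # hypothetical id generator

def predict_and_log(model, features, log):
    """Score one request and record exactly the feature values the model saw."""
    prediction_id = next(_prediction_ids)
    score = model(features)
    log.append(json.dumps({
        "prediction_id": prediction_id,
        "prediction_ts": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "score": score,
    }))
    return prediction_id, score

def build_training_rows(log, labels):
    """Attach labels that arrive later to the features logged at prediction time."""
    rows = []
    for line in log:
        record = json.loads(line)
        label = labels.get(record["prediction_id"])
        if label is not None:
            rows.append({**record["features"], "label": label})
    return rows

# Toy usage: a stand-in "model" and one logged prediction.
log = []
toy_model = lambda f: 0.1 * f["cart_item_count"]
pid, _ = predict_and_log(toy_model, {"purchases_last_week": 3, "cart_item_count": 2}, log)
rows = build_training_rows(log, {pid: 1})
```

Both drawbacks are visible here. A feature that was not in the `features` dictionary at prediction time never reaches the log, and the logged rows are tied to the moments this particular model happened to score.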
AS OF SQL Functionality — The Simple Case
Another simple way to build point-in-time correct training sets is to use time-travel functionality. Many databases and data lakes allow you to query as of some point in the past, and you can simply query as of the time predictions were made. This solution works if predictions are always made at a certain time, such as each morning, but it doesn’t work if predictions are made at fluctuating times. It isn’t practical to have a separate AS OF query for each row of training data.
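For the fixed-schedule case, one snapshot per prediction time is enough. The sketch below only emulates an as-of read over a history table in pandas; it is not any particular database’s time-travel syntax, and the table layout and the `snapshot_as_of` helper are hypothetical.

```python
import pandas as pd

# Hypothetical history of feature writes (illustrative values).
history = pd.DataFrame({
    "user_id": [1, 2, 1],
    "updated_ts": pd.to_datetime([
        "2021-01-01 06:00:00", "2021-01-01 07:00:00", "2021-01-02 06:00:00",
    ]),
    "purchases_last_week": [3, 1, 5],
})

def snapshot_as_of(history, as_of, key="user_id", ts="updated_ts"):
    """Return the latest row per key written at or before `as_of`."""
    visible = history[history[ts] <= pd.Timestamp(as_of)]
    return visible.sort_values(ts).drop_duplicates(subset=key, keep="last")

# One snapshot serves every prediction made at the same scheduled time.
morning_1 = snapshot_as_of(history, "2021-01-01 09:00:00")
morning_2 = snapshot_as_of(history, "2021-01-02 09:00:00")
```

Each call stands in for one AS OF query. That works for one query per morning, but it stops being practical once every training row has its own prediction time.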
The Time-Series Feature Store
Feature stores were built in large part to solve the point-in-time correctness problem. Given a timestamp for each prediction, a feature store will automatically build a training set with only the features that were available at that time. Using a feature store, you can build training sets without having to wait to log new data, and without needing predictions to occur at predictable intervals. In comparison to manually logging long lists of feature values, feature stores require just a few lines of code.
How does it all work? There are a number of feature store architectures, but one method is to build a feature store using a hybrid (HTAP) SQL database, which can do both quick lookups and complex analytics. The HTAP database has two internal execution engines, one for operational workloads for low-latency lookups and updates and one for analytical workloads. The database’s cost-based optimizer automatically selects an execution engine by dynamically evaluating the query plan, and it maintains consistency between these engines.
In this approach, each feature set, or group of features, is stored as two tables in the database: one table with the most up-to-date values of any given feature, and one with a time-series history of past feature values. Using this historical table of feature values, the feature store can easily build a training set automatically using a simple API.
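The two-table layout can be sketched as follows. This uses an in-memory SQLite database purely as a stand-in, not an HTAP database, and the table names, columns, and helper functions are assumptions rather than any feature store’s actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Latest value per user: serves low-latency lookups at prediction time.
CREATE TABLE rfm_current (
    user_id INTEGER PRIMARY KEY,
    purchases_last_week INTEGER,
    updated_ts TEXT
);
-- Every value ever written: serves point-in-time training set generation.
CREATE TABLE rfm_history (
    user_id INTEGER,
    purchases_last_week INTEGER,
    asof_ts TEXT,
    PRIMARY KEY (user_id, asof_ts)
);
""")

def write_feature(conn, user_id, value, ts):
    """Overwrite the current value and append to the history in one transaction."""
    with conn:
        conn.execute("INSERT OR REPLACE INTO rfm_current VALUES (?, ?, ?)",
                     (user_id, value, ts))
        conn.execute("INSERT INTO rfm_history VALUES (?, ?, ?)",
                     (user_id, value, ts))

def value_as_of(conn, user_id, ts):
    """Return the feature value that was current at time `ts`, or None."""
    row = conn.execute(
        "SELECT purchases_last_week FROM rfm_history "
        "WHERE user_id = ? AND asof_ts <= ? ORDER BY asof_ts DESC LIMIT 1",
        (user_id, ts),
    ).fetchone()
    return row[0] if row else None

# Zero-padded ISO-style strings compare correctly as text.
write_feature(conn, 1, 3, "2021-01-01 09:30:00")
write_feature(conn, 1, 4, "2021-01-01 09:40:00")
current = conn.execute(
    "SELECT purchases_last_week FROM rfm_current WHERE user_id = 1").fetchone()[0]
```

Serving reads only the current table, while training reads the history table, so the same write feeds both paths.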
It is important to note that not all feature stores solve the point-in-time correctness problem for data scientists. Some feature stores do not have a mechanism to represent the timestamp of the training label and training sets in their create_training_set functions. In these feature store approaches, the complexity of joining disparate and asynchronous timestamps in a consistent and correct way is pushed down to the user.
Under the covers, the function shown above is automatically generating this complex SQL query with multiple joins and subqueries. In this case, three feature sets are joined, but a training set can be created from an arbitrary number of feature sets. Without a time-series feature store, the data scientist would have to write functions to manage these separate points in time manually.
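To give a feel for the kind of SQL involved, here is a hand-written point-in-time join over two hypothetical history tables, run through SQLite from Python. It is an illustrative sketch, not the query any feature store actually generates; the table names, columns, and data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labels (user_id INTEGER, label_ts TEXT, purchased INTEGER);
CREATE TABLE rfm_history (user_id INTEGER, asof_ts TEXT, purchases_last_week INTEGER);
CREATE TABLE cart_history (user_id INTEGER, asof_ts TEXT, cart_item_count INTEGER);
-- Hypothetical data; label_ts is the time each training example is anchored to.
INSERT INTO labels VALUES (1, '2021-01-01 09:37:25', 0), (1, '2021-01-01 09:43:25', 1);
INSERT INTO rfm_history VALUES (1, '2021-01-01 09:30:00', 3), (1, '2021-01-01 09:40:00', 4);
INSERT INTO cart_history VALUES (1, '2021-01-01 09:35:00', 2);
""")

# For every label row, each join keeps only the newest feature row
# written at or before that row's timestamp (one subquery per feature set).
POINT_IN_TIME_SQL = """
SELECT l.user_id, l.label_ts, r.purchases_last_week, c.cart_item_count, l.purchased
FROM labels AS l
LEFT JOIN rfm_history AS r
  ON r.user_id = l.user_id
 AND r.asof_ts = (SELECT MAX(h.asof_ts) FROM rfm_history AS h
                  WHERE h.user_id = l.user_id AND h.asof_ts <= l.label_ts)
LEFT JOIN cart_history AS c
  ON c.user_id = l.user_id
 AND c.asof_ts = (SELECT MAX(h.asof_ts) FROM cart_history AS h
                  WHERE h.user_id = l.user_id AND h.asof_ts <= l.label_ts)
ORDER BY l.label_ts
"""
rows = conn.execute(POINT_IN_TIME_SQL).fetchall()
```

Every additional feature set adds another join and another correlated subquery, which is why writing and maintaining this by hand gets error-prone quickly.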
With a time-series feature store, if a data scientist wants to train their model on a new feature, they simply specify their training label as well as that label’s timestamps and join keys. Then, a point-in-time training set is generated automatically.
To learn more about how feature stores can help prevent data leakage and see the unique advantages of a feature store built on an HTAP database, check out this hands-on demo where I show you how to use Splice Machine’s feature store.