The Compliance Nightmare Lurking in Your Data Science Team

Protect your business from unpredictable analytics using a feature store

Monte Zweben and Jack Ploshnick
May 6, 2021

When your team makes a business decision, they have to be able to justify it. Financial organizations have to document the research that led to a particular piece of investment advice, insurance organizations need to be able to explain why a particular policy was denied, and retailers need to explain any personalized promotional behavior.

Business Before Advanced Analytics

When people are the key decision makers in an organization, justifying decisions isn’t necessarily an easy task, but the process for doing so is straightforward. There is a review process, forms are signed, boxes are checked. Months or years after an important decision was made, you can look back and see who made the decision, when they made it, and why.

However, the decision-making process is changing. Increasingly, decisions are informed less by people’s opinions and more by algorithms coming out of your data science team. Automating crucial business decisions is undoubtedly a good thing, as more accurate decisions can be made in less time, but the same standards of accountability still need to be met.

Unexplainable Decisions

If you approve or deny insurance claims, or even show recommendations to users on your website, based on an algorithm, you need the same kind of accountability you have when a decision is made by people: what algorithm made the decision, when, and why. In many industries, this kind of oversight isn’t just a nice thing to have, it’s a regulatory requirement.

Is your company’s data science team prepared to answer tough questions about their models? If a regulator comes knocking and asks why your loan pre-approval algorithm, or fraud detection algorithm, or product recommendation algorithm, seems to discriminate against one group of customers, will your team be prepared? Can they offer an explanation of how their model works immediately, or will it take months of internal investigation only to find that the data used to train a model can’t be identified?

A data science manager at a Fortune 100 bank told us that their data science organization operates like the Wild West. Different teams often use different technologies, build bespoke model approval processes, and have documentation procedures used nowhere else in the company. This isn’t just inefficient from an engineering perspective, it’s a compliance nightmare in the making.

The Unsustainable Status Quo

How can you protect your business from unexpected machine learning predictions? The most common method thus far has been to lock your data science team into rigid ML platforms. In these types of systems, data scientists are forced into single-use data flows, where all relevant data is stored within the platform itself. Pipeline-based machine learning platforms are excellent for getting your first one or two models up and running: A few static files with gigabytes of data can be added to the system, a pipeline can be built, and a model trained with little to no code required.

But what happens when your organization matures? If you want more than one or two models in production, if the data fed into your models becomes too large to store inside your modeling environment, or if you want to make predictions based on data collected seconds ago instead of days ago, most ML platforms will only slow your data science team down. To track data lineage at real-world scale, companies of all sizes are turning to feature stores.

What is a feature store? A feature store is a centralized repository of continuously updated, raw and transformed data attributes, called features, used for analytics. In other words, all of the inputs to analytic models are stored in one persistent, searchable, scalable location.
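To make the idea concrete, here is a minimal sketch of that "persistent, searchable" repository. The class name, schema, and method names are purely illustrative, not the API of any particular feature store product; the key property shown is that writes are append-only, so no feature value is ever overwritten.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeatureStore:
    """Toy feature store: every write is appended, never overwritten,
    so any historical value remains retrievable for auditing."""
    _rows: list = field(default_factory=list)  # (entity_id, feature, value, timestamp)

    def write(self, entity_id, feature, value, ts=None):
        """Record a new value for one feature of one entity."""
        self._rows.append((entity_id, feature, value, ts or datetime.now(timezone.utc)))

    def latest(self, entity_id, feature):
        """The most recent value -- what a live model would be served."""
        matches = [r for r in self._rows if r[0] == entity_id and r[1] == feature]
        return max(matches, key=lambda r: r[3])[2] if matches else None

store = FeatureStore()
store.write("cust_42", "credit_score", 710, datetime(2021, 1, 5, tzinfo=timezone.utc))
store.write("cust_42", "credit_score", 695, datetime(2021, 3, 2, tzinfo=timezone.utc))
print(store.latest("cust_42", "credit_score"))  # 695
```

Because old rows are kept rather than replaced, the same table serves live predictions and, later, compliance questions about what the model saw.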

Feature Stores in the Real World

In a real business use case, how would a feature store prevent compliance risk? Imagine a retail bank has an app that provides loan pre-approval. The algorithm takes as input data about past financial transactions, along with user characteristics such as credit score, homeownership status, and income level. A customer alleges that the loan pre-approval algorithm is discriminatory. Would your team be able to prove that it is not?

Your data science team might want to show that for the customer in question, the model denied a loan not because of discriminatory factors, but because of that customer’s transaction history. The trouble is, there are dozens of data pipelines collecting data from separate sources that generate transaction history metrics, and finding which data was available at the time a prediction was made isn’t possible.

With a feature store, however, data scientists can identify exactly what data was used to train a model, and what data was used to generate a particular prediction, in seconds. Regulators won’t have to take your data scientists’ word for it that their models aren’t biased – they can actually prove it. Moreover, end-to-end monitoring of data ensures that your team can adjust their models before bias is even introduced.
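The lookup that answers "what data generated this particular prediction" is a point-in-time query over the feature history. The sketch below is a hypothetical illustration (the log schema and function name are assumptions, not a vendor API): given the timestamp of the pre-approval decision, it reconstructs the feature vector as it stood at that instant.

```python
from datetime import datetime, timezone

# Hypothetical append-only feature log: (entity_id, feature, value, valid_from).
FEATURE_LOG = [
    ("cust_42", "credit_score", 710, datetime(2021, 1, 5, tzinfo=timezone.utc)),
    ("cust_42", "credit_score", 695, datetime(2021, 3, 2, tzinfo=timezone.utc)),
    ("cust_42", "avg_monthly_spend", 2400.0, datetime(2021, 2, 1, tzinfo=timezone.utc)),
]

def features_as_of(entity_id, as_of):
    """Reconstruct the feature vector the model saw at `as_of`: for each
    feature, the newest value written at or before that instant."""
    snapshot = {}
    for eid, name, value, ts in sorted(FEATURE_LOG, key=lambda r: r[3]):
        if eid == entity_id and ts <= as_of:
            snapshot[name] = value  # later writes overwrite earlier ones
    return snapshot

# Audit a pre-approval decision made on 15 Feb 2021:
print(features_as_of("cust_42", datetime(2021, 2, 15, tzinfo=timezone.utc)))
# {'credit_score': 710, 'avg_monthly_spend': 2400.0}
```

Note that the March credit-score update is correctly excluded: the audit returns what the model actually saw in February, not the customer's current data.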

How to Get Started

Protecting against compliance risk using a feature store doesn’t require you to uproot your entire data infrastructure. A feature store can be implemented in weeks, connect to any data source, be deployed on any cloud or on premises, and enhance the ML framework your data science team currently uses. Data from your data warehouse, database, or real-time event stream can feed directly into the feature store, where features are versioned and then sent to any modeling and model deployment environment.

For more information about how feature stores turn data into business value, check out this blog post.