How Data Science Silos Undermine Application Modernization
August 5, 2019
by Monte Zweben & Syed Mahmood of Splice Machine
It’s Time to Reframe Modernization
I frequently have conversations with business executives who manage custom-built applications in their organizations. When the topic converges on modernizing these custom and often legacy apps, they often proudly claim modernization is underway and explain to me that they are moving to the cloud, perhaps containerizing the app, and if ambitious, re-architecting the app to take advantage of microservices.
While these are worthwhile efforts, I rarely observe that they move the needle with respect to business outcomes. Being on the cloud does not improve personalization in a marketing application, reduce fraudulent claims in an insurance system, or optimize the supply chain of a manufacturer. The applications may well become more agile, and yes, container orchestration, automation, and microservice re-factors can make developers more productive, but they rarely, if ever, fundamentally change or improve the application itself. So while cloud migration and containerization are important, what is a more impactful modernization that can truly move the needle of business outcomes?
For that, I believe it is necessary to take advantage of the wealth of data sources available and supercharge applications with new external data, then use this data to train machine learning models that make applications adapt with experience. This next generation of modernization injects predictive models into the application to predict a certain future outcome, so that the application can then take action accordingly.
Building Intelligent Apps Requires Intelligent Teams
But here’s the rub. When companies try to modernize with AI and ML, they often organize their teams poorly.
In fact, when you mention AI or ML to anyone on an IT application team, they immediately pivot you to the data science team or the data lake team. This is the first sign of a silo. That usually means the people who can manage large volumes of data and “do the math” of machine learning are sitting in their silos. They are away from the action — where the application interacts with customers, suppliers, employees, etc. They are one step removed from the business. Recently, when I met with the head of a data science team for an insurance company, they said the one thing holding back operationalizing their work was a lack of engagement from the application teams.
This has a profound negative impact on modernization. Some of this we recently talked about from a technology perspective (see our blog on What happened to Hadoop?).
But here I want to focus on people and process. How do silos affect modernization from a people and process perspective?
My view is that the status quo for injecting AI into an application is usually initiated by the AI team. They get their data from some data lake. They create a thesis and experiment with many models. Sometimes they create these models in a vacuum based on the data they have available. They run many permutations of features, algorithms, and parameters, and if done well, they measure the experiments properly with accuracy metrics that objectively assess how well the model predicts new examples in a test set. One of the best reads on experimental best practices for machine learning is Andrew Ng’s new book, Machine Learning Yearning.
But here is the punchline. The AI or data science team is ill-equipped to get the job done independently. They simply do not have enough deep knowledge about the business or the applications that will deploy the models to lead to production operations that deliver business outcomes. This is not a slight on data scientists at all. I’ve been one. But the secret sauce to a successful team is diversity. Data science is a team sport. Data scientists need to work side by side with people who know the business and the application. Here’s why.
The Right Team Can Create The Right Features
My observation over the years is that many ML problems do not have large numbers of training examples like image, sound, video and other signal processing problems do, and when that is the case, the predictive signal comes from data scientists wrangling the data to find really good attributes that produce some predictive signal. Usually, the data scientists are combining data elements in unique ways or, most importantly, aggregating data. These transformed data attributes are what data scientists call features, and together they form the feature vectors that are the input to supervised classification algorithms or unsupervised clustering algorithms. This entire process is called feature engineering and is — in my opinion — the critical success factor for practical ML projects that deal with corporate structured data.
Many data scientists write Medium articles about algorithms like decision trees, random forests, boosting algorithms, Bayesian algorithms or deep learning alternatives, and while these do have an impact on model precision, the most effective way to get better predictive signal is to get the right data. For example, RFM transformations are key — recency, frequency, and monetary value. This is the process of transforming transactional or behavioral data into how recently someone has transacted or visited, how frequently, and what their average spend (of time or money) is. In media personalization, companies have often used the fact that a particular user visited a particular site (like a luxury shoe brand) as a feature. But this is deceptive. It turns out that if you instead use the recency of a visit to that site (e.g., within 48 hours), you can get significantly better conversion on ads. You have to represent the right features to get a model to perform!
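To make the RFM idea concrete, here is a minimal sketch in plain Python. The customer IDs, timestamps, and amounts are made up for illustration; in practice this aggregation would run over a transaction table in a feature pipeline:

```python
from datetime import datetime

# Hypothetical transaction log: (customer_id, timestamp, amount)
transactions = [
    ("c1", datetime(2019, 8, 1), 120.0),
    ("c1", datetime(2019, 8, 4), 80.0),
    ("c2", datetime(2019, 6, 15), 300.0),
]

def rfm_features(transactions, now):
    """Aggregate raw transactions into recency/frequency/monetary features."""
    rollup = {}
    for cust, ts, amount in transactions:
        f = rollup.setdefault(cust, {"last": ts, "count": 0, "total": 0.0})
        f["last"] = max(f["last"], ts)   # most recent transaction
        f["count"] += 1                  # number of transactions
        f["total"] += amount             # total spend
    return {
        cust: {
            "recency_days": (now - f["last"]).days,
            "frequency": f["count"],
            "avg_spend": f["total"] / f["count"],
        }
        for cust, f in rollup.items()
    }

feats = rfm_features(transactions, now=datetime(2019, 8, 5))

# The binary "visited within 48 hours" feature falls out directly:
recent = {cust: f["recency_days"] <= 2 for cust, f in feats.items()}
```

The point of the sketch is that the features fed to the model are aggregates over raw events, not the raw events themselves — exactly the kind of transformation a data scientist and a domain expert design together.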
Recency is a simple example of a transformation. This summer, we were fortunate to have an abstract accepted to the international conference on multiple sclerosis (ECTRIMS) for work with our customer, Precision Innovative Network (PIN), which is networking together independent neurology clinics to create a population data platform on de-identified patients. The company will provide invaluable clinical research data to pharmaceutical companies, machine learning advisors, and the clinics themselves to predict the trajectory of disease. In this project, we formed an interdisciplinary team of data engineers to prepare the data, data scientists to run experiments, and one of the founding neurologists of PIN as the subject matter expert — Dr. Mark Gudesblatt. Dr. Gudesblatt was able to translate deep medical knowledge into language we could use to feature engineer. One example of a feature that we would never have thought of as data scientists is an aggregation of the number of cognitive domains (e.g., memory, attention, executive function, visuo-spatial) negatively impacted in a patient (as measured in standard deviations from the population mean). When disability crosses the networks of function, it is highly correlated with the trajectory of disease.
Our work with PIN exemplifies the value of collaboration among diverse teams. Initially, the data scientists, data engineers, and subject matter expert drive the project, while the application developers and business analysts will participate heavily in later phases.
Engage Your Application Developers Early
Application developers, of course, are a critical team constituency. Without them, you can never figure out how to inject the model into the business logic. They will help you answer important questions like:
- How will the application use the predictive scores to change business logic?
- How often will data be extracted from the application and used to retrain the model to keep it fresh and reflecting current conditions?
- When and how will retrained models replace older ones?
- How will the model’s behavior be monitored for accuracy?
- Will there be a need to revert to older models?
- How can the most up-to-date data be used as input to the model so that it is not making decisions on stale data?
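To make the replace-and-revert questions above concrete, here is a toy sketch of a model registry that hot-swaps a retrained model and rolls back when a monitoring spot-check flags a regression. The API and the lambda "models" are hypothetical; a production system would persist versions and track accuracy metrics over time:

```python
class ModelRegistry:
    """Toy registry: keeps a history of deployed models, newest last."""

    def __init__(self):
        self._versions = []

    def deploy(self, model):
        self._versions.append(model)

    @property
    def current(self):
        return self._versions[-1]

    def rollback(self):
        """Revert to the previous model if the newest one underperforms."""
        if len(self._versions) > 1:
            self._versions.pop()
        return self.current

registry = ModelRegistry()
registry.deploy(lambda x: x > 0.5)  # v1: current production model
registry.deploy(lambda x: False)    # v2: retrained model, actually worse

# Monitoring: spot-check a labeled example the model must get right.
if registry.current(0.7) is not True:
    registry.rollback()             # revert to the previous version
```

Application developers are the ones who know where this swap point lives in the business logic, which is why they need to be in the room from the start.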
Don’t wait until the end of a modeling project to bring the application developers in. Make them part of the project from the outset to ensure that all the operational details are considered and that as much latency as possible is removed from the entire process, so that the models accurately reflect the real world.
Create a Culture of Experimentation with a Feature Factory
I’m going to write about this more in a future post, but there may be an even more important point about ML projects than just including subject-matter experts and application developers as part of the data science and data engineering teams. And that point is that to do ML well, you have to create a culture of experimentation in the company, and you must realize that an ML project does not have a go-live date, so to speak, and a handoff to operations to keep it alive. It is an ongoing process of continuous experimentation. In fact, the team needs to stay engaged to create what I love to call a feature factory. The feature factory is continuously seeking new features that boost signal. Unfortunately, markets change, bad actors innovate, the climate changes, the competitors change, and so much more. What was the perfect feature vector at go-live might produce noise two months later, or worse, tomorrow. So the secret is to keep the diverse team intact, frequently evaluating the deployed models, and most importantly keeping them as productive as possible to experiment with new features.
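One way to picture the feature factory loop is as a recurring scoring pass over candidate features, keeping only those that add predictive signal. The sketch below uses a simple Pearson correlation against labels as the score; the feature names, data, and threshold are all illustrative, and real pipelines would use held-out model accuracy rather than raw correlation:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

labels = [0, 0, 1, 1, 1]  # e.g., did the user convert?
candidates = {
    "visited_luxury_site_ever": [1, 1, 1, 1, 1],   # no variance, no signal
    "visited_within_48h":       [0, 0, 1, 1, 0],
    "avg_spend":                [10, 12, 80, 95, 90],
}

# Score every candidate feature and keep those above a signal threshold.
scores = {name: pearson(vals, labels) for name, vals in candidates.items()}
selected = [name for name, s in scores.items() if abs(s) > 0.5]
```

Each pass through this loop is one experiment; the factory runs it continuously as the world drifts, retiring features that have gone quiet and promoting new ones.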
Commercial announcement (skip if this offends you): Our company makes a software platform that helps teams create this culture of experimentation with productive feature factories, enabling them to modernize custom applications with ML models. To see a demo of our ML Manager, click here.
In conclusion, when modernizing your custom applications, don’t stop at containerizing or migrating to the cloud. Inject intelligence with machine learning into the application with the goal of continuously improving business outcomes. Don’t create a data science organization per se. Create “modernization SWAT teams” with data engineers, data scientists, application developers and operators, subject matter experts and analysts. Adopt tools and processes that enable these teams to construct a culture of experimentation.