Adaptive Applications — A Key Objective of Artificial Intelligence
January 11, 2017
By Monte Zweben, CEO and co-founder of Splice Machine
A key function of artificial intelligence (AI) is to make software applications adaptive, using machine learning to improve them based on the data that they process. The machine learning workflow typically involves data scientists, who perform model selection, and application developers, who deploy the selected models. Model selection is an iterative process of feature engineering, algorithm selection, and parameter tuning in an experimentation environment.
Feature engineering is the process of transforming raw data into a vector of attribute-value pairs that represent the underlying problem in the best way possible to improve the predictive accuracy of the models on unseen data. For example, consumer marketers often transform their raw order, ad, and clickstream data into recency, frequency, and monetary value (RFM) attributes. Knowing that a woman, 17-24, in the NY area went to a shoe website is useful, but knowing that she went 10 minutes ago, visits monthly, and buys $1500 worth of shoes annually adds far more signal to the model. This transformation process requires extensive aggregation and group-by queries. Often, feature engineering can improve the predictive accuracy of a machine learning model, perhaps even more than selecting the right algorithm with the best parameterization.
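As a rough illustration of the aggregation and group-by step, the sketch below computes RFM attributes from a handful of order events in plain Python. The record layout, the sample data, and the `rfm_features` helper are assumptions made for this example; a production system would more likely express the same logic as SQL group-by queries over much larger tables.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw order events: (customer_id, order_timestamp, order_amount_usd).
orders = [
    ("c1", datetime(2016, 12, 1), 120.0),
    ("c1", datetime(2017, 1, 10), 80.0),
    ("c2", datetime(2016, 6, 15), 300.0),
]


def rfm_features(orders, as_of):
    """Group orders by customer and compute recency, frequency, and monetary value."""
    grouped = defaultdict(list)
    for customer_id, timestamp, amount in orders:
        grouped[customer_id].append((timestamp, amount))

    features = {}
    for customer_id, rows in grouped.items():
        last_order = max(timestamp for timestamp, _ in rows)
        features[customer_id] = {
            "recency_days": (as_of - last_order).days,       # days since last order
            "frequency": len(rows),                          # number of orders
            "monetary": sum(amount for _, amount in rows),   # total spend
        }
    return features


features = rfm_features(orders, as_of=datetime(2017, 1, 11))
```

Each customer ends up with one small attribute-value vector, which is the form a model can consume directly.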
Machine learning libraries typically offer a variety of algorithms for each problem formulation. For example, classification tasks, in which you learn how to classify a new vector from a set of positive and negative training examples, can be performed by deep learning / neural network models, decision trees, Bayes classifiers, logistic regression, and support vector machines (SVMs). Each of these has a variety of parameters that can be tweaked. Data scientists perform model selection experiments by iteratively varying features, algorithms, and parameters to find the best predictor. Once the model is selected, the application developer must deploy it operationally. Deploying a model means that as new examples come in, the model can classify them, and the application can act on that classification. The application then gathers more data based on the outcomes of the classification, and the system re-trains itself.
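The experiment loop described above can be sketched in a few lines. This is a minimal example under stated assumptions, not a recommended setup: it uses a hand-written logistic regression trained by stochastic gradient descent, synthetic two-feature data, and a small grid over two parameters (learning rate and number of epochs). All names and values are made up for illustration; in practice a data scientist would also vary the features and the algorithm, usually with a machine learning library.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, learning_rate, epochs):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def accuracy(model, rows, labels):
    """Fraction of examples whose predicted class matches the label."""
    weights, bias = model
    hits = sum(
        (sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5) == bool(y)
        for x, y in zip(rows, labels)
    )
    return hits / len(rows)


# Synthetic data (an assumption for this sketch): label is 1 when x1 + x2 > 0.
rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x1 + x2 > 0 else 0 for x1, x2 in points]
train_x, valid_x = points[:150], points[150:]
train_y, valid_y = labels[:150], labels[150:]

# Try every parameter combination and score it on held-out validation data.
grid = [(lr, epochs) for lr in (0.01, 0.1, 1.0) for epochs in (5, 50)]
results = []
for lr, epochs in grid:
    model = train_logistic(train_x, train_y, lr, epochs)
    results.append((accuracy(model, valid_x, valid_y), lr, epochs))

best_accuracy, best_lr, best_epochs = max(results)
```

Scoring each candidate on data it was not trained on is what makes the comparison a fair estimate of accuracy on unseen examples.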
The machine learning workflow as described can be quite slow, especially when data must be transferred to separate systems and then processed by human data scientists. As a result, applications adapt with a lag of minutes, hours, or even days after the data was collected. For many modern adaptive applications, this process must be optimized to run in a real-time system. If a marketing application needs to score hundreds of models as part of a 100 ms decision loop, the model framework should be lightweight (e.g. logistic regression, not random forests with many trees). Sometimes special in-memory caching of feature evaluations is deployed, so that a feature computed once can be reused across many separate model evaluations. Efficient hash map data structures and avoiding string manipulation in the scoring path are also vital to optimizing model performance.
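One way to picture the feature cache is the sketch below, which scores several lightweight logistic regression models against a single incoming event and computes each feature at most once. The `FeatureCache` class, the feature names, and the model weights are all hypothetical, chosen only to show the reuse pattern.

```python
import math


class FeatureCache:
    """Computes each feature at most once per event, then reuses it across models."""

    def __init__(self, extractors, raw_event):
        self._extractors = extractors
        self._raw_event = raw_event
        self._values = {}
        self.computations = 0  # counts how many extractor calls were actually made

    def get(self, name):
        if name not in self._values:
            self._values[name] = self._extractors[name](self._raw_event)
            self.computations += 1
        return self._values[name]


def score(model, cache):
    """Logistic regression score: sigmoid of bias plus the weighted feature sum."""
    z = model["bias"] + sum(
        weight * cache.get(name) for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical feature extractors and models for one marketing decision.
extractors = {
    "recency_days": lambda event: float(event["days_since_last_visit"]),
    "visits_per_month": lambda event: float(event["visits_last_30_days"]),
}
models = [
    {"bias": 0.0, "weights": {"recency_days": -0.1, "visits_per_month": 0.5}},
    {"bias": -1.0, "weights": {"visits_per_month": 0.2}},
    {"bias": 0.5, "weights": {"recency_days": -0.3}},
]

event = {"days_since_last_visit": 2, "visits_last_30_days": 4}
cache = FeatureCache(extractors, event)
scores = [score(model, cache) for model in models]
```

Three models reference four feature values in total, but only two extractor calls run. Feature names are fixed dictionary keys that are looked up rather than built per request, which keeps string manipulation out of the scoring path.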
The computational requirements of such adaptive systems are largely analytical, and therefore require a compute engine that can scan, join, and aggregate billions or even trillions of records. What is often overlooked, however, is the data cleansing required for feature engineering. Raw data is often flawed: ZIP codes with six digits, latitude and longitude pairs that are not on land, duplicate customer records, and misspellings of enumerated types. Data scientists often need to search for these anomalies and correct them, which is much more efficient in engines that support short reads and writes. These systems are typically row-based, store their data in some kind of sorted order, and have indexes that can be searched efficiently. Adaptive systems therefore need a combination of compute engines to accomplish their tasks. Splice Machine is an open-source SQL RDBMS for exactly such hybrid workloads. See what Splice Machine can do to support Adaptive Applications for your business.
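The kinds of data cleansing checks mentioned above can be sketched as a simple pass over customer records. The record fields, the `VALID_TIERS` enumeration, and the sample rows are assumptions for illustration. The coordinate check here only verifies that values fall in the valid numeric range; detecting points that are not on land would need geographic reference data, which this sketch does not include.

```python
import re

VALID_TIERS = {"gold", "silver", "bronze"}   # hypothetical enumerated type
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")  # US ZIP or ZIP+4


def find_anomalies(records):
    """Return (record_id, problem) pairs for records that need correction."""
    anomalies = []
    seen = set()
    for record in records:
        record_id = record["id"]
        if not ZIP_PATTERN.fullmatch(record["zip"]):
            anomalies.append((record_id, "bad_zip"))
        if not (-90 <= record["lat"] <= 90 and -180 <= record["lon"] <= 180):
            anomalies.append((record_id, "bad_coordinates"))
        if record["tier"].strip().lower() not in VALID_TIERS:
            anomalies.append((record_id, "unknown_tier"))
        # Treat the same normalized name at the same ZIP code as a likely duplicate.
        key = (record["name"].strip().lower(), record["zip"])
        if key in seen:
            anomalies.append((record_id, "duplicate"))
        seen.add(key)
    return anomalies


records = [
    {"id": 1, "name": "Ann Lee", "zip": "10001", "lat": 40.7, "lon": -74.0, "tier": "gold"},
    {"id": 2, "name": "Bob Roy", "zip": "100012", "lat": 40.7, "lon": -74.0, "tier": "silver"},
    {"id": 3, "name": "Cy Dunn", "zip": "94105", "lat": 137.8, "lon": -122.4, "tier": "bronze"},
    {"id": 4, "name": "ann lee ", "zip": "10001", "lat": 40.7, "lon": -74.0, "tier": "glod"},
]
anomalies = find_anomalies(records)
```

Each flagged record would then be fixed with a targeted lookup and update by key, which is the short read and write pattern that row-based, indexed engines handle well.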