Whose Job Is It Anyway?
How to delineate responsibilities in a data science team
Monte Zweben and Morgan Sweeney
April 23, 2021
Big Data and machine learning are being adopted by businesses around the world. Yet this new technology brings new organizational problems: most hiring managers don’t really know the difference between a data scientist and a data engineer, and CIOs face too many choices about which software to license for their new employees.
In some contexts, this means that a single data scientist is hired to do all the data management work of an entire team, essentially overseeing the machine learning and data lifecycle end to end. That one person is expected to write ETL pipelines, build and train models, and figure out how to integrate those models into applications, often without proper training in all of these areas.
In other contexts, there’s a data engineer, a data scientist, and an ML engineer, but there’s not a clear demarcation between where one person’s responsibility ends and another takes over. We’ll start with a quick summary of the difference between different data workers, and then explain how new technology can help integrate these individual roles into a unified team.
Data Engineer: Data engineers are responsible for building and maintaining an organization’s data infrastructure. While this can depend on the organization, this typically includes databases, data warehouses, and data processing pipelines. An important part of a data engineer’s job is the transformation of raw data into a format that is useful for analysis. This means taking raw, unstructured data from different data sources (such as machine sensors or marketing tools) and cleaning, organizing, and processing it so it can be stored in a data warehouse to be analyzed.
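To make this concrete, here is a minimal sketch of the kind of transformation step a data engineer might write. The record fields (`sensor_id`, `ts`, `temp_c`) and the function name are hypothetical, chosen only to illustrate cleaning raw sensor readings into warehouse-ready rows:

```python
from datetime import datetime

def clean_sensor_records(raw_records):
    """Transform raw sensor readings into warehouse-ready rows.

    Drops incomplete records, parses timestamps, and coerces
    readings to numeric types so the warehouse can analyze them.
    """
    cleaned = []
    for rec in raw_records:
        if rec.get("temp_c") in (None, ""):
            continue  # discard readings with no measurement
        cleaned.append({
            "sensor_id": rec["sensor_id"],
            "observed_at": datetime.fromisoformat(rec["ts"]),
            "temp_c": float(rec["temp_c"]),
        })
    return cleaned

raw = [
    {"sensor_id": "s1", "ts": "2021-04-23T10:00:00", "temp_c": "21.5"},
    {"sensor_id": "s2", "ts": "2021-04-23T10:00:00", "temp_c": None},
]
rows = clean_sensor_records(raw)  # only the complete s1 record survives
```

In a real pipeline this logic would run inside an orchestration framework and write to a warehouse table rather than return a list, but the shape of the work (validate, parse, normalize) is the same.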
Data Scientist: In an ideal world, the data scientist’s job is to take this data, analyze it in a bunch of different ways, and build models that draw conclusions for the business. In order to do this, they have to identify the problem they’re looking to solve, format the data they have in terms of the variables they’re looking to explore, and analyze it to find patterns, which is often done through statistics and machine learning models. Data scientists search for, clean, and validate data, along with feature engineering raw data points so they are interpretable by ML models. As data science is evolving, a big part of a data scientist’s job is using their models to inform decisions in the workflow of the enterprise.
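The feature-engineering step mentioned above can be sketched as a small function that turns one raw record into numeric inputs a model can interpret. The field names and statistics here are hypothetical examples, not a prescribed method:

```python
def engineer_features(row, mean_spend, std_spend):
    """Turn one raw customer record into numeric model features.

    Assumes mean_spend and std_spend were computed over the
    training population so the same scaling applies everywhere.
    """
    return [
        (row["monthly_spend"] - mean_spend) / std_spend,  # standardized spend
        1.0 if row["plan"] == "premium" else 0.0,         # one-hot plan flag
        float(row["support_tickets"] > 0),                # any-ticket indicator
    ]

customer = {"monthly_spend": 120.0, "plan": "premium", "support_tickets": 2}
features = engineer_features(customer, mean_spend=100.0, std_spend=20.0)
```

Note that the scaling constants must be shared between training and serving, which is exactly the kind of consistency problem discussed later in this post.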
ML Engineer: ML engineers take the models data scientists have developed and put them into end-user applications. This means building software that can serve ML / AI features on a website or mobile application, or perhaps in a pipeline or batch application. Maintaining and improving machine learning infrastructure is another part of the job, along with evaluating and improving model performance perhaps by retraining. In some cases, ML engineers also use models to automate key tasks for the company.
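As a rough sketch of the serving side of this job, an ML engineer might wrap a trained model in a small service layer that tags each prediction with a model version and counts requests for monitoring. The class and version label below are invented for illustration:

```python
class ModelServer:
    """Minimal in-process wrapper for serving a trained model.

    Tags every response with the model version and tracks request
    volume, so errors can be traced back to a specific model.
    """
    def __init__(self, model, version):
        self.model = model
        self.version = version
        self.request_count = 0

    def predict(self, features):
        self.request_count += 1
        return {"version": self.version,
                "prediction": self.model(features)}

# Stand-in for a trained model: any callable mapping features to a label.
churn_model = lambda f: 1 if f[0] > 0.5 else 0
server = ModelServer(churn_model, version="v3")
result = server.predict([0.7])
```

A production version would sit behind an HTTP endpoint and log to a metrics system, but the principle of versioned, observable predictions is the same.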
That said, it’s important to acknowledge that each company will define these roles’ responsibilities differently, and the lines between them can blur. One thing everyone can agree on is that implementing ML successfully has to be a collaborative process involving all of these roles.
Changes in Collaboration
But collaboration is easier said than done, especially with a virtual workforce. Slacking a data engineer for the fourth time that day about pipeline updates and checking spreadsheets for the most recent model version is a nuisance, and important jobs can slip through the cracks. If the model makes a mistake, whose job is it to check? How can an ML engineer tell where the error is when the model has been trained 50 times on 50 different feature sets by the data scientist? And if the data coming in causes a new problem, whose job is it to fix it?
Having a central location to store all the information related to every model in the organization would build a necessary bridge between the data engineer, data scientist, and ML engineer. On top of that, having a single place to check ML performance that is updated in real-time saves time and energy messaging code snippets and updates back and forth. Having a singular transparent repository would not only make internal monitoring easy, but could also allow government regulators to check data lineage and model performance with the utmost ease.
Enter the Feature Store
A feature store is a shareable repository of features made to automate the input, tracking, and governance of data into ML models. Feature stores compute and store features, enabling them to be registered, discovered, used, and shared across a company. A feature store makes sure features are always up to date for predictions and maintains the history of each feature’s values in a consistent manner, so that models can be seamlessly trained and re-trained.
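The core mechanics described above can be sketched with a toy in-memory class. This is a simplified illustration of the concept, not the API of any real feature store product; every name in it is hypothetical:

```python
class FeatureStore:
    """Toy feature store: registers features and tracks value history.

    Keeps the full history of each feature's values (for consistent
    training and re-training) alongside the latest value (for serving).
    """
    def __init__(self):
        self._features = {}  # feature name -> list of (timestamp, value)

    def register(self, name):
        """Make a feature discoverable and shareable across teams."""
        self._features.setdefault(name, [])

    def write(self, name, timestamp, value):
        self._features[name].append((timestamp, value))

    def latest(self, name):
        """Serving path: the most recent value, always up to date."""
        return max(self._features[name])[1]

    def history(self, name):
        """Training path: the consistent, complete value history."""
        return list(self._features[name])

store = FeatureStore()
store.register("avg_monthly_spend")
store.write("avg_monthly_spend", "2021-03", 10.0)
store.write("avg_monthly_spend", "2021-04", 12.0)
```

The point of the sketch is the split between `latest` and `history`: serving and training read from the same registered feature, which is what keeps online predictions and offline retraining consistent.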
By providing a single source of information, eliminating repetitive time drains, and enabling absolute transparency, feature stores let each member of a data science team focus on their job.
Companies that rely heavily on machine learning models, like Uber, Airbnb, and Spotify, have already built their own feature stores in order to scale their models effortlessly. Small companies have even more to gain: feature stores improve communication and productivity by automating key parts of the data lifecycle, so the few employees focused on data can be put to their best use.
Organizations are still learning how to implement data science into business practices, and how best to convince leadership of the value of a data science team. Feature stores offer a concrete ROI for data science teams by improving productivity and clarity across the organization.
For more information about how feature stores turn data into business value, check out this blog post.