Scaling Machine Learning Productivity at LinkedIn

Joel Young

January 3, 2019

Co-authors: Joel Young, Bee-Chung Chen, Bo Long, Marius Seritan, and Priyanka Gariba

The rate at which artificial intelligence (AI) knowledge is being disseminated and the rate of change in applied AI technologies show no signs of slowing down. Today’s software engineers are increasingly learning machine learning (ML) strategies as undergraduate students, and hardly a day goes by without the announcement of a new ML technique or framework. In this post, we’ll talk about our approach at LinkedIn to scaling AI and ML systems so that more engineers can take advantage of these techniques.

Introducing Pro-ML

For the past decade, LinkedIn has embraced AI across our product lines—from anti-abuse anomaly detection to career recommendations to feed curation, we use AI in a variety of ways that improve the experience for members and customers. We have built many bespoke ML systems for highly performance-sensitive products that are critical to the member experience.

However, this approach doesn’t always scale efficiently. Each AI stack was built by a separate team, with little sharing between them. Additionally, custom workflows add complexity when onboarding new engineers, new features, and new modeling technologies. In recent years, we became acutely aware that these systems make it difficult for non-AI engineers to build, train, and run their own models.

In August 2017, we began a new program at LinkedIn called “Productive Machine Learning” (“Pro-ML” for short). The goal of Pro-ML is to double the effectiveness of machine learning engineers while simultaneously opening the tools for AI and modeling to engineers from across the LinkedIn stack.

As we mapped out the effort, we kept a set of key ideas in place to constrain the solution space and focus our efforts.

We will leverage and improve best-of-breed components from our existing code base to the maximum extent feasible. We are unlikely to rewrite our entire tech stack, but any particular component is fair game.
The state of the art is constantly evolving with new algorithms and open source frameworks—we need to be flexible to support our existing major ML algorithms as well as new ones that will emerge.
We will use an agile-inspired strategy so that each step we take delivers value by making at least one product line better or by providing generally usable improvements to existing components.
The ability to run the models in real-time is as important as the ability to author or train them. The services hosting the models must be able to be independently upgraded without breaking their downstream or upstream services.
New models, retrained models, and models using new technologies must be A/B testable in production.
We must build GDPR privacy requirements into every stage of the solution.
We should design to avoid known anti-patterns such as those identified in prior research into machine learning systems and tech debt.

We began with a review of our assets, counting hundreds of relevance services and several core learning technologies including tree ensembles, generalized additive mixture ensembles, and deep learning. We then broke down our efforts into a set of layers.

The layers we focused on are:

Exploring and authoring
Training
Deploying
Running
Health assurance
Feature marketplace

The Pro-ML life cycle

Exploring and authoring
The modeling process starts with exploring the problem space, the features, and the data, and then identifying a particular goal, for example, computing the probability of a member clicking on a job. You then select candidate ML algorithms and train a model to achieve that goal. The model is then evaluated (e.g., with k-fold cross-validation, area under the ROC curve, or F-scores), and the process is repeated. This often takes many attempts, as there are many hyperparameters to test, especially for deep models (e.g., number of layers, size of layers, types of convolutions).
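To make the evaluate-and-retry loop concrete, here is a minimal Python sketch of two of the pieces mentioned above: computing area under the ROC curve and enumerating a hyperparameter grid. The labels, scores, and search space are made-up illustrations, not LinkedIn data, and the function names are our own for this example.

```python
from itertools import product


def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored above a randomly chosen negative one
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hyperparameter_grid(space):
    """Yield every combination from a dict of {name: [candidate values]}."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical click labels (1 = clicked) and model scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
example_auc = auc(labels, scores)  # 5 of the 6 positive/negative pairs are ordered correctly

# Hypothetical search space for a deep model: 2 x 3 = 6 configurations to try.
grid = list(hyperparameter_grid({"num_layers": [2, 4], "layer_size": [64, 128, 256]}))
```

The pairwise AUC computation is quadratic in the number of examples, which is fine for illustration; a production evaluation would use a rank-based implementation instead.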

We selected two strategies to address this. First, we built a domain-specific language (DSL) with IntelliJ bindings to capture the input features, their transformations, the ML algorithms employed, and the output results. Second, we are building a Jupyter notebook integration that allows step-by-step exploration of the data, selection of features, and drafting the DSL. It also allows you to tune the model parameters and drive the training.
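The Pro-ML DSL itself is not shown in this post, so the following Python sketch is only an illustration of the idea: a declarative specification that records input features, their transformations, the algorithm, and the output, and that can be checked before any training runs. Every name here (ModelSpec, FeatureSpec, TRANSFORMS, the feature names) is hypothetical.

```python
import math
from dataclasses import dataclass

# Named transformations a spec may reference. Illustrative only.
TRANSFORMS = {
    "identity": lambda x: x,
    "log1p": math.log1p,
}


@dataclass(frozen=True)
class FeatureSpec:
    name: str                     # input feature name
    transform: str = "identity"   # key into TRANSFORMS


@dataclass(frozen=True)
class ModelSpec:
    name: str
    algorithm: str     # e.g. "logistic_regression"
    features: tuple    # tuple of FeatureSpec
    output: str        # name of the produced score

    def validate(self):
        """Return a list of problems; an empty list means the spec is well-formed."""
        problems = []
        seen = set()
        for f in self.features:
            if f.transform not in TRANSFORMS:
                problems.append(f"unknown transform {f.transform!r} on {f.name!r}")
            if f.name in seen:
                problems.append(f"duplicate feature {f.name!r}")
            seen.add(f.name)
        return problems

    def transform_row(self, raw):
        """Apply each feature's declared transformation to one raw input row."""
        return {f.name: TRANSFORMS[f.transform](raw[f.name]) for f in self.features}


spec = ModelSpec(
    name="job_click_model",
    algorithm="logistic_regression",
    features=(FeatureSpec("title_match"), FeatureSpec("days_since_posted", "log1p")),
    output="p_click",
)
row = spec.transform_row({"title_match": 1.0, "days_since_posted": 0.0})
```

Keeping the specification as data rather than ad hoc code is what lets the same definition drive exploration in a notebook, training, and serving.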

Training
Some of LinkedIn’s data-driven features are very time-sensitive, and as such, they are computed mostly online (for example, recommending new connections). For most of our products, however, we still use offline training; some teams may train every couple of hours, while other teams have tens of models (or sub-components of a model) that are trained and retrained daily. We rely heavily on our Hadoop systems for offline training. ML developers use our Pro-ML unified training service, to which we continuously add newer model types and other tools like hyperparameter tuning. The training service is tightly integrated with the online serving and feature management ecosystems. This ensures that the same input files are used throughout the system and minimizes the risk of errors. The training service uses Azkaban and Spark to run the actual training. Once a model passes offline validation, the training library hands off the trained artifacts and metadata to the deployment system.

Deploying
Understanding deployment starts by defining what we mean by “ML artifacts.” We are interested in the identity, components, versioning, and dependencies relative to other artifacts in the system. For example, a model may have a global component in the tens of MB and member-specific components in the GB. Each of these may be created separately with its own version and have dependencies on code (libraries, services) and features. We then store this information in a central repository, where it is leveraged for automatic validation (e.g., are all the features available both offline and online?) and deployment. The target destination for an artifact may be a service, a key value store, or other infrastructure components. The deployment service provides orchestration, monitoring, and notification to ensure that the desired code and data artifacts are in sync. The deployment also ties with the experimentation platform to make sure that all active experiments have the required artifacts in the right targets in the overall system.
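As a rough illustration of the automatic validation described above, the following Python sketch models an artifact with an identity, a version, feature requirements, and dependencies, and checks it before deployment. The Artifact class, the validate_for_deployment function, and all names and versions are assumptions made for this example; they are not the actual Pro-ML repository schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str
    version: str
    required_features: frozenset
    depends_on: tuple = ()   # (name, version) pairs of other artifacts


def validate_for_deployment(artifact, offline_features, online_features, repository):
    """Return a list of human-readable errors; an empty list means deployable."""
    errors = []
    for feature in sorted(artifact.required_features):
        if feature not in offline_features:
            errors.append(f"feature {feature!r} missing offline")
        if feature not in online_features:
            errors.append(f"feature {feature!r} missing online")
    for dep in artifact.depends_on:
        if dep not in repository:
            errors.append(f"dependency {dep[0]}@{dep[1]} not in repository")
    return errors


# Hypothetical repository contents: a global component and a member-specific one.
repository = {("job_click_global", "3"), ("job_click_member", "17")}

model = Artifact(
    name="job_click_model",
    version="5",
    required_features=frozenset({"title_match", "days_since_posted"}),
    depends_on=(("job_click_global", "3"), ("job_click_member", "18")),
)

errors = validate_for_deployment(
    model,
    offline_features={"title_match", "days_since_posted"},
    online_features={"title_match"},
    repository=repository,
)
```

In this made-up case the check reports two problems: one feature is not available online, and the member-specific component is pinned to a version that has not been published.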

Running
Much of the excitement around AI focuses on the exploring and training steps. This isn’t enough for real systems, however. We need to be able to reliably, efficiently, and operably evaluate the models in production. This includes offline in systems such as Spark and Pig, nearline in Samza, online in REST services, and deep in our search stack. Historically, teams have written custom scorers in each environment, but this is labor intensive and error prone: it is too easy to have a small delta between the training and serving environments, leading to difficult-to-diagnose bugs. To address these overall challenges, we built a custom execution engine called Quasar for running the DSL discussed in the “Exploring and authoring” section above. The engine takes the features from the marketplace (see below) and the coefficients and DSL code from the model deployment system, and then applies the code to the data and coefficients. We have also built a higher-order declarative Java API (ReMix) for defining composable online workflows for query rewriting, feature integration, driving downstream recommendation engines, and blending the results. We are also building a distributed model serving system, driven by Quasar, to federate multiple inference engines, including various versions of TensorFlow Serving and XGBoost.
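The core idea of avoiding a delta between environments can be sketched in a few lines of Python: one scoring function takes features and coefficients, and both the batch (offline) and per-request (online) paths call it. This is a simplified illustration using logistic regression, not Quasar itself; the function names, coefficients, and the choice to treat missing features as 0.0 are assumptions for the example.

```python
import math


def score(features, coefficients, intercept=0.0):
    """Logistic regression score for one feature dict.

    Features absent from the input are treated as 0.0 (an assumption of
    this sketch; a real system would make that policy explicit).
    """
    z = intercept + sum(w * features.get(name, 0.0) for name, w in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


def score_batch(rows, coefficients, intercept=0.0):
    """Offline path: score many rows by reusing the exact online function."""
    return [score(row, coefficients, intercept) for row in rows]


# Hypothetical coefficients as they might arrive from a deployment system.
coefficients = {"title_match": 2.0, "days_since_posted": -0.5}
intercept = -1.0

request = {"title_match": 1.0, "days_since_posted": 2.0}
online_score = score(request, coefficients, intercept)            # z = -1 + 2 - 1 = 0
offline_scores = score_batch([request, {}], coefficients, intercept)
```

Because the batch path is a thin wrapper around the online function, any change to the scoring logic reaches both environments at once.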

Health assurance
The processes that produce and update ML artifacts are hard to test and monitor. The health assurance layer of Pro-ML consists of automated and on-demand services. The automated services ensure that the online and offline features (inputs to the model) are similar in a statistical sense. They also validate that the online model behavior is in sync with the expected behavior; for example, that the predicted score is in line with the expected precision from the offline training. If an anomaly is detected, the ML engineer can use the on-demand services to understand the source of the discrepancy. These services use replay, store, explore, and perturb techniques to further isolate the problem: is there a bug in the code, is data missing, or should the model simply be retrained?
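The post does not say which statistical test is used to compare online and offline features. One common choice for numeric features is the two-sample Kolmogorov-Smirnov statistic, sketched below in Python; the sample values and the alert threshold are invented for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0.0 = identical, 1.0 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the maximum gap.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical values of one feature logged offline and online.
offline_values = [0.1, 0.2, 0.3, 0.4]
online_values = [0.1, 0.2, 0.3, 0.9]

drift = ks_statistic(offline_values, online_values)

ALERT_THRESHOLD = 0.2   # assumed value; a real threshold depends on sample size
needs_investigation = drift > ALERT_THRESHOLD
```

A check like this only flags that the distributions differ; finding out why is the job of the on-demand tools described above.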

Feature marketplace
The output of a system is only as good as the data that goes in. Big Data is powering the current AI cycle, and managing it requires a dedicated system. We have tens of thousands of features that need to be produced, discovered, consumed, and monitored. At LinkedIn, we have Frame, a system to describe features both offline and online. Frame is used by both consumers and producers. We publish Frame’s metadata about the features in a centralized database/UI system, which is also connected to our Model Repository. This allows ML engineers to search for features based on various facets including the type of feature (numeric, categorical), statistical summary, and current usage in the overall ecosystem.
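Faceted search over feature metadata can be illustrated with a small Python sketch. This is not Frame's API or schema; the FeatureMeta fields, the search function, and the catalog entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMeta:
    name: str
    kind: str                 # "numeric" or "categorical"
    available_offline: bool
    available_online: bool
    used_by: tuple = ()       # names of models that consume this feature


def search(catalog, kind=None, online=None, min_consumers=0):
    """Filter the catalog by optional facets; None means 'do not filter'."""
    results = []
    for meta in catalog:
        if kind is not None and meta.kind != kind:
            continue
        if online is not None and meta.available_online != online:
            continue
        if len(meta.used_by) < min_consumers:
            continue
        results.append(meta)
    return sorted(results, key=lambda m: m.name)


# Hypothetical catalog entries.
catalog = [
    FeatureMeta("title_match", "numeric", True, True, ("job_click_model",)),
    FeatureMeta("member_industry", "categorical", True, True, ("job_click_model", "feed_model")),
    FeatureMeta("days_since_posted", "numeric", True, False),
]

numeric_online = [m.name for m in search(catalog, kind="numeric", online=True)]
widely_used = [m.name for m in search(catalog, min_consumers=2)]
```

Recording current consumers alongside each feature is also what makes it possible to see which models are affected when a feature changes.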

Organizational structure

How are we organizing the work of the AI teams at LinkedIn to help solve the problems outlined above (scale, resources, opportunity, etc.)? Historically, many engineering organizations have been very hierarchical: you have managers, managers reporting to those managers, and then teams of engineers. This isn’t how we have structured the Pro-ML initiative.

After a decade of rapid progress and experimentation within the Data organization at LinkedIn, we arrived at an organizational model that closely aligns AI teams with product teams, but maintains the reporting relationship to the parent AI organization. This ensures that researchers can collaborate and share best practices with fellow experts who are working to solve similar hard problems, while still having a dedicated ML team under the product “chain-of-command” that we are supporting.

Organizing the Pro-ML teams

Similarly, the team behind Pro-ML has been organized around five main pillars, each of which supports one of the stages of the model development life cycle. Typically, each of the pillars has a lead (usually an engineer), a tech lead, and several engineers. Just like with our embedded AI teams across LinkedIn’s business lines and practice areas, these engineers come from across the organization, including product engineering, our foundation/tools organization, and infrastructure teams. The Pro-ML team is distributed across the world, and includes engineers in Bangalore, Europe, and in multiple locations in the United States. We also have a leadership team that helps set the vision for the project and (most importantly) works to eliminate friction so that each of our pillars can stand on its own.

We are now more than one year into our effort to transform artificial intelligence at LinkedIn to make it scale across all of engineering—keeping it fast, efficient, and operable.

Conclusion

Just as software has taken over the world, artificial intelligence is taking over software. AI techniques are finding uses everywhere in software engineering, from detecting fake members to mapping out career paths. Similarly, we are making investments not only in new AI research and in developing the AI skills of our employees, but also in initiatives like Pro-ML that increase the productivity of our engineers.

Pro-ML will increase the number of products that can take advantage of AI and expand the number of teams that are able to train and deploy models. Additionally, it will reduce the time needed for model selection, deployment, etc., and provide automation in key areas like health assurance. Finally, it gives our people more time to do what they do best: finding creative solutions to hard technical problems, using LinkedIn’s unique and highly structured dataset.

Learn more about Pro-ML and AI at LinkedIn

Joel Young and Bo Long presented the strategy behind designing Pro-ML at the Strange Loop conference in St. Louis last fall—you can watch the video above to see their talk.

You can also check out this interview with Pro-ML visionary Bee-Chung Chen on the This Week in Machine Learning & AI podcast.

And finally, to stay up-to-date on our progress, follow the #proml hashtag on LinkedIn!

For more on AI at LinkedIn:

Check out the LinkedIn Engineering Blog post by our Pro-ML sponsor, Deepak Agarwal, here.
For more information, check out our other content on the LinkedIn Engineering Blog.
We are also active in the research community. You can find a collection of tutorials and papers at KDD 2018 and KDD 2017.

Acknowledgments

There are far too many engineers, TPMs, and product managers who are helping make Pro-ML a success to list here, but we would like to call out a few of our key partners. These include Rushi Bhatt, Sughosh P K, Steven Ihde, and Vasanth Rajamani. We’d also like to call out the engineering senior leadership making this possible: Igor Perisic, Swee Lim, Josh Walker, Deepak Agarwal, Dan Grillo, Scott Holmes, Jeff Galdes, and Kapil Surlaker.

Topics: Developer Experience/Productivity, Artificial Intelligence, Data, Machine Learning