One-stop MLOps portal at LinkedIn
May 19, 2022
Co-authors: Eing Ong, Shannon Bain, and Daniel Qiu
What is MLOps?
Before we dive into our MLOps portal, let’s begin by defining MLOps (Machine Learning Operations). MLOps is the practice of running ML reliably and continuously by managing the full lifecycle (developing, improving, and maintaining) of AI models. Making this process repeatable, scalable, and collaborative for data scientists and AI engineers requires a structured, methodical approach that starts at problem exploration and runs through continuous integration and delivery.
At LinkedIn, we recognized this need back in 2017. We launched “Productive Machine Learning” (“Pro-ML” for short) to double the productivity of machine learning engineers while simultaneously standardizing the tools with a pluggable toolkit that we could leverage across LinkedIn. Pro-ML Workspace is one of the major components in Pro-ML, focusing on user interfaces and visualizations. In this blog post, we will deep dive into Pro-ML Workspace, our MLOps portal.
History of the Pro-ML Workspace
Pro-ML Workspace was launched in mid-2020. Today, it has become the one-stop portal where our AI engineers and data scientists can easily find and analyze their training runs; evaluate their AI models and data quality; and deploy and monitor machine learning models in production. We will start with an overview of the architecture behind Pro-ML Workspace before walking through the key user interfaces.
To visualize the entire ML lifecycle, infrastructure is needed to automatically track every step of the machine learning process. We created a data schema to capture complete, structured, and well-documented information detailing how machine learning models are produced. In particular, model lineage (the history of iterations and offline experimentations) provides the data necessary to reproduce results. This is the key to an auditable approach from which we can compare, track progress, improve, and learn.
Hence, we created AI metadata (AIM), the backbone infrastructure for Pro-ML Workspace. AIM provides auditability, traceability, and reproducibility for models. This improves AI productivity by organizing our work, deepens our understanding of each model, and makes AI at LinkedIn more transparent. Examples of this metadata include unique identifiers across the lifecycle, such as projects, training runs, artifacts, creation times, and operations performed.
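The AIM schema itself is not public, but the kind of record described above can be sketched as follows. All field and function names here are illustrative assumptions, not LinkedIn's actual API; the key idea is that each training run carries a unique identifier and a parent pointer, so lineage can be reconstructed by walking the chain of iterations.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid

# Hypothetical sketch of an AI metadata (AIM) training-run record.
@dataclass
class TrainingRun:
    project: str                                   # owning ML project
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    parent_run_id: Optional[str] = None            # lineage: run this one iterated on
    artifacts: list = field(default_factory=list)  # model files, metric reports, ...
    operations: list = field(default_factory=list) # steps performed (train, eval, ...)

def lineage(run: TrainingRun, index: dict) -> list:
    """Walk parent pointers to reconstruct the history of iterations."""
    chain = [run]
    while run.parent_run_id is not None:
        run = index[run.parent_run_id]
        chain.append(run)
    return chain
```

A parent-pointer chain like this is what makes results reproducible: given any run, the full sequence of experiments that led to it can be recovered and replayed.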
We leveraged LinkedIn’s open source Generalized Metadata Architecture (GMA) to rapidly build the AIM infrastructure to ingest, process, and serve the metadata. Using GMA allows machine learning infrastructure and tools across LinkedIn to persist their metadata against AIM’s well-defined, structured schema. AIM service APIs then allow clients to retrieve that data for visualization, debugging, and insights such as productivity metrics.
ML lifecycle in Pro-ML Workspace
Now that we have our critical AI metadata infrastructure in place, we can visualize the ML lifecycle in Pro-ML Workspace. We have launched most of the ML lifecycle phases, with remaining effort outlined in the ongoing work section.
Model authoring, creation, and evaluation
Model authoring and creation is one of the earliest phases of the model lifecycle (after feature generation and serving). We created a model-tracking structure that mirrors the ML training lifecycle with all the metadata collected in AI metadata. In Pro-ML Workspace, we visualize the steps and progress of the model training pipeline, directly linking to all generated artifacts in context (see the following image).
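The tracking structure described above can be illustrated with a minimal sketch: an ordered record of pipeline steps, each with a status and links to the artifacts it generated. The class and method names are assumptions for illustration, not LinkedIn's actual tracking API.

```python
# Minimal sketch of a model-tracking structure that mirrors a training
# pipeline: ordered steps, each with a status and linked artifacts.
class PipelineTracker:
    def __init__(self, pipeline_name: str):
        self.pipeline_name = pipeline_name
        self.steps = []  # ordered list of step records

    def record(self, step: str, status: str, artifacts=()):
        """Append one pipeline step with its outcome and artifact links."""
        self.steps.append(
            {"step": step, "status": status, "artifacts": list(artifacts)})

    def progress(self):
        """Return (succeeded steps, total steps) for a progress display."""
        done = sum(1 for s in self.steps if s["status"] == "succeeded")
        return done, len(self.steps)
```

A UI like the one shown in the image can then render `progress()` as a progress bar and link each step directly to its artifacts in context.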
Some of the artifacts generated during model training enable evaluation of a model’s performance relative to its business goals. We support a rich ecosystem of such evaluation analyses contributed by various AI teams, and we partner with those teams to build beautiful and informative visualizations for their analyzers, such as our work on the fairness model analyzer. Models can further be analyzed across different training runs.
In the following two diagrams, two models are selected to compare their applicable metrics. Because these examples are binary classification models, evaluation metrics such as AUC-ROC and AUC-PR, and visualizations such as ROC and PR curves and confusion matrices with actual distributions, are automatically provided. Modelers can further interact with these graphs: cropping and zooming into curves, and manually exploring thresholds to see the impact on positives and negatives.
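The metrics mentioned above are standard for binary classifiers and can be computed with scikit-learn. The data below is synthetic and the snippet is only illustrative of what the comparison views surface; it is not LinkedIn's evaluation code.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             roc_auc_score)

# Synthetic labels and model scores for a binary classifier.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

auc_roc = roc_auc_score(y_true, y_score)           # area under the ROC curve
auc_pr = average_precision_score(y_true, y_score)  # area under the PR curve

# Manually explore one threshold, as in the interactive graphs:
# scores at or above the threshold are predicted positive.
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, cols: predicted
```

Sweeping `threshold` and recomputing the confusion matrix is exactly the interaction the threshold-exploration view provides: each threshold trades false positives against false negatives along the same ROC curve.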
Model productionisation is the next major phase in the ML lifecycle. In Pro-ML Workspace, we simplify the process of advancing production-ready models to the next state, known as the ‘publishable’ state at LinkedIn. We created a visualization of the workflows to publish and deprecate models, and integrated with LinkedIn’s Centralized Release Tool (CRT) to guide users in deploying published models via CRT. The following graphic is the landing page where users can see the model status and take action to publish, review for publishing, or deprecate their model.
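The publish/deprecate workflow is, at its core, a state machine over model statuses. The sketch below assumes a plausible set of states around the ‘publishable’ state named in the post; apart from that one name, the states and transitions are illustrative guesses, not LinkedIn's actual workflow.

```python
# Hypothetical model-status state machine for the publish/deprecate
# workflow. Only 'publishable' is named in the post; the rest is assumed.
ALLOWED_TRANSITIONS = {
    "draft": {"in_review"},
    "in_review": {"publishable", "draft"},   # approve, or send back
    "publishable": {"published"},            # deployed via CRT
    "published": {"deprecated"},
    "deprecated": set(),                     # terminal state
}

def advance(state: str, target: str) -> str:
    """Move a model to a new status, rejecting invalid transitions."""
    if target not in ALLOWED_TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move from {state!r} to {target!r}")
    return target
```

Encoding the workflow as explicit transitions is what lets a UI show only the valid actions (publish, review, deprecate) for a model's current status.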
Once models are in production, they enter the online experimentation and continuous maintenance phases. Models are instrumented to enable health monitoring and sharing of their metadata to AI metadata. We focus our visuals on the key metrics that are of interest to AI teams, such as service latency, feature consistency, and drift monitoring, to direct users to LinkedIn tools (such as ThirdEye and InGraphs) for further analysis or troubleshooting if needed (see the following image). You can read more details about LinkedIn’s model health assurance platform in this blog.
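One common way to quantify the drift monitoring mentioned above is the Population Stability Index (PSI), which compares a feature's distribution at training time against what the model sees in production. This is a generic sketch of the technique, not LinkedIn's actual drift monitor.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline (training-time)
    sample and a serving-time sample of one feature."""
    # Bin edges come from the baseline so both samples share the same bins.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins to avoid log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A monitoring view can then alert when PSI for a feature crosses a threshold (a common rule of thumb treats values above ~0.25 as significant drift) and link out to tools like ThirdEye for deeper analysis.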
Ongoing work
While we have built the foundation to support the major phases of the ML lifecycle, we want to continue to improve the experience and productivity of our users. We identified three key areas to broaden our impact: feature exploration, assisted workflows, and notebook integration.
Feature exploration will help our users find, explore, and gain insights into the features within models. This will provide the foundation for feature generation, feature serving, and productionisation in our user interface, which is the last major phase left to build in our architecture diagram. In assisted workflows, we will build interfaces that automate common and mundane ML tasks, leading us toward continuous model training, management, and delivery. Examples of such automation include feature recommendations, dataset recommendations, anomaly detection, and model ramps/de-ramps.
Finally, we would like to enable an iterative ML development experience for our users (see LinkedIn’s DARWIN): allowing them to create, start, or continue where they left off in their training, as well as performing ad-hoc analysis of their data and models.
Acknowledgments
The content of this blog post is based on the work of the Relevance Explains and Internal Applications Designers teams at LinkedIn:
Relevance Explains team: Daniel Qiu, Helen Yang, Mingran Jing, DJ Kim, Wei Li, Wanli Wang, Jason Belmonti, Yunqing (Martin) Zheng, Ai Shi, Adam Casey, and Eing Ong
Internal Applications Designers team: Shannon Bain and Matt Valente
We’re also thankful to teams across Machine Learning Infra, Data & AI Platforms, Artificial Intelligence, and Data Science (names in alphabetical order) that we have worked closely with (and many more not listed).
Leadership: Christopher Garvey, Joshua Hartman, Kapil Surlaker, and Tanton Gibbs
AI & DS: Alexandre Patry, Declan Boyd, Dean Young, Heloise Logan, Kinjal Basu, Sakshi Jain, and Tie Wang
GMA team: Jyoti Wadhwani, Na Zhang, Yang Yang, and Woody Zhou
SRE team: Emmanuel Shiferaw and Daniel Lewis
Past key contributors: Yucheng Qian, Yazhou Cao, David Hoa, Jingyi Ouyang, Marius Seritan, Scarlet Nguyen, Kathleen Liang, Joel Young, Hossein Attar, and Ann Yan