Model health assurance platform at LinkedIn
July 13, 2021
At LinkedIn, we leverage AI to provide a world-class product experience to our customers and members. Over the last two years, we moved to a centralized ML platform called Pro-ML, which now hosts hundreds of AI models running in production. The goal of Pro-ML is to significantly increase the productivity of AI engineers by giving them the best tools and services in an opinionated manner. Health assurance (HA) is a key component of this platform.
In this blog post, we’ll describe the health assurance component of Pro-ML, which is an effort to platformize the provisioning of systems and tools which ensure the health of ML models.
But before going into those details, let’s explain a few terms related to Pro-ML:
Step: A logical unit of work that takes well-defined inputs from previous steps and produces well-defined outputs to be used by other steps.
Pipeline: A data flow for the offline training and subsequent publishing of an ML model by connecting several steps.
Model group: A set of published ML models intended to be used to solve a particular problem. A service depends on a particular model group to do inference.
AI metadata (AIM): Metadata around all pipeline executions, such as the topology of each pipeline at the time of execution, the configuration parameters of each step, and the artifacts created by each step run.
Workspace: This is the user interface for AI engineers backed by the AIM store. It is used for managing Pro-ML platform activities and processes, including model training, data/model analysis, etc.
Introduction to health assurance
Health assurance is a platform-level initiative that gives AI engineers tools and systems to identify issues with productionized models faster and, in many cases, to detect the symptoms and causes of underperforming models as early as possible. For example, a system failure caused by a model can be observed in a dark canary environment, before the model is ramped to production, as part of the health assurance process.
The core idea behind health assurance is to instrument and observe a variety of indicators, such as system characteristics of the hosts running the model, model-specific metrics like model inference latency, or data issues like drift in feature values and prediction values, and provide these in a platformized experience. Before the health assurance component of Pro-ML existed, individual teams at LinkedIn had to develop their own systems and tools, which took extra effort from AI engineers and decreased their productivity.
Monitoring in ML model lifecycle
The below diagram provides a bird’s-eye view of the overall model development lifecycle. It has two major components: the offline pipeline, which is used to train a model; and an inference system, where the model runs in a production environment.
Image shows a typical model’s lifecycle at LinkedIn
HA provides the ability to monitor the health of ML models during both the training phase in the offline pipeline and the inference phase while serving live traffic.
Several problems can occur during these two phases, and HA monitors for them.
One potential problem can occur in production, where a model’s performance can degrade if the production data diverge from the data used to train the model. This can happen for various reasons, including:
The distribution of the input data and/or output variable seen by the model in production may change over time and deviate from the distribution seen in the training dataset. This may happen due to a gradual change in customer behavior and/or the business environment.
A change in data being fed into models due to errors in one of the many upstream data pipelines or systems (which often span multiple teams), on which the models depend.
The code paths that generate input features may differ at training and inference time.
The training data set may not be a good representation of production data due to improper curation.
Another potential problem is that the model serving system in production, like any software system, can underperform and miss performance SLAs in areas like latency and number of queries served.
HA is an important part of maintaining a healthy AI ecosystem because offline testing, however extensive, cannot by itself guarantee that a model will perform well in live production settings. This is true for a variety of reasons, including potential limitations of training data that make it unable to represent full production data, or a mismatch between the model artifact and the production serving infrastructure. Testing models via a dark canary before upgrading to production is part of our HA process that helps to mitigate this issue.
Some of the other potential problems that can arise during the lifecycle of an ML model are explained in detail in this paper. Let’s now get into the details of how health assurance addresses these challenges.
High level design of inference system
Image shows a few other components in Pro-ML platform, along with HA, in a typical inference system
The inference application contains a health assurance component, which takes care of generating real-time feature distributions and offline data drift metrics. To measure data drift, the feature values computed at inference time are tracked, and a daily job computes statistics on them and pushes them to Pinot. Pinot then pushes these statistics to our in-house alerting and monitoring system, ThirdEye. Whenever there is a significant change in data distribution, ThirdEye alerts the user.
Online real-time feature distribution monitoring captures system metrics for models, such as the number of requests per second and the latencies of these requests. This pipeline also captures the real-time distribution of numeric feature values. The feature distribution metrics are aggregated in the metrics aggregator library using our HA agent and are passed to TSDS (time series data system). These results are shown in LinkedIn’s internal real-time monitoring tool, InGraphs.
Data drift monitoring
HA monitors drift in prediction variables and input features of a model. It measures and displays changes in the distribution of variables by letting users choose any two time periods to compare, from training time to the latest inference time.
One of the graphs below shows the timeline of average feature drift across all features in a model. The other graph categorizes and reports feature drift according to feature importance.
Image shows feature drift for input variables
Similarly, the diagram below shows the distribution of a prediction variable over time; the coloring represents how the distribution of the output feature has changed with time.
Image shows feature drift of output variable
AI engineers are alerted when a significant drift is observed either in model output variables or input feature variables.
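The post does not say which statistic HA uses to quantify drift. As a minimal sketch, assuming the population stability index (PSI) as a stand-in, the comparison between two time periods for one numeric feature could look like the following. The function name `psi`, the bin count, and `DRIFT_THRESHOLD` are all hypothetical choices for illustration, not values taken from HA.

```python
import math
from bisect import bisect_right

# Hypothetical alerting threshold; HA's real threshold is not given in the post.
DRIFT_THRESHOLD = 0.25


def psi(baseline, current, bins=10, eps=1e-6):
    """Population stability index between two samples of one numeric feature.

    PSI is an assumed stand-in metric, not necessarily what HA computes.
    """
    ordered = sorted(baseline)
    # Bin edges are interior quantiles of the baseline sample, so each bin
    # holds roughly 1/bins of the baseline values.
    edges = [ordered[(len(ordered) * i) // bins] for i in range(1, bins)]

    def fractions(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    p = fractions(baseline)
    q = fractions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def has_drifted(baseline, current):
    """True when the assumed PSI score exceeds the hypothetical threshold."""
    return psi(baseline, current) > DRIFT_THRESHOLD
```

Here `baseline` could be the training-time values and `current` the most recent inference-time values, mirroring the choice of any two time periods described above.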
Real-time feature distribution monitoring
HA captures inference-time feature values for the numeric features and computes quantiles from them. Every minute, we send a few sample quantiles for each feature to InGraphs for visualization.
Image shows real-time feature distributions
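As a rough sketch of the per-minute quantile computation, assuming the feature values seen in one minute are already collected in a list, the snapshot for a single feature might be built as below. The function name `minute_snapshot` and the chosen percentiles are assumptions for illustration; the post only says that a few sample quantiles are sent each minute.

```python
from statistics import quantiles


def minute_snapshot(values, percentiles=(10, 50, 90)):
    """Summarise one minute of observed values for a numeric feature.

    Returns a mapping such as {"p10": ..., "p50": ..., "p90": ...}.
    """
    if len(values) < 2:
        # statistics.quantiles needs at least two data points.
        return {}
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return {f"p{p}": cuts[p - 1] for p in percentiles}
```

A production implementation would more likely use a streaming quantile sketch rather than holding every value in memory; the list-based version is only meant to show what is being emitted.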
AI engineers can use InGraphs of feature distributions for use cases like:
Evaluating models during the dark setting phase: Any inconsistency in feature distributions between training data and current data can be caught before the model goes live by comparing their distributions.
Evaluating models during the experimentation phase: This is the intermediate phase where a model is serving a very small amount of production traffic. Models are monitored with different business and technical metrics. In the event of a drop in those metrics, feature distribution can be used to explore root causes.
Monitoring models during the MME (Majority Member Experience) phase: This is the live production phase, with significant traffic. A sudden change in feature distribution can be captured and engineers can be alerted by InGraphs.
One interesting scenario we faced early in development was the challenge of too many metrics being collected. For illustration, let’s assume LinkedIn is running 1,000 models, with each model running on an average of 500 hosts across different regions. Furthermore, assume we are tracking 10 features per model and 5 metrics per feature. In this situation, we would have 1,000 × 500 × 10 × 5 = 25 million metric keys created in InGraphs.
To solve the metric bloat issue, we designed a library called “Metrics Aggregator.” The key observation behind this library is that, since we are capturing feature distribution at the model level, the host information is not relevant. This library aggregates the events from different hosts into one metric for each feature quantile by emitting events to Kafka periodically and then aggregating them in Samza. After this, the aggregated metrics are posted periodically to InGraphs.
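The effect of dropping the host dimension can be illustrated with a small in-memory sketch. It stands in for the Kafka and Samza stages with a plain function, and it assumes that per-host quantile values are combined by averaging, which the post does not specify (an average of per-host quantiles is only an approximation of the quantile of the pooled data). All names here are hypothetical.

```python
from collections import defaultdict


def metric_key_count(models, hosts_per_model, features_per_model, metrics_per_feature):
    """Number of distinct metric keys when every dimension is kept."""
    return models * hosts_per_model * features_per_model * metrics_per_feature


def aggregate_across_hosts(events):
    """Collapse per-host events into one value per (model, feature, quantile).

    events: iterable of (model, feature, quantile, host, value) tuples.
    Averaging is an assumed combination rule, chosen only for illustration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model, feature, quantile, _host, value in events:
        key = (model, feature, quantile)  # host is deliberately dropped
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# With the numbers from the example above:
per_host_keys = metric_key_count(1000, 500, 10, 5)  # 25,000,000 keys
model_level_keys = metric_key_count(1000, 1, 10, 5)  # 50,000 keys
```

Removing the host dimension shrinks the key space by a factor equal to the average number of hosts per model, 500 in this example.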
Model inference latency monitoring
Model inference latency is an important metric for application owners because it measures the total time the model takes to serve a particular scoring request. We typically monitor the mean and the 50th, 75th, 90th, and 99th percentile latencies.
These quantiles for latency can be used in multiple ways, such as:
Helping isolate the offending piece of a model within the entire lifecycle of a request. For example, in a typical search system, there are various phases like retrieval, ranking, response decoration, etc. If we capture the latencies that typically would be seen for the ranking phase, we can then easily identify which part is causing the change in latency.
Validating a model early in a test setting to understand whether it can work within the SLA bound.
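Both uses can be sketched with the standard library, assuming latency samples in milliseconds are available per request phase. The function names, the `tolerance` factor, and the phase names in the usage note are hypothetical and only illustrate the idea.

```python
from statistics import mean, quantiles


def latency_summary(latencies_ms):
    """Mean and 50th/75th/90th/99th percentile of a list of latencies."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {
        "mean": mean(latencies_ms),
        "p50": cuts[49],
        "p75": cuts[74],
        "p90": cuts[89],
        "p99": cuts[98],
    }


def phases_over_baseline(current, baseline, stat="p99", tolerance=1.2):
    """Return the phases whose chosen statistic grew beyond tolerance x baseline.

    current and baseline map a phase name to a list of latency samples.
    """
    return [
        phase
        for phase, samples in current.items()
        if latency_summary(samples)[stat]
        > tolerance * latency_summary(baseline[phase])[stat]
    ]


def within_sla(latencies_ms, sla_ms, stat="p99"):
    """Early validation: does the chosen statistic stay inside the SLA bound?"""
    return latency_summary(latencies_ms)[stat] <= sla_ms
```

For a search system, `current` and `baseline` might hold samples keyed by phases such as "retrieval" and "ranking"; a phase that appears in the result of `phases_over_baseline` is the likely source of a latency change.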
Onboarding to health assurance
We have tried to keep onboarding as simple as possible and made sure that it is configuration driven. A few capabilities, like model inference latency monitoring, are auto configured, whereas capabilities like which model features to track are configured by AI engineers. Currently, they provide these configurations in the model training pipeline. In the future, we plan to move these configurations to their workspace UI.
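The post does not show what these configurations look like. Purely as an illustration of a configuration-driven setup, a hypothetical config object might resemble the sketch below; every field name, default, and validation rule is an assumption and does not reflect the actual Pro-ML schema.

```python
from dataclasses import dataclass, field


@dataclass
class HealthAssuranceConfig:
    """Hypothetical per-model HA settings supplied in the training pipeline."""

    model_group: str
    # Features the AI engineer chooses to track (user-configured per the text).
    tracked_features: list = field(default_factory=list)
    # Assumed set of percentiles to emit for each tracked feature.
    feature_percentiles: tuple = (10, 50, 90)

    def validate(self):
        """Raise ValueError when the configuration is unusable."""
        if not self.tracked_features:
            raise ValueError("at least one feature must be tracked")
        if len(set(self.tracked_features)) != len(self.tracked_features):
            raise ValueError("tracked features must be unique")
        if any(not 0 < p < 100 for p in self.feature_percentiles):
            raise ValueError("percentiles must lie strictly between 0 and 100")
```

Latency monitoring is left out of the sketch on purpose, since the text describes it as auto-configured rather than user-supplied.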
Health assurance for AI models is an evolving area in the AI industry. We are excited to be at the forefront of this field and to help the AI community by sharing what we have learned along the way. Our HA platform is still in the development phase, and we have already been able to identify major issues in some of our models. For example, one team was able to detect that some of their feature values were showing up as zero, while another team reported unexpectedly large feature values in their dataset. We are still developing several useful features and plan to share more capabilities of our HA component as they are developed.
There are far too many individuals helping to make Health Assurance a success to list here, but we would like to call out a few of our key engineers.
These include Ameya Karve, Ashish Bhutani, Anirban Ghosh, Radhika Sharma, Karan Goyal, Aishwarya Netam, Girish Doddaballapur, Raghu V, Ankit Dhankhar, Akshay Uppal, Ajinkya Rasane, Ruoying Wang, Zhentao Xu, Sam Gong, and Helen Yang. We would also like to call out engineering senior leadership who are supporting us and making this possible: Tie Wang, Eing Ong, Shannon Bain, Niranjan Balasubramanian, Hossein Attar, David Hoa, Josh Hartman, Zheng Li, Priyanka Gariba and Romer Rosales.