
Lambda Learner: Nearline learning on data streams

Co-authors: Kirill Talanine, Jeffrey D. Gee, Rohan Ramanath, Konstantin Salomatin, Gungor Polatkan, Onkar Dalal, and Deepak Kumar

Introduction

A common challenge for production machine learning systems is reacting to change. The world can change quickly, particularly on a social network. This can range from sweeping changes at the scale of the whole economy (like a pandemic) down to everyday occurrences, such as a dormant member returning to LinkedIn in search of a new job. It’s typically not enough to train and deploy a model just once; we need to keep it fresh, while balancing model complexity, training time, and computational costs. Ideally, the instant new data becomes available, we would incorporate it into the model.

In this post, we describe a system we call Lambda Learner, which we apply to the problem of predicting the click-through rate for Sponsored Content, LinkedIn’s native advertising format where paid content is mixed into collections of organic member-generated content. This is a very time-sensitive problem; a model needs to adapt to new advertisers, new campaigns, and member behavior signaling short-term intentions. We take inspiration from the concept of lambda architecture, which involves combining batch offline processing with processing on data streams. The latter is often called “nearline” because data items are processed in near real time (within seconds). With Lambda Learner, we leverage this paradigm to combine the stability of offline model training with the responsiveness of nearline incremental training on fresh data.

In an earlier blog post, we described an approach we use for retraining personalized models (in our hiring ecosystem), capable of remaining fresh when change is measured on the scale of hours or days. In this post, we extend this approach to adapt to change on the scale of seconds to minutes. For our target task of Sponsored Content click prediction, a very time-sensitive modeling problem, Lambda Learner outperforms conventional batch updates. While there are other approaches that seek to solve similar ML challenges, we believe Lambda Learner represents a novel combination of scalability and theoretical guarantees. You can take a deeper dive into Lambda Learner in our paper, which we are presenting at the 2021 KDD conference, or check out our open source code on GitHub.

Technical background

LinkedIn has had success employing a class of model called GAME (Generalized Additive Mixed Effects), in use cases such as job recommendation, learning course recommendation, and advertising/Sponsored Content. This approach enables personalization by combining a large global model with smaller per-key models trained on data specific to each key. The keys index subgroups of the whole dataset; they could be as fine-grained as per-member, or coarser, e.g., per-advertiser as a grouping of ads. The large global model learns to generalize over the whole dataset, while the smaller per-key models memorize patterns relevant to specific entities or groups, personalizing the global predictions. We have applied this modeling approach to many problems, including Member-Job matching.

Here, we focus on the problem of click prediction for Sponsored Content. The model can be summarized as follows:

logit(P(response | ad, viewer)) = f_global(x) + f_advertiser(x)

We predict the conditional probability of a response, given an ad and a viewer, as a sum of two components: a large f_global model, which is trained on the full dataset, and a small f_advertiser model, which differs for every advertiser and is trained only on engagement with that advertiser's ads. The f_advertiser component fine-tunes f_global's predictions, resulting in substantial lifts in modeling and business metrics.
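To make the additive structure concrete, here is a minimal scoring sketch in Python. It assumes both components are linear models over a shared feature vector x; the function and variable names are illustrative, and the production feature sets and model forms are described in the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_click_probability(x, global_coefs, per_key_coefs, advertiser_id):
    """Score one impression as a global logit plus a per-advertiser offset."""
    # Global component: generalizes over the whole dataset.
    logit = x @ global_coefs
    # Per-key component: personalizes the prediction for this advertiser.
    # New advertisers with no trained component fall back to the global score.
    advertiser_coefs = per_key_coefs.get(advertiser_id)
    if advertiser_coefs is not None:
        logit += x @ advertiser_coefs
    return sigmoid(logit)

Because the per-key term is a simple additive offset on the logit scale, a missing or stale advertiser component degrades gracefully to the global prediction.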

The global model is relatively insensitive to most changes in the task or the data. The per-key models, however, can be highly sensitive to rapid change: for example, a single user action, such as dismissing an ad, can strongly signal that similar ads should no longer be shown.

Previously, we described an approach capable of adapting to new data within hours. With Lambda Learner, we can do better, reacting within minutes or seconds.

Sources of change

There are many sources of change to which ML models need to adapt. Most models take the form y = f(x) + ε: a function f maps a datapoint x to a predicted value y, approximating the solution to a specific task with error ε.

One common class of change is “drift,” where the statistical properties of the data change, or the task itself changes. One example is spam detection, where a sudden botnet attack may result in a surge of spam: the distribution of data the model encounters has changed, while the actual task hasn’t. As another example, imagine that our model predicts job matches for software engineers and a new skill comes into high demand. Knowing this skill is now more important for candidates, so predictions should change accordingly.

There are also a range of challenges arising from missing or stale data. Imagine a new ad campaign launches, and we don’t yet have any behavioral data about how members will react to it. This is an example of a cold-start problem. More generally, behavior can suddenly change, making old data stale. For example, if someone moves and is seeking work in a new locale, job recommendations may become stale until we can incorporate this new information.

These changes are constantly occurring at multiple scales. Over a span of years, structural changes in the economy may result in more job seekers in one industry and fewer in another. Over a span of months, typical job requirements change in fast-moving industries as new skills come into demand. Over a span of weeks, jobs saturate with applicants. Over a span of days, a manager searches for a new individual contributor role. Over a span of hours, a new member joins LinkedIn and starts a job search. Over a span of minutes or seconds, a buyer reads an article that suggests a purchase intent.

A spectrum of AI model freshness

We can think of model retraining approaches as a hierarchy:

  • Level 0: Train the model once and never retrain it. This is appropriate for “stationary” problems.

  • Level 1 (“cold-start retraining”): Periodically retrain the whole model on a batch dataset.

  • Level 2 (“warm-start retraining”): In addition to Level 1, if the model has personalized per-key components, retrain just these in bulk on data specific to each key (e.g., all impressions of an advertiser's ads), once enough data has accumulated.

  • Level 3 (“nearline retraining”): In addition to Level 2, retrain per-key components individually and asynchronously nearline on streaming data.

These levels build upon each other. Depending on how fresh a model needs to be to perform well at its task, we may elect to stay at Level 0 for a stationary problem, all the way through to Level 3 for problems where incorporating new data within seconds makes a difference; this is what Lambda Learner does.


Figure 1. A hierarchy of model retraining.

If we arrange all of the model updates in Level 3 on a timeline, as depicted in Figure 1, we’ll see three types of model update occurring. Occasionally, a Level 1 update will reset the whole model (both global model and all per-key components); this will be a batch offline update using a large accumulated dataset. More frequently, Level 2 updates will reset just the per-key components; this again will be an offline batch update, but won’t touch the large global model. Finally, “lightweight” Level 3 updates will occur almost continuously; any individual per-key component is tuned as soon as there is enough data to do so.

Solution overview

Our original system was capable of Level 1 and Level 2 retraining. All retraining occurred offline, and the soonest we could expect to incorporate new data into the model served in production was a matter of hours. In the case of Sponsored Content, the model is deployed to an online scoring service, which scores potential ad impressions on demand. The outcomes of the served ads (click or no click) are gathered and joined with feature data to accumulate offline training datasets. The freshest data is used to retrain all components of the model, with the whole model being fully retrained occasionally on the full dataset (Level 1), and with the per-key components being retrained more frequently on per-key data (Level 2).

Our extended system with Lambda Learner incorporates nearline data generation and nearline per-key model retraining to add Level 3 to this existing foundation. To produce data nearline, we begin in the online scoring service. Every time an ad impression is scored, we take a snapshot of the features used by the model and emit it to a Kafka topic. Subsequently, when the ad is served, any click is also emitted to Kafka. By joining these two Kafka topics, we produce fully featured, labeled training examples, available as a third Kafka topic.

Finally, we stream these training examples into a Samza Beam processor, which groups them into mini-batches for each key, and then retrains the corresponding per-key model component whenever a mini-batch accumulates enough examples. Mini-batches could be as small as a single example (online learning), but this can be unstable. In general, mini-batch size presents a tradeoff between stability of learning and responsiveness to new data. The size can be tuned to the application, and even adjusted dynamically as factors such as data velocity and variance change over time. The keys we use here index advertisers, and for the most active advertisers, mini-batches of hundreds of examples accumulate within seconds, allowing for fast and stable model updates.
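The grouping-and-triggering logic can be pictured with a simplified, in-memory sketch. The real system runs as a Samza Beam processor over Kafka topics; the class name and the retrain_fn callback below are illustrative assumptions, not the production API.

from collections import defaultdict

class MiniBatcher:
    """Accumulate labeled examples per key and trigger a per-key model
    update once a mini-batch is full. A simplified, in-memory stand-in
    for the Samza Beam processor described above."""

    def __init__(self, batch_size, retrain_fn):
        self.batch_size = batch_size  # stability vs. responsiveness tradeoff
        self.retrain_fn = retrain_fn  # hypothetical callback: (key, examples) -> None
        self.buffers = defaultdict(list)

    def on_example(self, key, features, label):
        buf = self.buffers[key]
        buf.append((features, label))
        if len(buf) >= self.batch_size:
            # Retrain only this key's component; all other keys are untouched.
            self.retrain_fn(key, buf)
            self.buffers[key] = []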


Figure 2. High-level Lambda Learner system design.

There are some interesting challenges involved in making such a system work well. The mini-batches are small and seen only once, which introduces the risk of high-variance model updates. We mitigate this by capturing an approximation of previous mini-batches in the form of a prior and applying a Bayesian updating approach. There are also systems challenges, such as ensuring serial model updates and data consistency when per-key model coefficients are stored in weakly consistent key-value stores that are updated asynchronously. These and other details are explored in our paper.
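For intuition, here is a sketch of one such Bayesian mini-batch update for a per-key logistic model, using a Laplace (quadratic) approximation so that the posterior after each mini-batch becomes the prior for the next. This illustrates the idea rather than reproducing the exact algorithm in the paper, and the names and defaults are assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bayesian_minibatch_update(mu, prec, X, y, steps=10):
    """One incremental Bayesian update of per-key logistic coefficients.

    The Gaussian prior N(mu, prec^-1) summarizes all previous mini-batches,
    regularizing the update so that a small, once-seen mini-batch cannot
    cause a high-variance swing.

    mu:   prior mean of the coefficients, shape (d,)
    prec: prior precision matrix (inverse covariance), shape (d, d)
    X, y: mini-batch features (n, d) and 0/1 labels (n,)
    """
    w = mu.copy()
    for _ in range(steps):  # Newton's method for the MAP estimate
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + prec @ (w - mu)
        H = X.T @ (X * (p * (1 - p))[:, None]) + prec
        w -= np.linalg.solve(H, grad)
    # Laplace approximation: posterior precision is the Hessian at the MAP.
    p = sigmoid(X @ w)
    post_prec = X.T @ (X * (p * (1 - p))[:, None]) + prec
    return w, post_prec  # becomes the prior for the next mini-batch

Because the precision grows as mini-batches accumulate, each update is pulled increasingly toward what the model has already learned, which is what keeps these small, sequential updates stable.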

Applications and impact

In a time-sensitive application, a model that is deployed and not updated will experience a decline in performance over time. Given this, we had two main hypotheses about using nearline updates. Firstly, we expected that a model updated nearline would resist performance degradation, at least in the short term. Secondly, we expected that over longer periods, errors in the nearline updates would compound, eventually degrading performance.

We tested these hypotheses on a time-sensitive dataset of Sponsored Content interactions. We found that nearline updates were able to outperform several real-world batch update methods, coming close to a theoretically ideal update scheme. Moreover, we discovered that performance of a model using nearline updates was relatively stable over periods as long as 60 hours without a full batch update.

In online experiments within LinkedIn’s ad business, new advertisers saw improvements in member engagement (1.76%) and ROI (reduced cost-per-click by 0.55%), and the platform saw an improvement in revenue (2.59%). Existing advertisers also saw modest improvements across these key metrics, resulting in an overall site-wide improvement in member engagement and platform revenue without hurting advertiser ROI.

While we have only applied Lambda Learner to Sponsored Content at present, this solution has the potential to improve metrics for any time-sensitive ML application using a GAME-like modeling approach.

Acknowledgements

Special thanks to Hai Lu, Daniel Chen, Yunbo Ouyang, and Jun Shi for their invaluable help with nearline infrastructure and open sourcing.