Rapid experimentation through standardization: Typed AI features for LinkedIn’s feed

Ian Ackerman

Machine Learning Infra at Databricks

April 15, 2020

Serving the most relevant information for LinkedIn members in the homepage feed requires a massive effort—hundreds of features are used to personalize content for hundreds of millions of members. For each homepage visit, our machine learning models have to find and surface the best activity across a member's whole network, and they have to source that content from multiple online systems. For our members with over 5,000 connections, the amount of data processed quickly balloons, as we want to consider the likes, comments, and posts from each person in their network. Operating at this scale can make innovation challenging.

At the same time, focusing only on providing great content for today isn’t enough. The feed is the heart of the LinkedIn experience for millions of people who are looking for jobs, making hires, and trying to improve their skills. For this reason, our feed AI system has to allow for rapid experimentation and continuous improvements both in the training environment and production traffic in order to maximize member and customer utility.

Typed features to the rescue

Bringing in new member information through features is one of the core components of improving machine learning based recommendations. There is a constant tension between having the flexibility to quickly deploy new features and maintaining a typed, validated, efficient representation of features for reliability. The feed team is running hundreds of online experiments per month over a dozen new features. That work needs to run smoothly on top of the 100+ existing features being fetched, processed, and transmitted in parallel between multiple online systems to render the main feed for thousands of members per second who are opening their LinkedIn app.

To address this challenge, we’ve come up with a flexible type system built on top of tensors. This allows machine learning engineers to be expressive and clearly document what comprises their features, and to restrict their domain. It allows typed features to efficiently represent higher-level concepts in primitives such as integers. Since these typed features can be expressive, flexible, and efficient, we are able to use them throughout all of our systems via projects like Pro-ML. This removes the need to define multiple representations for features or convert between Hadoop file systems, databases, online services, and inference engines.

Old setup: Disparate formats and conversions

Old training and inference system, detailing the different types used in various parts of the system

In the past, the tradeoff between flexibility, stronger types, and efficiency had left different parts of our machine learning ecosystem with different setups. In our Spark training and inference pipeline (green in the diagram above), we focused on flexibility and development speed. In the training pipeline, machine learning engineers run hundreds of experiments on a daily basis across different features. Our old format utilized strings to store categoricals and a single value type of floats. While this system was simple to understand and build interfaces for, it did so at the cost of efficiency in representing integer counts, categoricals with known domains, and interacted features. Our inference-based format was also set up around this data type, again striving for universality at the cost of efficiency. Our HDFS feature snapshot (orange in the above diagram), on the other hand, was optimized for simple offline joins, was specific per feature rather than uniform, and did not take into account online systems. Often the optimizations for online feature storage were not unified with the offline snapshot due to the staggered development lifecycle.

At the scale of LinkedIn’s feed, passing around large feature datasets with redundant strings was error-prone and inefficient. To compensate, feature owners had to define efficient, per-feature schemas (blue in the above diagram) that used specific types to represent data. This approach had two main drawbacks:

It was time consuming to update all the relevant APIs/databases
It lacked flexibility when experimenting with features

While a good schema can both act as documentation and concretely describe the feature, it can be a burden to machine learning engineers. Our previous setup required updating the schemas of our databases and APIs of online services to accommodate the feature-specific format of each feature experiment. This required cascading changes across systems, schema reviews by multiple teams, and glue code to ensure that the custom schema was correctly translated back into the flexible inference format (green in diagram). This slowed down velocity of experimentation and impeded the freedom to quickly test out variations in features.

Tensor setup: Type system for features

To avoid the problems described above, we have transitioned to a new way of storing features throughout our systems in a single format of tensors with feature-specific metadata. Tensors are a standard format in many deep learning frameworks, such as TensorFlow, and allow for efficient linear algebra operations. In addition to tensors holding numbers, we have added feature-specific metadata of names, restricted domains, and vocabulary mappings of tokens to numerics. The new, comprehensive type information includes all primitives and higher-level concepts like categoricals and discrete counts. As this system is geared towards features specifically, rather than towards generic data as Avro is, machine learning engineers can more naturally model their features. This metadata is useful not just for defining feature types, but also for unifying documentation and system knowledge. In the previous way of operating, with different schemas in different systems, there was no single source of truth for what a feature was; what was stored in the database schema didn’t always map onto what was used in the model. Now, there is a repository that gives feature owners a central place to document and concretely define their features. The team uses this for links to project documentation, and we plan to expand it over time to link automatically to the experiments and systems that use each feature.
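To make the idea of typed dimensions with restricted domains more concrete, here is a minimal Python sketch. It is not the actual metadata format: the class names, the field names, the feature "exampleFeature," and its vocabulary are all hypothetical, and reading the two trailing numbers of a feature URN as a major and minor version is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Dimension:
    """One axis of a feature tensor (hypothetical structure)."""
    name: str
    vocabulary: Optional[Dict[str, int]] = None  # categorical: token -> index
    size: Optional[int] = None                   # discrete: valid indices are 0..size-1

    def __post_init__(self) -> None:
        # A dimension is either categorical or discrete, never both or neither.
        if (self.vocabulary is None) == (self.size is None):
            raise ValueError("set exactly one of vocabulary or size")

    def encode(self, raw: Union[str, int]) -> int:
        """Map a raw value to its integer index, rejecting out-of-domain values."""
        if self.vocabulary is not None:
            if raw not in self.vocabulary:
                raise ValueError(f"{raw!r} is not in the vocabulary of {self.name}")
            return self.vocabulary[raw]
        if not isinstance(raw, int) or not 0 <= raw < self.size:
            raise ValueError(f"{raw!r} is outside the domain of {self.name}")
        return raw


@dataclass(frozen=True)
class FeatureMetadata:
    """Name, version, and typed dimensions of one feature (hypothetical)."""
    namespace: str
    name: str
    major: int  # assumption: the URN's two trailing numbers are major, minor
    minor: int
    dimensions: Tuple[Dimension, ...]

    @property
    def urn(self) -> str:
        return f"urn:li:({self.namespace},{self.name},{self.major},{self.minor})"

    def encode(self, *raw_values: Union[str, int]) -> Tuple[int, ...]:
        """Turn one raw value per dimension into a tensor coordinate."""
        if len(raw_values) != len(self.dimensions):
            raise ValueError("expected one raw value per dimension")
        return tuple(d.encode(r) for d, r in zip(self.dimensions, raw_values))


# Made-up feature: one categorical dimension and one bounded discrete dimension.
feature = FeatureMetadata(
    namespace="flagship",
    name="exampleFeature",
    major=1,
    minor=0,
    dimensions=(
        Dimension("updateType", vocabulary={"article": 0, "video": 1}),
        Dimension("bucket", size=4),
    ),
)
coordinate = feature.encode("video", 3)
```

The point of the sketch is that the domain check happens once, at encoding time, so downstream systems only ever see integers that are known to be valid for the feature.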

By building on top of tensors, features are always serialized in the same generic schema. This allows for flexibility in quickly adding new features to systems without changing any APIs. This is achievable thanks to the metadata, which maps what previously would have been space-inefficient strings or custom-defined schemas to integers. The metadata is distributed to each of the systems once and then utilized to understand whatever serialized tensors are passed to the system by resolving a feature URN to its metadata. In addition to being efficient for serialization, the numeric representation allows for more powerful linear algebra operations. This includes dot products for linear models, integer representations to easily embed categoricals, and neural nets for deep learning work.
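As a rough sketch of how a single generic schema plus distributed metadata could work, the Python below passes only a URN, integer indices, and values between systems, and resolves the URN locally when a readable form or a model score is needed. The record layout, the feature "exampleCategorical," its vocabulary, and the weights are assumptions made for illustration, not the production schema.

```python
from typing import Dict, List, NamedTuple, Tuple


class SerializedTensor(NamedTuple):
    """The same generic record for every feature (hypothetical layout)."""
    urn: str
    indices: List[Tuple[int, ...]]  # one coordinate tuple per non-zero entry
    values: List[float]


# Metadata is shipped to each system once; tensors then travel without strings.
# The feature name and vocabulary below are made up for illustration.
METADATA: Dict[str, Dict[str, List[str]]] = {
    "urn:li:(flagship,exampleCategorical,1,0)": {
        "vocabulary": ["article", "video", "event"],
    },
}


def decode(tensor: SerializedTensor,
           metadata: Dict[str, Dict[str, List[str]]] = METADATA) -> Dict[str, float]:
    """Resolve the URN to its metadata and recover human-readable tokens."""
    vocabulary = metadata[tensor.urn]["vocabulary"]
    return {vocabulary[i]: v for (i,), v in zip(tensor.indices, tensor.values)}


def linear_score(tensor: SerializedTensor, weights: List[float]) -> float:
    """Sparse dot product between the tensor and a dense weight vector."""
    return sum(weights[i] * v for (i,), v in zip(tensor.indices, tensor.values))


tensor = SerializedTensor(
    urn="urn:li:(flagship,exampleCategorical,1,0)",
    indices=[(1,), (2,)],
    values=[1.0, 3.0],
)
readable = decode(tensor)                                # for debugging and logs
score = linear_score(tensor, weights=[0.5, -1.0, 2.0])   # for a linear model
```

Adding a new feature in this sketch means adding a metadata entry; the `SerializedTensor` record, and therefore any API that carries it, stays unchanged.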

Detailed example

Here’s an example feature with types defined:

In the above example, we are defining a feature “historicalActionsInFeed” that lists the historical actions a member has taken on the feed. The feature metadata is defined inside the flagship namespace, with names and versions. This allows any system to look up the feature in the metadata system using the URN urn:li:(flagship,historicalActionsInFeed,1,0). The feature definition has two dimensions: a categorical listing of the different action types in “feedActions,” and a discrete count representing different time windows in a member’s history. Some example data points this could represent would be “10 clicks in the last 2 weeks” or “5 shares in the last month.” These two high-level concepts are shown in the table below, represented in both our previous format and our new tensor format. Previously, they were poorly encoded in string and float maps, because documentation and understandability were intermingled with the serialization format. By making the URN resolvable from all systems, we can serialize just the tensor data itself between systems (the third row in the table: Tensor Format).

Table showing the string map format compared to the tensor format
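As a rough illustration of the two encodings being compared (not a reproduction of the original table), the Python sketch below expresses the two example data points first as a string-to-float map and then as tensor indices and values. The vocabularies, the index assignments, and the "action:window" key convention are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical vocabularies; in practice these would come from the feature metadata.
FEED_ACTIONS: Dict[str, int] = {"click": 0, "share": 1}
TIME_WINDOWS: Dict[str, int] = {"2weeks": 0, "1month": 1}

# Previous style: every entry repeats its own description as a string key,
# and counts are stored as floats.
old_format: Dict[str, float] = {
    "click:2weeks": 10.0,  # "10 clicks in the last 2 weeks"
    "share:1month": 5.0,   # "5 shares in the last month"
}


def to_tensor(string_map: Dict[str, float]) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Convert the string map into (action, window) coordinates and integer counts."""
    indices: List[Tuple[int, int]] = []
    values: List[int] = []
    for key, value in string_map.items():
        action, window = key.split(":")
        indices.append((FEED_ACTIONS[action], TIME_WINDOWS[window]))
        values.append(int(value))
    return indices, values


# Tensor style: only integers are serialized; the meaning lives in the metadata.
indices, values = to_tensor(old_format)
```

In the tensor form, the strings appear once in the metadata instead of once per member per request, which is where the serialization savings described above would come from.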

Metadata design

A challenge in adding metadata to our feature system was how to distribute it. Previously, string features were self-describing and schemas relied on existing systems like Avro schemas or Rest.li systems. When designing the metadata component, we took inspiration from schema management at LinkedIn, creating a single, textual source of truth for each metadata type that is stored under source control and published as an artifact library. This allows any deployed system to pull in the desired metadata simply by declaring a dependency. A provided metadata resolution library takes the HOCON files (example above) and provides a lookup from a URN to its metadata. This metadata library is required for any manipulation of, or inference on, the features. Each feed model is marked with the required version of this metadata library artifact, which ensures that online systems have all the information they need before a model is allowed to deploy.
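The deployment check described above could look something like the following Python sketch. It assumes, purely for illustration, that metadata library versions are dotted numeric strings and that a newer library contains everything an older one did; the function names are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted numeric version such as '1.10.2' into comparable integers."""
    return tuple(int(part) for part in version.split("."))


def can_deploy(required_metadata_version: str, deployed_metadata_version: str) -> bool:
    """Allow a model only if the online system's metadata is at least as new
    as the version the model was marked with (assumes newer is a superset)."""
    return parse_version(deployed_metadata_version) >= parse_version(required_metadata_version)


# Versions compare numerically, so 1.10.2 is newer than 1.4.0.
ok = can_deploy(required_metadata_version="1.4.0", deployed_metadata_version="1.10.2")
blocked = can_deploy(required_metadata_version="2.0.0", deployed_metadata_version="1.10.2")
```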

While this solution solves problems of distribution and easy definition, it makes the problem of evolving the metadata of categoricals more difficult. In particular for the feed, we have this problem when it comes to bringing new update types to our members. Examples include live video, presentations, events, and kudos. To address evolving categoricals, the feed team relies on hashing to provide a consistent mapping and a single update job to provide debuggable backwards mappings. Hashing allows for our many systems—online, nearline, and offline—to all agree on a consistent mapping from a new categorical string to a numeric representation. This hash sends new categoricals that come into the ecosystem to a numeric representation without losing the uniqueness of a new categorical representation or allowing system mappings to diverge. However, there needs to be a process by which these new categorical mappings can be added into our metadata’s categorical mappings so that systems can reverse the numeric hash into a human-understandable string. This is done as a single, offline job that looks at recorded string representations of these categoricals daily and applies the same hashing function to them. From this, we publish updates to our metadata library.
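A minimal Python sketch of this scheme follows. The post does not say which hash function or width the feed team uses, so SHA-256 truncated to 64 bits is an assumption, as are the function names and the example tokens. Python's built-in `hash()` is deliberately avoided because it is salted per process for strings, which would let systems disagree.

```python
import hashlib
from typing import Dict, Iterable, Optional


def stable_hash(token: str, bits: int = 64) -> int:
    """Deterministic hash that every system computes identically for a token."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def build_reverse_mapping(
    observed_tokens: Iterable[str],
    known: Optional[Dict[int, str]] = None,
) -> Dict[int, str]:
    """Offline job sketch: hash recorded strings and extend the hash -> string map."""
    mapping: Dict[int, str] = dict(known or {})
    for token in observed_tokens:
        hashed = stable_hash(token)
        existing = mapping.get(hashed)
        if existing is not None and existing != token:
            # Two different categoricals landed on the same hash.
            raise ValueError(f"hash collision between {existing!r} and {token!r}")
        mapping[hashed] = token
    return mapping


# Online systems can emit the numeric value for a brand-new categorical immediately.
numeric = stable_hash("liveVideo")

# Later, the offline job publishes the reverse mapping for debugging.
reverse = build_reverse_mapping(["liveVideo", "kudos"])
readable = reverse[numeric]
```

The forward direction needs no coordination, which is why new update types can flow through every system before the metadata library catches up; only the reverse, human-readable direction waits for the offline job.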

How new categoricals are captured and examples of how the metadata library is used in multiple systems

This is a simple system to bootstrap our metadata, but it still has the drawbacks of hash collisions and delayed information propagation. While we use enough hash space to make collisions unlikely, there is still a chance that two categoricals map to the same hash. A further problem is that a hash space large enough to keep collisions rare makes it harder to create embeddings, since the hashed values are too sparse to index an embedding table directly, though re-hashing to denser domains is possible. The system also needs to wait until the next deployment for newer categoricals to propagate, which can be a burden for long-running nearline/online systems. Our eventual desired solution to these problems is a more centralized, dynamic system that other systems (online or offline) can query to get the latest metadata. This requires additional infrastructure to be in place, but it is the direction we’d like to move in as the metadata system becomes more mature and widespread.
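The trade-off between collision risk and embedding-friendly density can be sketched in Python as follows. The collision estimate is the standard birthday-bound approximation, and the modulo re-hash, bucket count, embedding size, and example hash value are illustrative assumptions, not the feed team's actual choices.

```python
import math
from typing import List


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2 * 2**bits))."""
    space = 2.0 ** hash_bits
    return -math.expm1(-num_items * (num_items - 1) / (2.0 * space))


def dense_index(hash_value: int, num_buckets: int) -> int:
    """Re-hash a wide hash into a small range usable as an embedding row."""
    return hash_value % num_buckets


NUM_BUCKETS = 8     # illustrative; far smaller than the original hash space
EMBEDDING_DIM = 4   # illustrative
embedding_table: List[List[float]] = [
    [0.0] * EMBEDDING_DIM for _ in range(NUM_BUCKETS)
]

wide_hash = 0x9F3A6C21D4E5B807  # stands in for a 64-bit categorical hash
row = embedding_table[dense_index(wide_hash, NUM_BUCKETS)]

# A wide hash keeps collisions rare; a narrow one makes them likely.
p_wide = collision_probability(num_items=1_000_000, hash_bits=64)
p_narrow = collision_probability(num_items=1_000_000, hash_bits=32)
```

Re-hashing into fewer buckets brings collisions back, so the bucket count becomes a tuning choice between embedding table size and how many categoricals end up sharing a row.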

Conclusion

We’ve talked about the challenges of feature management for the feed at LinkedIn, a typical large-scale machine learning system. By introducing tensors and metadata, we have a data type system that can meet the varied needs of our machine learning engineers. Since the feed team introduced this new data type, inference performance has increased by over 20%. We have also simplified our online feature deployment process so engineers can bring a feature from experiment to production in less than two weeks. What previously required hundreds of lines of online re-formatting code, multi-team schema review processes, and custom offline joins and formats has been replaced with configurations and fixed schemas.

Acknowledgements

We’d like to thank the many teams that have collaborated to make this project a success for our members’ feeds. This includes engineers from across our machine learning infrastructure teams of training, feature fetching, inference, and feed infra. Thank you to Andris Birkmanis, Devendra Jaisinghani, Adam Peck, Leon Gao, Zhifei Song, Pratik Dixit, Sunny Ketkar, Harsh Bhatt, Yanbin Jiang, Jimmy Guo, Zhiyuan Zou, Alan Li, Marius Seritan, Jun Jia, Harsh Khattri, Bee-Chung Chen, Hossein Attar, David Stein, Aditya Kumar, Yiming Ma, and Zheng Li.
