Open sourcing Feathr – LinkedIn’s feature store for productive machine learning
April 12, 2022
We are open sourcing Feathr – the feature store we built to simplify machine learning (ML) feature management and improve developer productivity. At LinkedIn, dozens of applications use Feathr to define features, compute them for training, deploy them in production, and share them across teams. Users have reported that, with Feathr, adding new features to model training workflows takes significantly less time, and that runtime performance has improved compared to previous application-specific feature pipeline solutions.
The problem with scaling feature pipelines
At LinkedIn, we have hundreds of ML models running in applications like Search, Feed, and Ads. Our models are powered by thousands of features about entities in our Economic Graph like companies, job postings, and LinkedIn members. Preparing and managing features has been one of the most time-consuming parts of operating our ML applications at scale.
A few years ago, we noticed a pattern: teams were getting overburdened with increasing costs of maintaining their feature preparation pipelines, and this was hurting their productivity in innovating and improving their applications. Feature preparation pipelines, the systems and workflows that transform raw data into features for model training and inference, are non-trivial. They need to bring together time-sensitive data from many sources, join the features to training labels in a point-in-time-correct manner, and persist the features into storage for low-latency serving online. They also need to ensure features are prepared in the same way for the training and inferencing contexts to prevent training-serving skew.
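To make the point-in-time requirement concrete, here is a minimal sketch in plain Python. It is not Feathr's API or implementation; the function name `point_in_time_join` and the tuple layouts are assumptions made for illustration. Each training label gets the most recent feature value observed at or before the label's timestamp, so no information from the future leaks into the training data.

```python
from bisect import bisect_right


def point_in_time_join(labels, feature_events):
    """Attach to each label the latest feature value known at the label's time.

    labels:         list of (entity_id, timestamp, label)
    feature_events: list of (entity_id, timestamp, value)
    Both layouts are hypothetical, chosen only for this illustration.
    """
    # Group feature events per entity, ordered by timestamp.
    history = {}
    for entity_id, ts, value in sorted(feature_events, key=lambda e: (e[0], e[1])):
        times, values = history.setdefault(entity_id, ([], []))
        times.append(ts)
        values.append(value)

    rows = []
    for entity_id, label_ts, label in labels:
        times, values = history.get(entity_id, ([], []))
        # Index of the first event strictly after the label timestamp.
        idx = bisect_right(times, label_ts)
        # Use the event just before it, or None if nothing was known yet.
        feature = values[idx - 1] if idx > 0 else None
        rows.append((entity_id, label_ts, feature, label))
    return rows


feature_events = [("m1", 10, 3), ("m1", 20, 5), ("m2", 15, 7)]
labels = [("m1", 12, 1), ("m1", 25, 0), ("m2", 5, 1)]
training_rows = point_in_time_join(labels, feature_events)
```

In this sketch, the label for `m2` at time 5 gets no feature value, because the only event for `m2` happened later, at time 15. A naive join on entity ID alone would have attached that future value.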
The cost of building and maintaining feature pipelines was borne redundantly across many teams, since each team tended to have their own pipeline. And pipeline complexity tended to increase as new features and capabilities were added over time. Team-specific pipelines also made it impractical to reuse features across projects. Without a common abstraction for features, we had no uniform way to name features across models, no uniform type system for features, and no uniform way to deploy and serve features in production. Custom pipeline architectures made it prohibitively difficult to share work.
The solution: Feathr feature store
To address these problems, we built Feathr. Feathr is an abstraction layer that provides a common feature namespace for defining features and a common platform for computing, serving, and accessing them “by name” from within ML workflows. Feathr reduces the need for individual teams to manage bespoke feature pipelines, and enables features to be easily shared across projects, improving ML productivity. Feathr is a feature store – a term that emerged in recent years to describe systems that manage and serve ML feature data. While many other industry solutions are concerned primarily with feature data management and serving, Feathr also has advanced support for feature transformations, enabling users to easily experiment with new features based on raw data sets. (Note: Feathr has been discussed publicly under the previous name Frame.)
Feathr’s abstraction creates producer and consumer personas for features. Producers define features and register them into Feathr, and consumers access/import groups of features into their ML model workflows. Oftentimes the same engineer plays both roles, producing and consuming features for their own project, but at LinkedIn we also have some producer-focused teams that produce horizontally useful features that get imported by projects across the company using Feathr.
From the consumer’s point of view, you can think of Feathr like a software package management tool for ML features. In modern software development, engineers usually don’t have to think about how dependency library artifacts get fetched, how transitive dependencies get resolved, or how the dependency libraries get linked to their code for compilation or execution. Instead, engineers just list the names of the dependency modules they want, import the classes or libraries in their code, and the build system figures out the rest. Similarly, Feathr lets feature consumers list the names of the features they want to “import” into their model, abstracting away the nontrivial details of how those features are sourced and computed. Under the hood, Feathr figures out how to provide the requested feature data in the required way for model training and production inferencing. For model training, features are computed and joined to input labels in a point-in-time-correct way, and for model inferencing, features are pre-materialized and deployed to online data stores for low-latency online serving. Features defined by different teams and projects can easily be used together, enabling collaboration and reuse.
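The package-manager analogy can be sketched as a toy registry. This is an illustration only and not Feathr's API: the class `FeatureRegistry`, its methods, and the feature names are all hypothetical. Consumers ask for features by name, and the registry resolves transitive dependencies between feature definitions before computing anything.

```python
class FeatureRegistry:
    """Toy, hypothetical registry: features are requested by name."""

    def __init__(self):
        # name -> (names of features it depends on, compute function)
        self._definitions = {}

    def register(self, name, compute, depends_on=()):
        self._definitions[name] = (tuple(depends_on), compute)

    def resolve(self, requested):
        """Return feature names ordered so dependencies precede dependents."""
        order, visiting = [], set()

        def visit(name):
            if name in order:
                return
            if name in visiting:
                raise ValueError(f"cyclic feature dependency at {name!r}")
            if name not in self._definitions:
                raise KeyError(f"unknown feature {name!r}")
            visiting.add(name)
            for dep in self._definitions[name][0]:
                visit(dep)
            visiting.discard(name)
            order.append(name)

        for name in requested:
            visit(name)
        return order

    def get_features(self, requested, raw_row):
        """Compute the requested features for one raw input row."""
        values = {}
        for name in self.resolve(requested):
            deps, compute = self._definitions[name]
            # Each compute function receives the raw row plus its dependencies.
            values[name] = compute(raw_row, *(values[d] for d in deps))
        return {name: values[name] for name in requested}


registry = FeatureRegistry()
registry.register("job_num_applicants", lambda row: row["applicants"])
registry.register("job_num_views", lambda row: row["views"])
registry.register(
    "job_apply_rate",
    lambda row, applicants, views: applicants / views if views else 0.0,
    depends_on=("job_num_applicants", "job_num_views"),
)

# The consumer names only the feature it wants; dependencies are implicit.
features = registry.get_features(["job_apply_rate"], {"applicants": 5, "views": 20})
```

The consumer here never mentions `job_num_applicants` or `job_num_views`, much as a build file does not list transitive dependencies.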
On the producer side, Feathr lets you define and register features based on raw data sources (including time-series data), or based on other features already defined in Feathr, using simple expressions. User-defined functions are supported for more complex use cases. Feathr supports aggregations, transformations, time windowing, and a rich set of types including vectors and tensors, making it easy to define many kinds of features based on your underlying data. When consumers import the features in their model workflows, the features’ definitions registered in Feathr get replayed automatically over historical time-series data to compute features at specific points in time, enabling point-in-time-correct feature computation during training data generation. For model inferencing, Feathr materializes feature data sets and deploys them to online data stores for fast access.
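As a rough illustration of what replaying a windowed feature definition over historical data means, the sketch below recomputes a sliding-window sum at several observation times. It is a simplified stand-in, not Feathr's definition language or runtime; the function names and the integer timestamps are assumptions for the example.

```python
def windowed_sum(events, as_of, window):
    """Sum the values of events with as_of - window < timestamp <= as_of."""
    return sum(value for ts, value in events if as_of - window < ts <= as_of)


def replay_feature(events, observation_times, window):
    """Recompute the windowed feature as it would have looked at each time."""
    return [(t, windowed_sum(events, t, window)) for t in observation_times]


# Hypothetical (timestamp, value) click events for one entity.
click_events = [(1, 2), (3, 4), (8, 1), (9, 5)]

# "Clicks in the trailing 7 time units", evaluated at three label timestamps.
replayed = replay_feature(click_events, observation_times=[3, 9, 10], window=7)
```

Each observation time sees only events inside its own trailing window, so the same definition yields different values at different points in history. That is the behavior training data generation needs.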
To see examples of how to use Feathr to define features based on raw data, to compute feature values for training, and to deploy features to production for online inference, check out our GitHub page.
Feathr at LinkedIn
We have deployed Feathr for dozens of applications at LinkedIn including Search, Feed, and Ads, managing feature pipelines in hundreds of model workflows. Some of our largest ML projects were able to remove sizable volumes of code by replacing their application-specific feature preparation pipelines with Feathr, which reduced the engineering time required for adding and experimenting with new features from weeks to days. We often found that Feathr ran faster than the custom feature processing pipelines it replaced, by as much as 50%, as investments in Feathr runtime optimization are amortized across the applications that use it. Feathr has also enabled feature sharing between similar applications. For example, LinkedIn has multiple search and recommendation systems that deal with data about job postings. These projects used to have difficulty sharing features under previous application-specific pipeline architectures, but were able to share features easily using Feathr, leading to significant gains in business metrics.
We’ve scaled Feathr to work for large-scale applications processing petabytes of feature data. Over the years we’ve introduced optimizations that have significantly reduced processing time for our largest internal users.
Feathr has provided a foundation for feature registration, computation, and deployment, and we are continuing to develop the ecosystem around Feathr with new cutting-edge infrastructure and tools. One of our favorite examples is enabling next-level CI/CD for feature engineering, where users will be able to create upgraded versions of widely shared ML features that will then be automatically tested against existing models that depend on those features. We’re looking forward to sharing more details about these exciting projects in the future.
Today, we have open sourced the most-used, core parts of Feathr. We plan to keep adding to it with new capabilities developed at LinkedIn, and we look forward to seeing suggestions and involvement from the open-source community. We are also excited to announce a partnership with our colleagues at Microsoft Azure to bring native integration and support for Feathr on Azure. To get started using Feathr on Azure, check out our guide on Feathr’s GitHub page.
Many people have contributed to Feathr over the past few years. In particular we would like to thank Jinghui Mo and Hangfei Lin who spearheaded the effort to bring Feathr to the open source community, in addition to their many other significant contributions. We’d also like to thank the early contributors whose leadership built the foundation of this project, especially Joel Young, Paul Ogilvie, Bee-Chung Chen, Priyanka Gariba, Kevin Wu, Joshua Hartman, James Wu, Benjamin Le, and David Zhuang. We’d also like to thank former and current contributors Li Lu, Sasha Ovsankin, Aditya Kumar, Ray Zhang, Jimmy Guo, Qing Li, Daniel Gmach, Chen Sun, Rakesh Kashyap, Grace Tang, Kevin Kao, Lei Li, Yangchun Luo, Boyi Chen, Jian Qiao, Abhishek Ravi, Sertan Alkan, Andris Birkmanis, Hongbo Liu, Maheswaran Venkatachalam, Darren Teng, Qinyu Tong, Bozhong Hu, Ruoyang Wang, Sally Ou, Shashank Paliwal, Thomas Huang, Ben Lee, Shengqian Ji, and all our other teammates for their many contributions. We’d also like to thank Xiaoyong Zhu for spearheading our Azure collaboration, and we thank colleagues Alexandre Patry, Skylar Payne, Rohan Ramanath, Shaunak Chatterjee, Min Shen, Zhong Zhang, Ian Ackerman, Zheng Li, Hossein Attar, Xuebin Yan, Maneesh Varshney, Lance Wall, and Swapnil Ghike for their advice and contributions.