Open Sourcing Photon ML
LinkedIn’s Scalable Machine Learning Library for Spark
June 7, 2016
Machine learning is a key component of LinkedIn’s relevance-driven products. We use machine learning to train the ranking algorithms for our feed, advertising, recommender systems (such as People You May Know), email optimization, search engines, and more. For an in-depth example, check out these posts (part one and two) on how LinkedIn applies machine learning for ranking the feed.
These algorithms play an important role in determining user experience for content-rich websites, so it’s critical that we provide our engineers with easy-to-use machine learning tools that create high-quality models that are fast and scale to large datasets. To meet these needs, we have developed Photon ML, a machine learning library for Apache Spark. By combining the ability of Spark to quickly process massive datasets with powerful model training and diagnostic utilities, Photon ML allows research engineers to make more informed decisions about the algorithms they choose for the types of recommendation systems listed above.
There is potential for this library to provide widespread value to research engineers in many different fields, which is why we are proud to announce today that we have open sourced Photon ML.
What’s in Photon ML
Photon ML supports large-scale regression: linear, logistic, and Poisson regression with L1, L2, and elastic-net regularization. We also support offsets, weights, and bounds for coefficients. Photon ML can optionally generate model diagnostics, creating charts and tables that help diagnose the model and its fit to your optimization problem. It also includes an experimental implementation of generalized additive mixed effect models, which we describe in more detail below.
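To make the pieces above concrete, here is a minimal single-machine sketch of an elastic-net regularized logistic regression objective with per-example offsets and weights. It is an illustration only, not Photon ML's API: the function names, the row layout, and the `lam`/`alpha` parameterization of the penalty are assumptions chosen for this example.

```python
import math

def logistic_loss(y, margin):
    """Stable log(1 + exp(-z)), where z = margin for y == 1 and -margin for y == 0."""
    z = margin if y == 1 else -margin
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def elastic_net_objective(coefs, rows, lam, alpha):
    """Weighted logistic loss plus an elastic-net penalty.

    rows: iterable of (features, label, offset, weight) tuples (hypothetical layout).
    lam: overall regularization strength; alpha: mix between L1 (1.0) and L2 (0.0).
    """
    data_term = 0.0
    for x, y, offset, weight in rows:
        # The offset is added to the linear score before applying the loss.
        margin = offset + sum(c * xi for c, xi in zip(coefs, x))
        data_term += weight * logistic_loss(y, margin)
    l1 = sum(abs(c) for c in coefs)
    l2 = sum(c * c for c in coefs)
    return data_term + lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)
```

Setting `alpha` to 1.0 or 0.0 recovers pure L1 or pure L2 regularization. Coefficient bounds are not shown here; they would constrain the optimizer rather than change the objective.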
How Photon ML is applied at LinkedIn
A typical machine learning system could be represented by the diagram below. The first stage centers on data preparation, often involving ETL of data from online systems, creating labels, and joining in features. The next stage applies machine learning algorithms to learn good scoring functions for our recommender or search systems, and to select the best models. Finally, the best models are frequently deployed to online A/B tests to measure their impact on our member and customer experience.
Photon ML is at the core of model training for LinkedIn, and has served as a drop-in replacement for other machine learning libraries, such as the previously open-sourced ADMM implementation in ml-ease. In this diagram, circles represent actions, and cylinders are datasets.
How we run Photon ML in our clusters
At LinkedIn, we run Photon ML using Spark on YARN, hosted on the same cluster as other Hadoop MapReduce applications. This makes it easy for us to mix traditional Hadoop MapReduce programs or scripts in the same workflow. Switching our workflows from Hadoop MapReduce to Spark on YARN has made model training 10 to 30 times faster. To support better use of Spark, the Machine Learning Algorithms Team also contributed Spark support to Dr. Elephant, which LinkedIn also recently open sourced.
Hosting Spark and Hadoop workflows together on the same cluster, along with supporting the same input and output formats as the machine learning libraries previously developed at LinkedIn, has greatly eased the adoption of Photon ML at LinkedIn. It is already in use by many teams working on relevance applications and security data science, and several teams run it in production.
Where Photon ML is headed: GAME
Our vision is to have industry-wide impact on how people build and apply machine learning technology. To realize this vision, we have to be a part of the machine learning community, and that means sharing our code. While there are many open source machine learning libraries currently available, we feel that Photon ML is an important addition because of the direction in which we intend to take the library: generalized additive mixed effect models (GAME), described in more detail below.
Currently, the GAME implementation in Photon ML supports generalized linear mixed effect models (GLMix), a subset of the algorithms we intend to one day support in GAME. A GLMix model consists of a fixed effect component and multiple random effects. A fixed effect model corresponds to a traditional, generalized linear model and assumes each observation is independent. Random effects capture additional heterogeneity in residuals from fixed effects by attaching parameters at multiple granularities (users, items, segments). Shrinkage/regularization is often used to avoid overfitting in such scenarios. In addition, random effects induce marginal dependence among observations.
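One way to write down the structure described above is the following sketch. The notation is ours and is not taken from the Photon ML code: g is the link function of the generalized linear model, β holds the fixed effect coefficients, R is the number of random effects, e_r(i) is the entity (for example a user or an item) that observation i belongs to for random effect r, and z_{r,i} are the features that random effect uses. The Gaussian prior is one common way to express the shrinkage mentioned above and is an assumption here.

```latex
g\big(\mathbb{E}[y_i]\big)
  = x_i^\top \beta
  + \sum_{r=1}^{R} z_{r,i}^\top \gamma_{r,\,e_r(i)},
\qquad
\gamma_{r,k} \sim \mathcal{N}\big(0,\ \sigma_r^2 I\big)
```

Because every observation that shares an entity k also shares the coefficients γ_{r,k}, those observations are no longer independent once the random effects are integrated out, which is the marginal dependence noted above.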
For example, we use GLMix models to improve job recommendations by using a random effect for members and a random effect for jobs. To be more precise, the random effect for members includes features from job descriptions, such as extracted skills or job titles. Modeling the random effect in this way allows us to better learn which jobs a highly-active member is interested in, with coefficients for job features specific to that member.
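As a rough illustration of the job recommendation example, the sketch below scores one member-job pair under a logistic GLMix model. All names are hypothetical, and pairing the per-job random effect with member features is our assumption; the text above only specifies that the per-member random effect uses job features.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def glmix_score(global_features, global_coefs,
                job_features, member_coefs,
                member_features, job_coefs):
    """Probability that a member applies to a job (illustrative sketch only)."""
    margin = (
        dot(global_features, global_coefs)    # fixed effect, shared by everyone
        + dot(job_features, member_coefs)     # this member's coefficients on job features
        + dot(member_features, job_coefs)     # this job's coefficients on member features (assumed)
    )
    return 1.0 / (1.0 + math.exp(-margin))
```

A member with little activity would have coefficients shrunk toward zero, so their score falls back to the fixed effect; a highly active member's own coefficients can shift the score noticeably.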
GAME solves each effect in sequence using coordinate descent.
We use coordinate descent to optimize the full problem and step through each effect (coordinate) in sequence, solving the sub-problem with an appropriate optimizer. For the fixed effect coordinate, we use a distributed regression algorithm, partitioning the data by example. Spark’s support for resilient distributed datasets keeps data local per iteration, allowing for quick optimization without having to reshuffle data between iterations. To efficiently solve the random effect coordinates, we partition the data by the random variable for that effect, allowing us to solve the random effect coordinates with single machine algorithms.
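The loop described above can be sketched on a single machine as follows. This is a toy under stated assumptions, not Photon ML's implementation: it uses squared loss with closed-form updates instead of an iterative distributed optimizer, one global slope as the fixed effect, and one shrunken intercept per member as the only random effect. The function names and the `(member_id, x, y)` row layout are hypothetical.

```python
from collections import defaultdict

def solve_fixed_effect(rows, offsets, l2):
    """Closed-form ridge fit of a single global slope, given per-row offsets."""
    num = sum(x * (y - off) for (_, x, y), off in zip(rows, offsets))
    den = sum(x * x for _, x, _ in rows) + l2
    return num / den

def solve_random_effect(rows, offsets, l2):
    """Fit one shrunken intercept per member; each group is solved independently."""
    groups = defaultdict(list)
    for (member, _, y), off in zip(rows, offsets):
        groups[member].append(y - off)  # partition residuals by member id
    return {m: sum(res) / (len(res) + l2) for m, res in groups.items()}

def coordinate_descent(rows, passes=50, l2_fixed=0.0, l2_random=1.0):
    """rows: list of (member_id, x, y). Returns (slope, {member_id: intercept})."""
    slope, intercepts = 0.0, defaultdict(float)
    for _ in range(passes):
        # Fixed effect coordinate: scores from the random effect act as offsets.
        offsets = [intercepts[m] for m, _, _ in rows]
        slope = solve_fixed_effect(rows, offsets, l2_fixed)
        # Random effect coordinate: scores from the fixed effect act as offsets.
        offsets = [slope * x for _, x, _ in rows]
        intercepts = defaultdict(float, solve_random_effect(rows, offsets, l2_random))
    return slope, dict(intercepts)
```

Grouping by member in `solve_random_effect` plays the role of partitioning the data by the random variable: each group is small enough to be solved on its own, which is what makes single machine algorithms sufficient for that coordinate.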
Optimization call structure in GAME. The GAME Driver applies coordinate descent, stepping in sequence through coordinates that correspond to sub-problems of the larger model. The process of solving a random effect coordinate is split into one optimization problem for each value observed for the random effect in the training data.
GAME models allow research engineers to train their algorithms using a more accurate picture of the underlying dataset that better reflects the experience of individual members. We hope that increased use of these techniques in the future will lead to better algorithms for recommendation systems in general. Our own initial A/B tests have shown that GLMix models trained using Photon ML improved job recommendations by 15 to 30 percent (measured in job applications), and improved email article recommendations by 10 to 20 percent (measured in clickthrough rate). While these tests are still in their early stages, the results indicate that Photon ML can significantly improve recommendations for members.
We’re excited by the GLMix model support provided by the GAME algorithm in Photon ML, and we will continuously work on improving the robustness and ease of use. We develop directly against the open-sourced Photon ML, periodically releasing internal snapshots for use by teams within LinkedIn.
Our vision for GAME extends beyond generalized linear models; we already include experimental code for a factorized random effect model, which incorporates matrix factorization to capture interactions between random effects. In the future, we expect to continue to add more machine learning algorithms within the same generalized additive framework.
We had a lot of help across LinkedIn getting the tooling and infrastructure in place to develop and deploy Photon ML, and we are grateful for their contributions. We specifically want to call out the key developers of Photon ML: Xian Xing Zhang, Yitong Zhou, Josh Fleming, Degao Peng, Namit Katariya, Ankan Saha, Alex Bain, Alex Shelkovnykov, Yanen Li, and Brendan Drew. We also appreciate the technical guidance of Bee-Chung Chen and Bo Long and the support of Deepak Agarwal and Igor Perisic.