People You May Know (PYMK) allows members to grow their network by recommending new potential connections. It’s one of the most recognizable features at LinkedIn. PYMK was invented at LinkedIn and is responsible for building more than 50% of LinkedIn’s professional graph.
Two of the challenges of PYMK are relevance and scale.
At the heart of it is a binary classification problem: predicting whether two people know each other or not. Many features, or signals, are used for this prediction. For example, one of the first things to look at is friends-of-friends, or triangle closing. Here, if Alice knows Bob and Bob knows Carol, then maybe Alice knows Carol. One can then score these candidate triangles with features such as whether the pair overlapped at an organization or school, their age difference (you’re more likely to know someone near your own age), their geographical distance, etc.
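To make triangle closing concrete, here is a minimal sketch of friends-of-friends candidate generation. It assumes a small in-memory graph stored as a dictionary of sets; the function name `friends_of_friends` and the toy data are hypothetical and are not taken from the production PYMK system.

```python
from collections import Counter

def friends_of_friends(graph, member):
    """Count common connections between `member` and each second-degree candidate.

    `graph` maps each member to the set of their direct connections
    (an illustrative in-memory representation, assumed for this sketch).
    """
    direct = graph.get(member, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the member themselves and people they already know.
            if candidate != member and candidate not in direct:
                counts[candidate] += 1
    return counts

# Toy graph: Alice knows Bob and Dave; both of them know Carol.
graph = {
    "alice": {"bob", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"alice", "carol"},
}
candidates = friends_of_friends(graph, "alice")
```

In this toy graph, Carol is the only candidate for Alice, and the count of common connections is itself one feature that a downstream model could use.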
PYMK uses a logistic regression model to combine hundreds of features for this binary classification problem. The models are trained on billions of samples using LinkedIn’s open-sourced large-scale machine learning library.
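As a rough illustration of how a logistic regression model turns features into a probability, the sketch below scores one candidate pair. The feature names, weights, and bias are made up for the example; the real model combines hundreds of features with learned weights.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def connection_probability(features, weights, bias=0.0):
    """Logistic regression score: sigmoid of bias plus the weighted feature sum.

    Features without a weight are ignored (treated as weight 0).
    """
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)

# Hypothetical weights, chosen by hand for illustration only.
weights = {
    "common_connections": 0.6,
    "same_school": 1.2,
    "age_difference_years": -0.1,
}
pair = {"common_connections": 2, "same_school": 1, "age_difference_years": 3}
probability = connection_probability(pair, weights, bias=-3.0)
```

Positive weights push the probability up (more common connections, a shared school), while the negative weight on age difference pushes it down, matching the intuition described above.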
There are many interesting modeling challenges in feature engineering. For example, as part of our research to understand how two people working in the same organization know each other, we built a novel model that factors in the times at which each person joined and left the organization, the size of the organization, and the likelihood of people knowing each other within it (some organizations are more social than others), as published in WWW’13. The logic is simple: the affinity between two members who worked together in a small organization for 10 years is greater than that between members who've worked together for only a few months.
Another interesting modeling challenge is how to incorporate user feedback through impression discounting, that is, discounting the PYMK results that users have seen and ignored (see our KDD’14 paper for more details). The intuition is simple: a PYMK result that a user has seen without connecting is unlikely to interest that user, so it should be ranked lower in subsequent PYMK results.
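One simple way to picture impression discounting is an exponential decay on the score for every ignored impression. The sketch below is an assumption made for illustration, not the method from the KDD’14 paper; the `decay` parameter and the function names are hypothetical.

```python
def discount_score(score, ignored_impressions, decay=0.8):
    """Shrink a score by a factor of `decay` for each ignored impression."""
    return score * decay ** ignored_impressions

def rerank(scores, ignored, decay=0.8):
    """Return candidates ordered by discounted score, highest first.

    `scores` maps candidate -> model score; `ignored` maps candidate ->
    number of times the result was shown without leading to a connection.
    """
    discounted = {
        candidate: discount_score(score, ignored.get(candidate, 0), decay)
        for candidate, score in scores.items()
    }
    return sorted(discounted, key=discounted.get, reverse=True)

scores = {"carol": 0.9, "erin": 0.7}
ignored = {"carol": 3}  # Carol was shown three times and never acted on.
ranking = rerank(scores, ignored)
```

With these numbers, Carol's score drops from 0.9 to about 0.46, so Erin, who has not yet been shown, moves ahead of her in the ranking.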
In terms of scale, the PYMK system processes hundreds of terabytes of data and hundreds of billions of potential connections daily, and pushes new PYMK results every day. Because PYMK looks at the second-degree network (connections of connections), the volume of data to process grows much faster than the site itself. This poses a unique scaling challenge: we need to keep optimizing the PYMK system to handle this growth while still refreshing PYMK results every day. Our big data ecosystem for addressing the scaling challenges in PYMK includes many systems, such as the Voldemort key-value store, Azkaban for Hadoop workflow management, Apache Kafka for streaming, and more.
There is a lot of interesting ongoing work to improve People You May Know further, for example:
- Large-scale distributed machine learning
- Large-scale social graph processing
- Network A/B testing