Community building recommender for group chats in LinkedIn Messaging

Adam Leon

Engineer

April 28, 2022

Co-authors: Adam Leon and David Golland

At LinkedIn, relationships matter. On our platform, we focus on helping our members build and maintain the relationships they have formed throughout their professional career. In LinkedIn Messaging, our members form new and reunite with established communities from their network through group chats - Messaging’s one-to-many private conversation feature. However, reconnecting with a community can be overwhelming. To assist our members in creating and maintaining communities on the platform, we recommend relevant connections to members, which they can then add to group chats.

Meet Antoine, an alumnus of the Mimtome University, who has over 2,000 connections, is a member of several LinkedIn groups, and actively attends events on the LinkedIn platform.

Antoine’s network is large, with thousands of connections accumulated from a career's worth of experience: current and past work colleagues, fellow alumni, connections from networking events, and digital interactions on LinkedIn across Groups, Events, and Feed.

To reconnect with alumni using LinkedIn Messaging, Antoine has to manually search from memory through the alumni network, which could mean searching through thousands of contacts for specific alumni.

Not only is this process cumbersome, it is prone to forgetting valued members of the alumni community. Fortunately, there is structure in Antoine’s connections that we can leverage to assist in making recommendations from his network. For example, Antoine has many alumni in his network, and a subset of them also collaborate on volunteer projects. By identifying these communities within Antoine’s connections, we can streamline the process and help Antoine find the right connections to add to his alumni group chat.

Antoine’s experience can be generalized: each member's network is complex, spanning diverse origins, and consisting of many overlapping subsets. Further, each member’s network is personal—different members may group the same connections in different ways because their shared experience with those connections can differ.

In this blog post, we share how we leveraged the community structure within a member’s connections to generate personalized recommendations that dynamically update as a member adds new connections to a group chat. This low-latency solution provides a friction-free experience that helps our members reconnect with their community.

Solutions explored

Generating group chat recommendations presents several unique challenges that are less prevalent in other recommendation applications. Before arriving at our solution, we explored a number of approaches used in other recommender systems that fell short in the group chat setting.

Member-to-member affinity
One approach we considered was to use pairwise member-to-member affinity scores that indicate how likely each member is to engage with another member. These scores are widely used across many other recommender systems at LinkedIn (Personalizing LinkedIn Feed).

Although pairwise affinity scores successfully predict who a member most likely wants to engage with one-on-one, they fall short when it comes to group chats. For example, Antoine frequently messages his current manager and also frequently messages his brother; however, he never messages them together in the same group chat. Pairwise affinity scores lack the signal needed for strong recommendations in the group-chat setting and would group these two connections together.

From this, we develop the intuition that it is crucial to reference the members who have already been added to the group chat (a.k.a the ‘query set’) when recommending additional members to add. Additionally, there are combinatorially many possible group chats that each member can start; precomputing scores for all subsets of each member’s connections is intractable. Hence, the recommendation algorithm must run online to dynamically generate additional recommendations from the query set.

Shared connections
With the above intuition in mind, we considered using ‘common connections’ between the builder and query set. For instance, if Antoine has already added three out of four direct reports to a group chat, the remaining direct report is recommended because she is directly connected to Antoine and the other three. In other words, recommending connections with the highest connection overlap within the query set.

On one hand, this method is dynamically informed by the query set and can slice away irrelevant parts of a large network. On the other hand, connections and group chats are not always directly related. For instance, Antoine may use a group chat to create an introduction between two alumni with whom he is connected. Furthermore, connections are too coarse of a signal to describe subset communities (e.g. coworkers who volunteer together). Again, these pairwise connection signals fall short.

With this, we discovered that our solution should move away from pairwise connections and focus on the communities shared by group chat members.

Our solution

We take a first-principles approach that builds on the intuition that group chats are generally formed among members with shared communities. We create a personalized bipartite graph for each active member where the nodes are members and communities, with an edge between a member and each of their communities. This graph is stored efficiently and traversed online with each new member that is added to the group chat, which allows us to dynamically recommend the most relevant members, based on those already added to the chat.

Data structure

Our goal is to effectively predict the sets of communities in a member’s network and, while the member is building a community, be able to index into the correct set and recommend everyone else within milliseconds.

To achieve this, we first take the standpoint that who a member knows, from when and where, are described in their professional identity and engagement on the platform. For example, a member’s professional identity is described by where the member worked, went to school, and what LinkedIn Groups they engage with. It is through these associations that members relate to their connections.

We start with Antoine and his professional experience. He worked at Freshing and attended Mintome University, these are represented using community nodes.

To model relationships, we create a personalized data structure that maps each member’s network back to their professional experience in the form of a bipartite graph between members and their communities.

Building on the example, the Freshing community node has links to all of his connections who also worked there at the same time, such as David and Aarti. This pattern continues for all of Antoine’s professional categories. For a member’s personalized graph, we restrict the community nodes to only those from their experiences to approximate how the community builder sees the world.

image-of-member-experience-and-similar-member-connections

This structure effectively uses edges as an expression of relevance from a connection to a category of community. Moreover, it is inherently extensibile because any other notion of a community can simply be onboarded as a new node.

Storage

We serialize this personalized bipartite graph for each member into Venice, LinkedIn’s high-throughput, low-latency, highly-available, eventually-consistent, distributed key-value storage system. The key is the member, and the value is a serialized adjacency list representation of their network as described above.

The adjacency list is a mapping of each connection to a list of community nodes. We also store the reverse index: communities to members. Community nodes have the category, whether it is a current company, university, etc., encoded in its value, which we will leverage later for our machine learning strategy (described below).

Key	Value
Antoine’s memberId	memberToCommunities-> [ davidKealoha -> [current_company:1, school:101], nenneAfolayana -> [school:101]] communityToMembers -> [ current_company:1 -> [davidKealoha], school:101 -> [nenneAfolayana, davidKealoha]]

We calculate and store our data structure nightly for any member that has been active on LinkedIn in the last 30 days, bound the number of connections to ten thousand, which we found to reasonably balance the trade offs of candidate quality and storage cost, and finally, we use a compression strategy. All of which keeps us within reasonable bounds of storage.

Graph exploration algorithm

To generate candidates in a way that is consistent with the community the member is trying to build, we traverse the bipartite graph using an algorithm inspired by biased random walks.

After each connection is added to the group chat, we invoke our algorithm with the currently preselected connections (query set). The algorithm first deserializes the member’s graph from Venice. Then, the algorithm explores the personalized graph by taking two steps of a breadth-first search. The process starts from a connection in the query set, then steps to all linked community nodes. The entire process completes in less than 40 milliseconds to provide a smooth member experience.

For example, Antoine starts by adding David to the query set. The algorithm steps from David to all of the community nodes he shares with Antoine: Freshing and Mitome University.

image-of-adding-a-member-to-the-query-set

From these community nodes, the algorithm then steps to all the connections who share these community nodes. In the example, Aarti is visited by the graph traversal, but Fatimah is not.

The algorithm counts each time a candidate is visited as a measure of relevance. We repeat this process for each connection in the query set and finally rank recommendations by the most visited.

This combination of data structure and algorithm gives us the ability to accumulate scores in a way that captures the common patterns of how group chats are constructed. Our recommender can detect the subsets of communities, through repeat visits of similarly linked candidates. In other words, this method rewards the majority paths from the query set to a candidate.

Results from launch

Upon launch, our A/B framework showed a significant 40% increase in click through rate (CTR) as compared to the baseline heuristic recommender, which was a list of recently messaged connections. The CTR improvement resulted from both a significant increase in the absolute number of clicks (numerator) and a significant decrease in impressions (denominator). These results demonstrate that members are able to get quality results while having to scan a lot less.

We also saw a 1.5% increase in distinct members creating group chats and a 5.2% increase in group chats created. We are excited this recommendation approach has been successful in helping our members reconnect and engage with their communities.

Leveraging machine learning

One limitation of our solution is that, by simply counting candidate visitations, we are treating each type of community node with equal weight. Visiting a candidate through a shared company or a shared LinkedIn Group contributes a +1 to the candidate score. It was an open question whether each community node type carries the same predictive value when ranking candidates. Maybe a shared current company should accumulate a +2 while a shared LinkedIn Group should have a +1?

To answer this question, we leverage machine learning in our graph traversal algorithm to learn how to weigh the different categories of community. In addition to giving us a more precise candidate recommender, a successful machine learning strategy will give us confidence when onboarding new community categories that may have risky impact on quality, because we can learn the relative value of a new category. We gathered unbiased training data through a random ranker. We defined training observations as anything seen for greater than 500 milliseconds and define positive labels as any observation clicked on and included in a community. We used a logistic regression algorithm to learn the weights w of the features in the training data. To score a candidate with features x, we use the probability of the candidate as given by the logistic function:

We model this problem by enumerating the set of features as defined by the category of nodes between the query set and the candidate.These features can be defined in terms of paths P on the graph from a member q in the query set through categories to the candidate, c. All features we consider are defined in terms of the number of such paths.

We define the count of unique categories of a particular type linking any query member q to a candidate c. For example:

Feature	Pseudocode
Number of unique companies shared between the candidate and the members in the query set.	len(set([a for (q, a, c) in P if a.type == 'company']))
Number of unique query set members who share a school with the candidate.	len(set([q for (q, a, c) in P if a.type == 'school']))

This allows us to learn how the visitation magnitude through a given community category should be weighted in a linear equation when determining the score of a candidate.

Results

Launching our first machine learning model led to an additional 17% increase in CTR, a 5% increase in unique members using the recommender, and a 2.7% increase of new group chats created on the platform. This demonstrates that we can help members build communities by focusing on the relationships that matter.

Acknowledgments

Special thank you to the incredible engineering, design and product teams who have passionately championed this project.

Dixon Lo who established a visual representation of this project from the original proposal to this blog post. Enlightening discussions with Ian Wood. Joe Xue who relentlessly advocated for a strange idea. Lisa Li, Arthur Yu, Greg Lundien, Guanlun Zhao, Haoyang Li, Monique Yin, Brady Simpson, Nandeesh Rajashekar, consultations with Felix GV, and thoughtful peer review by Aubrey Gress Grove… Assemble!

Topics: Recommendations