Building Smart Replies for Member Messages
October 24, 2017
Primary authors: Jeff Pasternack (Machine Learning), Nimesh Chakravarthi (Product Engineering)
Coauthors: Adam Leon, Nandeesh Rajashekar, Birjodh Tiwana, Bing Zhao
Communication on the world’s largest professional network is integral to its success. At LinkedIn Messaging, we are working hard to bring value to our members by providing innovative improvements to the way they communicate professionally with others. Today, we are introducing a brand new natural language processing (NLP) recommendation engine that provides members with smart reply recommendations. In this blog post, we discuss the process we went through to build the models and infrastructure that power smart replies for members.
Smart replies suggests contextual messages for convenient member-to-member communication
Modeling and Predicting Smart Replies
When deciding on an approach to improve reply suggestions and unify the recommendation mechanism, we first had to identify a range of possible models and evaluate the pros and cons of each approach.
One way to generate smart replies would be to choose the text of the proposed reply word-by-word, as done by “sequence to sequence” models, which are often used for problems like text translation and summarization. However, we take a different approach, instead choosing the best reply from a finite inventory of possibilities and viewing the problem as multinomial classification rather than text generation. This approach has a number of advantages:
- Multinomial classification models tend to be simpler and easier to train, requiring less hyperparameter tuning to obtain good results.
- Simpler models also permit faster training and inference; the latter is particularly important for smart replies, where delaying the suggestion of replies would degrade the experience.
- Having a defined set of possible replies avoids the risk of generating a novel, offensive or otherwise inappropriate reply, and also makes ensuring diversity and evaluation straightforward (more on these later).
It’s also possible to adapt a sequence-to-sequence model to multinomial classification by, for example, limiting the search space of the beam search used to produce the final output. This gives us many of the benefits listed above while still leveraging the power of state-of-the-art deep learning; such models are an active area of research as we continue to improve smart replies.
Creating Candidate Replies
Before we can train a model to choose from a set of candidate replies, we first need to generate the set of candidates. We begin by anonymizing an extensive set of conversations, replacing personal information with placeholders; e.g., “Thanks, Sarah” becomes “Thanks, RECIPIENT_FIRST_NAME”. We then “standardize” the messages so those with identical or nearly identical meaning and connotation are considered equivalent; e.g., “Great; thanks!!!” is equivalent to “Great, thanks!”. From this very large collection of anonymized, standardized messages we then synthesize our candidate smart replies. Finally, the candidates are clustered into semantically coherent groups, like an Affirmative group that contains “Yes”, “Yeah sure”, “Yes, RECIPIENT_FIRST_NAME”, etc. These groupings are important for, among other things, evaluation and diversity.
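To make the standardization step concrete, here is a deliberately simplified sketch; the real pipeline is far more sophisticated, and the rules below (case folding, collapsing repeated punctuation, unifying clause separators) are our own illustrative assumptions, not the production logic:

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical message standardizer: maps nearly identical messages to a
// single canonical form, e.g. "Great; thanks!!!" and "Great, thanks!" become
// the same string.
public final class MessageStandardizer {
    private static final Pattern REPEATED_PUNCT = Pattern.compile("([!?.,;])\\1+");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String standardize(String message) {
        String s = message.trim().toLowerCase(Locale.ROOT);
        s = REPEATED_PUNCT.matcher(s).replaceAll("$1"); // "!!!" -> "!"
        s = s.replace(";", ",");                        // unify clause separators
        s = WHITESPACE.matcher(s).replaceAll(" ");      // collapse runs of spaces
        return s;
    }
}
```

Under these toy rules, both example messages standardize to “great, thanks!” and are treated as one candidate.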
Our model is trained on a very large collection of conversations. Conversations are automatically scanned by our software (i.e. not by humans) to find replies corresponding to one of the previously-synthesized candidate replies. From these are derived the training examples, consisting of the label (the candidate reply) as well as the context in which it was used (the conversation preceding this reply, and its participants).
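The extraction of training examples can be sketched as follows; the record type and the candidate set are illustrative stand-ins, not our actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: scan a conversation and, whenever a message matches a
// previously-synthesized candidate reply, emit the preceding context plus the
// matched candidate as a labeled training example.
public final class ExampleExtractor {
    public record Example(List<String> context, String label) {}

    public static List<Example> extract(List<String> conversation, Set<String> candidates) {
        List<Example> examples = new ArrayList<>();
        for (int i = 1; i < conversation.size(); i++) {
            String reply = conversation.get(i);
            if (candidates.contains(reply)) {
                // The label is the candidate reply; the context is everything before it.
                examples.add(new Example(conversation.subList(0, i), reply));
            }
        }
        return examples;
    }

    // Tiny demo over an invented two-message conversation.
    public static String demo() {
        List<Example> ex = extract(
            List.of("Just sent you the document", "Thanks, RECIPIENT_FIRST_NAME"),
            Set.of("Thanks, RECIPIENT_FIRST_NAME", "Sounds good"));
        return ex.size() + ":" + ex.get(0).label();
    }
}
```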
We use Dagli, a machine learning framework developed within LinkedIn, to build our multinomial classification model. Dagli represents a machine learning pipeline as a directed acyclic graph (DAG) defined via a Java API. A single node might be a statistical model (such as boosted decision trees or a multilayer perceptron) or a feature transformer (e.g., feature normalization), and the edges of the DAG represent the flow of one node’s results to the inputs of another. Because the DAG is easily configured within Java code with different hyperparameters, graph structures, and types of nodes (e.g., swapping boosted decision trees for a logistic regressor), we’re able to iterate quickly in our experiments and even automatically generate model permutations in pursuit of better predictions. Once a DAG is trained, deploying it is simple: we just serialize it as an ordinary Java object and copy it to other machines, which deserialize it back into memory and perform the online reply predictions. We’re planning to share Dagli as an open-source project in the near future.
Hypothetical Dagli pipeline for smart replies. Circles represent inputs to the DAG. Arrows connect the result of one node to the input of another.
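Since Dagli is not yet public, the following is only a schematic illustration of the DAG idea using plain Java function composition; the node names and toy featurization are invented for the example and bear no relation to the actual Dagli API:

```java
import java.util.List;
import java.util.function.Function;

// Illustration of the pipeline-as-graph idea: each node transforms the
// outputs of its parents, and edges wire one node's result to another's input.
public final class PipelineSketch {
    // A "node" in this sketch is just a function from input to output.
    interface Node<I, O> extends Function<I, O> {}

    public static List<Double> run(String message) {
        Node<String, List<String>> tokenizer =
            msg -> List.of(msg.toLowerCase().split("\\s+"));
        Node<List<String>, List<Double>> featurizer =
            tokens -> tokens.stream().map(t -> (double) t.length()).toList();
        // Edge of the DAG: the tokenizer's output feeds the featurizer.
        return featurizer.apply(tokenizer.apply(message));
    }
}
```

In the real framework, swapping a node (say, one classifier for another) or changing a hyperparameter is a small code change, which is what makes rapid experimentation possible.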
Inference, Personalization and Diversity
When you receive a message, it’s used, together with the preceding conversation, to predict what your responses might be so we can show you the top few highest-probability candidates. Often, these suggestions have placeholders which are used to personalize the message; for example, the model might predict that “Thanks, RECIPIENT_FIRST_NAME” is a good response to “Just sent you the document”. These placeholders are replaced with the corresponding pieces of information, so (if you’re talking to Jane) what you ultimately see is “Thanks, Jane”.
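Placeholder substitution itself is straightforward; a minimal sketch might look like this (the helper is hypothetical, though the placeholder name mirrors the example above):

```java
// Hypothetical personalization helper: replaces each placeholder in a
// candidate reply with the recipient-specific value.
public final class Personalizer {
    /** pairs = placeholder1, value1, placeholder2, value2, ... */
    public static String fill(String template, String... pairs) {
        String result = template;
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            result = result.replace(pairs[i], pairs[i + 1]);
        }
        return result;
    }
}
```

For example, `Personalizer.fill("Thanks, RECIPIENT_FIRST_NAME", "RECIPIENT_FIRST_NAME", "Jane")` yields “Thanks, Jane”.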
One potential issue is that there are, for example, many ways to say “yes”: “yeah”, “yup”, “sure”, etc., and if “yeah” is predicted with high probability, “sure” tends to be as well. This creates a problem in the diversity of the smart replies we display; we’d prefer not to show you three different ways to say “yes” as this precludes us from also suggesting “maybe” or “no”, reducing the chance at least one of the options is a good suggestion for you. Instead, we use the aforementioned semantic groupings of the candidate replies to check if all the suggestions have the same meaning; if so, we enforce simple rules (like “no more than two suggestions should be from the same semantic group”) to ensure a more diverse final set of suggestions.
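A rule like “no more than two suggestions from the same semantic group” can be sketched as a simple greedy filter over the probability-ranked candidates; the group assignments below are invented stand-ins for the real semantic clusters:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical diversity filter: walk candidates in order of descending
// predicted probability, skipping any candidate whose semantic group has
// already contributed maxPerGroup suggestions.
public final class DiversityFilter {
    public static List<String> pick(List<String> ranked, Map<String, String> groupOf,
                                    int k, int maxPerGroup) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> chosen = new ArrayList<>();
        for (String reply : ranked) {
            String group = groupOf.getOrDefault(reply, reply);
            int used = counts.getOrDefault(group, 0);
            if (used < maxPerGroup) {
                counts.put(group, used + 1);
                chosen.add(reply);
                if (chosen.size() == k) break;
            }
        }
        return chosen;
    }

    // Demo: three ways to say "yes" ranked highest, but only two survive,
    // leaving room for "maybe".
    public static String demo() {
        Map<String, String> groups = Map.of(
            "yeah", "Affirmative", "sure", "Affirmative",
            "yup", "Affirmative", "maybe", "Tentative");
        return pick(List.of("yeah", "sure", "yup", "maybe"), groups, 3, 2).toString();
    }
}
```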
Evaluating Smart Replies
Text generation models are typically evaluated by comparing the generated text to one or more “reference” texts using a metric like BLEU or Word Error Rate, and we could potentially use these to evaluate the replies suggested by our models, too. However, these metrics tend not to work well on the kind of very short messages used for smart replies; if the actual reply made by a user was “yep” but we predicted “yes”, either metric would consider this as bad (or as good) as predicting “no”, or “zebra”, or “antidisestablishmentarianism”. While there are more sophisticated metrics that mitigate this problem by considering the synonymy of words, judging the equivalence of texts is a hard problem, and the resulting scores still often do not reflect the real performance of the model as perceived by a human.
Fortunately, because we know which semantic group each possible candidate reply belongs to, we have an even better (and much simpler) alternative: checking whether both the actual and predicted reply correspond to the same semantic group. So if the actual reply was “Certainly” and the model predicted “Sure”, we consider that correct because both replies have the same meaning, but a prediction of “Goodbye” would be wrong. While this does not capture the exact connotation (“yep” is less formal than “yes”), it nonetheless allows us to quantify the performance of the model in a way that is both robust and very comprehensible, e.g. “the percent of times when one of the top three suggestions had the correct meaning”. Such metrics are invaluable in both estimating the quality of the user experience and, especially, judging whether one model variant should be preferred to another.
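A metric like “the percent of times when one of the top three suggestions had the correct meaning” is easy to sketch; the group assignments here are again hypothetical examples:

```java
import java.util.List;
import java.util.Map;

// Sketch of the semantic-group evaluation: a prediction counts as correct if
// any of the top-k suggestions falls in the same semantic group as the reply
// the member actually sent.
public final class SemanticGroupMetric {
    public static double topKGroupAccuracy(List<List<String>> topKPredictions,
                                           List<String> actualReplies,
                                           Map<String, String> groupOf) {
        int correct = 0;
        for (int i = 0; i < actualReplies.size(); i++) {
            String actualGroup = groupOf.get(actualReplies.get(i));
            for (String predicted : topKPredictions.get(i)) {
                if (groupOf.get(predicted).equals(actualGroup)) {
                    correct++;
                    break;
                }
            }
        }
        return (double) correct / actualReplies.size();
    }

    // Demo: "sure" for "certainly" is correct (same group), "goodbye" for
    // "no" is wrong, so accuracy is 0.5.
    public static double demo() {
        Map<String, String> groups = Map.of(
            "certainly", "Affirmative", "sure", "Affirmative",
            "goodbye", "Farewell", "no", "Negative");
        return topKGroupAccuracy(
            List.of(List.of("sure", "no"), List.of("goodbye")),
            List.of("certainly", "no"),
            groups);
    }
}
```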
Serving Recommendations at Scale
Every week, massive numbers of messages are sent by our members, and every one of those messages is a potential opportunity for the recipient to use a smart reply recommendation. This poses the difficult problem of serving computationally intensive recommendations at the rapid rate that members demand them. At the same time, we need to ensure that the speed of message delivery is unaffected by the recommendation engine.
In order to efficiently compute and serve recommendations for every message sent on LinkedIn, recommendations for replies to a message are precomputed for each recipient when that message is initially sent. They are stored in Espresso, our in-house NoSQL database, which offers cheap, fast retrieval. When a member wants to view their messages, the relevant reply recommendations are read from the database and sent along with their conversation. This retrieval pattern is far more scalable than on-the-fly computation, and it limits the strain on the messaging platform by requiring only a single, completely independent, asynchronous request to the smart reply service.
Classification latency example
The smart reply service is built on the Play Framework. A large portion of LinkedIn services are built on this framework to leverage its excellent concurrent request handling capabilities. This advantage allows the system to smoothly handle all the incoming requests from the messaging platform. Once the recommendation platform receives classification requests for a message, it dispatches the classifications to the recommender in an asynchronous fashion. Because of the large volume of classifications that the system is expected to process every second, a dedicated Java thread pool is used to manage these tasks, concurrently executing them and writing their results to the database as they are completed. Each recommendation task is also heavily tracked to give us key metrics and statistics to identify when the system is overwhelmed. This enables us to continuously improve our performance.
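The precompute-and-store pattern with a dedicated thread pool can be sketched as follows; a `ConcurrentHashMap` stands in for Espresso, and the `classify` method is a hypothetical placeholder for real model inference:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch: classification is dispatched asynchronously to a
// dedicated thread pool when a message is sent, and results are written to a
// store keyed by recipient and message for cheap retrieval later.
public final class ReplyPrecomputer {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Map<String, List<String>> store = new ConcurrentHashMap<>();

    // Hypothetical stand-in for the real model inference.
    private List<String> classify(String message) {
        return List.of("Thanks!", "Sounds good", "Will do");
    }

    /** Called when a message is sent; returns a future for the async write. */
    public CompletableFuture<Void> precompute(String recipientId, String messageId,
                                              String message) {
        return CompletableFuture.runAsync(
            () -> store.put(recipientId + ":" + messageId, classify(message)), pool);
    }

    /** Called when the recipient views the conversation: a cheap read. */
    public List<String> fetch(String recipientId, String messageId) {
        return store.getOrDefault(recipientId + ":" + messageId, List.of());
    }

    public void shutdown() { pool.shutdown(); }
}
```

The key property is that message delivery never waits on classification: the send path only enqueues work, and the read path is a fast key-value lookup.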
Real-Time Delivery
The nature of the LinkedIn Messaging platform demands a fast, real-time member experience. When a message is sent, the recipient should see their conversation update instantly, including everything that might be affected by the new message. This includes smart reply recommendations, which are most useful the moment the corresponding message is read. As a result, the smart reply architecture has been built to serve recommendations very quickly, at what seems like the exact same time the message is received.
Real-time delivery of recommendation results is enabled by the concurrent classification capabilities of the smart replies service, as well as by LinkedIn’s platform for real-time communication with client devices, which is itself a substantial scalability challenge. Upon the completion of each classification, the result is sent through the real-time communication platform directly to the client that each recipient is actively using. Because the classification is triggered just after the corresponding message is created, the completed recommendation is delivered to the recipient only milliseconds after the message itself.
Privacy & Security
LinkedIn takes our members’ privacy very seriously. We do everything in our power to make sure that private data, such as a member’s private messages, is not exposed or used in any way that might infringe on their privacy.
As described earlier in the post, replies used in our models are anonymized. This ensures member privacy is protected not only when messages and smart replies are sent between users, but also for the training data used in the development of our generalized message classification model. In addition, members can choose to opt out of this feature.
A number of data security controls are implemented within the smart replies platform to ensure the data used to generate recommendations is handled appropriately. To protect member data, we ensure that the messages are always sent across a secure encrypted stream to the smart reply platform. Furthermore, this data is considered to be in the highest bracket of confidentiality at LinkedIn, and servers that host the recommendation engine meet LinkedIn’s highest security standards.
Finally, we have added controls to ensure that profanity is never suggested to members, and that messages containing profanity do not generate suggestions.
The smart replies team would like to thank several individuals who provided invaluable help during the design, testing, and implementation of smart replies. Specifically, we’d like to thank Tim Converse, Dan Bikel, and Siva Sooriyan from the engineering team. We’d also like to acknowledge Aashish Patel from the security team and Catalin Cosovanu from legal who provided valuable input during the design process. Finally, we’d like to especially thank Arpit Dhariwal from the product team.