How natural language processing helps LinkedIn members get support easily
April 30, 2019
As members explore LinkedIn's many products, including the feed, homepage, Learning, Recruiter, and Sales Navigator, to name a few, they often experience exciting new features, which, from time to time, may lead to questions about how to use them. To help our members and deliver the assistance they need, we use deep learning-powered Natural Language Processing (NLP) to predict the best answers for help requests. Given our scale of 610 million+ members, NLP gives us the best shot at providing those answers: every day, the NLP-based automated system solves over one thousand tickets. The questions can range from the broad, such as account settings, to the specific, like enhancing a member's profile page. To ensure members can get the help they need easily, we offer the Quick Help Widget under the Settings tab, as well as a dedicated Help Center search.
This blog shares more details about how we leverage NLP to process the questions from our members so they can easily use and enjoy all the features of LinkedIn.
Initial Help search system
In 2016, we built the first iteration of our Help search system, called Care Search. We did not use NLP at the time to solve this problem. Instead, we created an algorithm that calculated the score of every help article based on title, body, and keywords. This score was then used to rank the best answers to our members’ help requests.
Our Care Search was built upon Galene, a Search-as-a-Service (SeaS) infrastructure that powers a multitude of search products at LinkedIn. The underlying index is a Lucene index. During the search phase, we scored each hit (help center article) with the BM25F algorithm, a per-field TF-IDF (term frequency-inverse document frequency) algorithm. The idea behind it is that each article has a title, a body, and potentially many keywords, and each field should carry a different weight in the scoring. For example, for the query "premium membership," articles with "premium" in the title should score higher than those with "premium" only in the body. Thus, when calculating the BM25F score, we give the highest weight to hits in the title, then hits in keywords, and lastly, hits in the body. By using this strategy, we were able to return solid results for popular and well-formed queries.
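The per-field weighting can be sketched as a simplified, BM25F-flavored scorer. The field weights, IDF values, and sample articles below are illustrative, not our production values:

```python
import math

# Illustrative field weights: title > keywords > body, as described above.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}

def score_article(query_terms, article, idf):
    """Simplified BM25F-style score: sum IDF-weighted, field-weighted,
    saturating term frequencies across the article's fields."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = article.get(field, "").lower().split()
        for term in query_terms:
            tf = tokens.count(term)
            if tf:
                # Saturating tf, as in the BM25 family (k1 = 1.2 here).
                score += idf.get(term, 0.0) * weight * tf * 2.2 / (tf + 1.2)
    return score

articles = [
    {"title": "Premium Membership Options", "keywords": "upgrade billing",
     "body": "Learn about premium plans."},
    {"title": "Creating an Account", "keywords": "signup",
     "body": "Premium features are described elsewhere."},
]
idf = {"premium": 1.5, "membership": 2.0}

scores = [score_article(["premium", "membership"], a, idf) for a in articles]
# The article with "premium" in its title outranks the one with it in the body.
```

In a real Lucene/Galene index, document length normalization and the full BM25 saturation parameters would also apply; the sketch only shows the field-weighting idea.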
Why the system wasn’t meeting our needs
It turned out that, in Help searches, members tended to use a variety of expressions for questions on “what,” “how,” or “where.” This made Help search very different from People Search or Job Search. In many cases, the search expression differed from the terminology used in our articles. As a result, people who searched for "how to deactivate my account" would get served articles titled "create account" at the top, since we did not have "deactivate" in the index. Another example: a search for "uploaded my CV" would return nothing, because our index only contained "resume" and "upload."
The solution we built
To better understand members’ queries and to improve the quality of the search results, we developed an NLP workflow:
Text analysis: A query like "how cancelling my premium accounts immediately" becomes normalized to "cancel premium account."
Query mapping: Based on the member’s query, we will find a popular query like "cancel premium subscription."
Intent classification: We identify the intent of the query using a Convolutional Neural Network (CNN).
Let’s zoom into each step of the above workflow.
The text analyzer parses text and simplifies it to extract the core information. It is the basis of all the NLP components in our system, used both for query processing and for index building. It has four steps, as shown in Figure 1:
Figure 1: Text Analyzer
Tokenization: Break the text into words.
Lemmatization: Find the basic form of each word variation in the context. Examples are "account" from "accounts" and "break" from "broke."
Stop Word Filter: Filter out common words. In English, there are hundreds of stop words like "a," "my," and "on," to name a few, that have little bearing on relevance or meaning, and thus can safely be removed from the query in order to target the more valuable words.
Part of Speech (PoS) Filter: Read through text and give each word a PoS based on the context. There are nine parts in English: adjective, adverb, conjunction, determiner, noun, number, preposition, pronoun, and verb. We only capture nouns, verbs, proper nouns, and adjectives, as they together represent the purpose of a text.
Query mapping converts the simplified query generated by text analysis into a "rep" (representative) query, which is more relevant to the article data. Inspecting members’ query history, we found that some queries had good search results, while others had not-so-good results. The latter mostly occurred when the query did not match any of the words in the articles. To bridge the gap between a member’s terminology and a given article’s terminology, we built a representative query mapping that converts the raw query into more representative queries. For example, imagine a set of member queries on closing accounts: "how to unsubscribe linkedin," "leave linkedin," "delete linkedin," and "close account." Among these queries, "close account" gets the best search results, since it matches the title of the target article "Closing Your LinkedIn Account." So, we say "close account" is the rep query for the other three queries. Generating the rep query for each raw query requires two parts:
Query grouping: We first calculate the edit distance of raw queries. This is the number of operations (insert/delete/substitute) needed to transform one text into another. For instance, the edit distance between "close account" and "closed accounts" is two. Then, we define the similarity of two queries with the following:
sim = 1 – d / max(|q1|, |q2|), where d is the edit distance between the two queries, and max(|q1|, |q2|) is the length, in characters, of the longer query.
Secondly, Jaccard index is also employed to measure the similarity of two queries at the word-level. It represents the number of overlapping words against total unique words between two queries. For example, there are two shared words between "cancel premium subscription" and "cancel premium membership," but four unique words in total. Thus, the Jaccard index is ½.
Finally, we group raw queries with low edit distance and high Jaccard index together.
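Both similarity measures are easy to compute. A minimal sketch follows; the grouping thresholds are illustrative, and queries are assumed to have already been normalized by the text analyzer:

```python
def edit_distance(a, b):
    """Levenshtein distance at the character level."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def char_sim(q1, q2):
    """sim = 1 - d / max(|q1|, |q2|), as defined above."""
    return 1 - edit_distance(q1, q2) / max(len(q1), len(q2))

def jaccard(q1, q2):
    """Word-level overlap: shared words over total unique words."""
    w1, w2 = set(q1.split()), set(q2.split())
    return len(w1 & w2) / len(w1 | w2)

def same_group(q1, q2, sim_t=0.5, jac_t=0.3):
    """Group queries with low edit distance AND high Jaccard index
    (thresholds here are illustrative, not production values)."""
    return char_sim(q1, q2) >= sim_t and jaccard(q1, q2) >= jac_t
```

For instance, `edit_distance("close account", "closed accounts")` is 2, matching the example above, and `jaccard("cancel premium subscription", "cancel premium membership")` is 1/2.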
Topic mining: Query grouping puts similar queries together. However, for a given query, it does not tell which query in its group is the most relevant. To figure that out, we first utilize the text analyzer to extract all the topics from articles. Then, we use a TF-IDF algorithm to filter out popular topics. For example, when it comes to the article "Merging or Closing Duplicate Accounts on LinkedIn," the extracted highest ranked topics will be "merge connection," "merge duplicate account," "close duplicate account," and "find other account." Then, within each query group, we calculate the rep score:
rep score = max(0.2 * sim(rq, q) + 0.5 * sim(q, title) + 0.3 * sim(q, body))
Here, the maximum is taken over all queries q that are similar to the raw query rq, where sim(rq, q) is the similarity between the raw query rq and the more popular query q, sim(q, title) is the maximum similarity between q and one of the topics from the title, and sim(q, body) is the maximum similarity between q and one of the topics from the body.
We rank rep score, and select the top k (k = 3 works well for our case) rep queries for each query to generate the rep query mappings.
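Putting the pieces together, the rep-query selection can be sketched as below. The weights follow the formula above; word-level Jaccard similarity is used as a stand-in for the similarity function, and the helper names and sample data are ours:

```python
def jaccard(q1, q2):
    """Word-level similarity stand-in for sim(., .)."""
    w1, w2 = set(q1.split()), set(q2.split())
    return len(w1 & w2) / len(w1 | w2)

def rep_score(raw_query, candidate, title_topics, body_topics, sim):
    """0.2 * sim(rq, q) + 0.5 * sim(q, title) + 0.3 * sim(q, body),
    with the title/body terms maximized over the article's topics."""
    return (0.2 * sim(raw_query, candidate)
            + 0.5 * max(sim(candidate, t) for t in title_topics)
            + 0.3 * max(sim(candidate, t) for t in body_topics))

def top_rep_queries(raw_query, group, title_topics, body_topics, sim, k=3):
    """Rank the raw query's group by rep score and keep the top k."""
    ranked = sorted(group,
                    key=lambda q: rep_score(raw_query, q, title_topics,
                                            body_topics, sim),
                    reverse=True)
    return ranked[:k]

group = ["unsubscribe linkedin", "leave linkedin", "close account"]
best = top_rep_queries("delete linkedin", group,
                       title_topics=["close account"],
                       body_topics=["close duplicate account"],
                       sim=jaccard)
# "close account" wins: it matches the title topic exactly.
```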
Rep query works well for common queries. However, for long-tail queries, rep query is usually empty, as there are not enough data points. To address this, we created a CNN-based deep learning model to identify the intent of each query.
A CNN is used to capture local "spatial" patterns in data. It is well suited to problems where nearby elements are more closely related than distant ones, such as image recognition and text classification. Intents are extracted from articles: we first group articles based on intent. For instance, articles titled "Canceling Your Premium Subscription" and "Canceling or Updating a Premium Subscription Purchased on Your Apple Device" are considered to have the same intent of "cancel premium." Then, we extract all the intents from the grouped articles.
Figure 2: Intent Classifier
As described in Figure 2, the whole procedure of Intent Classifier is as follows:
During training time, a set of queries with intents will be loaded into CNN for deep learning.
First, each query will be transformed into a sentence matrix.
Second, CNN will use multiple filters to "scan" and do convolution through the sentence matrix, and produce feature maps.
Last, with feature maps and labeled intents, the Classifier is able to group features for each intent. At the same time, the Classifier measures the distance between different feature maps; feature maps with a small distance should belong to the same intent. Queries (and their feature maps) that are incorrectly categorized by the Classifier are back-propagated to the CNN to tune parameters such as (1) convolution (number of filters, filter size, weights within each filter) and (2) pooling (window size, window stride).
During online serving time, given a member query, CNN is able to generate features for it. Then, Classifier is able to map features to proposed intents (articles) with different probabilities.
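Conceptually, the convolution-plus-pooling step over a sentence matrix can be sketched in a few lines. The random embeddings and untrained filters below are toys; the real model learns these parameters through the back-propagation described above:

```python
import random

random.seed(0)
EMB_DIM = 4  # toy embedding size; real models use hundreds of dimensions

def embed(sentence, vocab_vecs):
    """Sentence matrix: one (random, toy) embedding vector per token."""
    return [vocab_vecs.setdefault(w, [random.uniform(-1, 1)
                                      for _ in range(EMB_DIM)])
            for w in sentence.split()]

def conv_feature(matrix, filt, width=2):
    """Slide a filter spanning `width` tokens over the sentence matrix,
    then max-pool, yielding one feature per filter."""
    outs = []
    for i in range(len(matrix) - width + 1):
        window = [x for row in matrix[i:i + width] for x in row]
        outs.append(sum(w * x for w, x in zip(filt, window)))
    return max(outs) if outs else 0.0

vocab = {}
filters = [[random.uniform(-1, 1) for _ in range(EMB_DIM * 2)]
           for _ in range(3)]
features = [conv_feature(embed("cancel premium subscription", vocab), f)
            for f in filters]
# Three filters -> a three-dimensional feature map fed to the classifier.
```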
Figure 3: NLP Architecture in Care Search
Figure 3 shows the overall architecture of the revised Care Search system and how NLP weighs in on the system. During offline index generation, lemmatization is used to standardize the article content. During online search, the raw query will first be converted to rep query and then sent to both Galene and Intent Classifier. Here, Galene is for keyword search, while Intent Classifier is for relevance match. After that, hits from these two parts will be merged and scored further based on features such as view count, freshness, category, etc. Finally, the ranked results are returned to the member.
Measurements and results
We measure the performance of the NLP solutions from three aspects:
Click-through rate (CTR): The ratio of searches with at least one click on the result page to the number of total searches. This is to reflect if the search results are clear to members.
"Happy path" session rate: Defined as a session in which the member clicks exactly one article from the search results and then leaves the page without creating a case. This path is the ideal member experience we try to achieve.
"Undesired" session rate: This metric is meant to evaluate the most undesired path, which is when a member completes a search, clicks on no articles, and creates a case directly. This metric helps us track how updates to our search algorithm affect increases in case volume, and indirectly, measure the relevance of our results.
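Given raw session logs, the three metrics above reduce to simple counts. A minimal sketch follows; the session-log schema is made up for illustration:

```python
def session_metrics(sessions):
    """Compute CTR, "happy path" rate, and "undesired" rate from session
    logs; each session records its click count and whether a case was
    created (hypothetical schema)."""
    n = len(sessions)
    ctr = sum(s["clicks"] >= 1 for s in sessions) / n
    happy = sum(s["clicks"] == 1 and not s["case"] for s in sessions) / n
    undesired = sum(s["clicks"] == 0 and s["case"] for s in sessions) / n
    return ctr, happy, undesired

sessions = [
    {"clicks": 1, "case": False},  # happy path
    {"clicks": 0, "case": True},   # undesired path
    {"clicks": 2, "case": False},  # clicked, but not a happy path
    {"clicks": 0, "case": False},  # abandoned without a case
]
ctr, happy, undesired = session_metrics(sessions)
# → ctr 0.5, happy 0.25, undesired 0.25
```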
Based on the above metrics, we observed a significant performance increase with our new search. First, CTR improved from 39% to 69%. Furthermore, the "happy path" rate increased from 16.2% to 29%, while the "undesired" path rate decreased from 6.3% to 2.8%. Both session-rate changes were also found to be statistically significant.
To benefit non-English-speaking members, we are currently rolling out a German deep learning model and will work on other languages, such as Portuguese, French, Spanish, and Italian, in order of the size of our member base in those languages.
There are situations in which a follow-up interaction is needed to better understand a member's question and optimize the search results:
- Ambiguity detection: To identify whether a question is ambiguous, we use NLP to get the PoS for each word. Based on our analysis, a good question should contain at least one verb and one noun, like [do something]. If there is just a verb, such as "cancel," we will ask a follow-up question, such as "What do you want to [verb]?" Alternatively, if there is just a noun, such as "account," we will frame the follow-up question as "What do you want to do for [noun]?"
- Step-by-step guidance: For the types of "how to" questions, it is better to return step-by-step instructions directly in the search results. To do this, we annotate articles that could contain parallel sections to cover different scenarios, such as device type or product type. A follow-up interaction would focus on specifying the member’s scenario.
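The verb/noun heuristic for ambiguity detection can be sketched as follows; the tiny PoS lexicon and question templates are illustrative stand-ins for the real tagger and copy:

```python
# Toy PoS lexicon; a real system would reuse the text analyzer's PoS tagger.
POS = {"cancel": "verb", "close": "verb",
       "account": "noun", "subscription": "noun"}

def follow_up(query):
    """Return a clarifying question when a query lacks a verb or a noun,
    or None when the query already has the [do something] shape."""
    tags = {POS.get(w) for w in query.lower().split()}
    if "verb" in tags and "noun" in tags:
        return None  # unambiguous: verb + noun present
    if "verb" in tags:
        return "What do you want to %s?" % query
    if "noun" in tags:
        return "What do you want to do for %s?" % query
    return "Can you tell us more about your question?"

print(follow_up("cancel"))   # → What do you want to cancel?
print(follow_up("account"))  # → What do you want to do for account?
```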
Typo detection and correction is essential for any successful search service. In the future, we’re looking to integrate with Microsoft Bing Spell Check to correct members’ typos on the fly and return search results based on the corrected query.
This work is a multi-team effort across the Trust, Search AI Foundation, and Data Science teams. Special thanks to James Gatenby, Zack Mulgrew, Xiaofeng Wu, and Laura Dansbury from Trust; Weiwei Guo, Jaewon Yang, Huiji Gao, and Bo Long from Search AI Foundation; Zhou Jin, Xinling Dai, Rachel Zhao, and Tiger Zhang from Data Science; and Szczepan Faber and Ning Xu for reviewing this blog.