Voices Part II: Technical Details for Topic Mining
July 1, 2016
This is the second post of a two-part series about Voices, a text analytics platform built by LinkedIn. Part I gives a general overview of Voices, while Part II will go into further detail about the technical aspects of topic mining specifically.
Topic mining, also known as topic modeling, is the technique of extracting the most important concepts from a collection of otherwise unstructured documents. Effective topic mining facilitates the understanding of information in large sets of unstructured data in an automated fashion.
Topic mining has been a popular research task in recent decades, and many practical applications for it have been developed, such as document clustering and summarization. The main goal of topic mining is twofold: first, as the number of documents grows rapidly, it is nearly impossible for humans to efficiently read and understand all of the included text, so automating this process becomes essential; second, topic mining helps increase the effectiveness and efficiency of critical text-based applications such as search indexing, document summarization, clustering, classification, and sentiment analysis.
Mining topics from a large amount of text is a highly complex task, not only because of the multifaceted nature of natural language, but also due to the inherent difficulty of extracting the right words and phrases to accurately represent the main concepts of documents. Existing solutions often use statistical and probabilistic methods to find significant topics. Popular methods include Term Frequency - Inverse Document Frequency (TF-IDF), co-occurrence analysis, and Latent Dirichlet Allocation (LDA). However, these methods often either introduce too many noisy topics or suffer from problems with scalability and efficiency.
At LinkedIn, our topic mining system is a pipeline of multiple Natural Language Processing (NLP) modules, as follows. We have implemented this system to handle topic mining functions in Voices, the text analytics platform we built to understand member feedback at scale.
1) Part-of-speech (POS) tagging: This function tags each sentence to obtain POS tags for individual words. Nouns, verbs, adjectives, and adverbs are among the most common POS categories. We use a Java implementation of the Stanford Log-linear POS tagger in our work. Each input document is split into sentences by the tagger’s sentence-splitting function. The tagger then produces a sequence of POS tags for each sentence. For example, the sentence “I went to Washington park yesterday” will have a POS sequence of “I/PRP went/VBD to/TO Washington/NNP park/NN yesterday/NN ./.” The POS tags used in this implementation are from the Penn Treebank English POS tag set.
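As a rough illustration of the tagger’s output format, here is a minimal Python sketch that tags the example sentence with a hard-coded lexicon. A real system would use a trained tagger such as the Stanford one; the lexicon here is purely hypothetical and covers only this sentence.

```python
# Toy lexicon-based POS tagger illustrating the word/TAG output format.
# The lexicon is a hypothetical stand-in for a trained tagger and covers
# only the example sentence; unknown words default to NN.
LEXICON = {
    "I": "PRP", "went": "VBD", "to": "TO",
    "Washington": "NNP", "park": "NN", "yesterday": "NN", ".": ".",
}

def tag_sentence(tokens):
    """Return the word/TAG sequence for a tokenized sentence."""
    return " ".join(f"{w}/{LEXICON.get(w, 'NN')}" for w in tokens)

print(tag_sentence(["I", "went", "to", "Washington", "park", "yesterday", "."]))
# I/PRP went/VBD to/TO Washington/NNP park/NN yesterday/NN ./.
```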
2) POS pattern matching: The goal of this step is to select POS tag sequences that match our predefined POS patterns, which are represented as regular expressions and may include a recursive noun phrase, a noun phrase followed by a verb phrase, or a verb phrase followed by a noun phrase. We observe that in customer feedback (also known as Voice of Member, or VOM, data), the most important topics are often either entities (noun phrases), such as “profile” or “homepage,” or events (verb phrases or actions associated with certain entities) such as “close account” or “payment approved.” This inspires us to look for patterns that govern such entities and events, and POS patterns of noun and verb phrases are well suited to the task. Hence, we define two different POS patterns for noun phrases and verb phrases, respectively. The first pattern defines a recursive noun phrase, which is a noun preceded by zero or more other noun phrases or modifiers (we use adjectives as modifiers; numbers and pronouns can be used as modifiers too). As a result, the phrase “secondary account,” with a POS sequence of secondary/JJ account/NN, will be matched to the regular expression for the recursive noun phrase to obtain two noun phrases: “account” and “secondary account.” The second pattern defines a recursive verb phrase, which is one or more consecutive verbs. For example, the phrase “has passed,” with a POS sequence of has/VBZ passed/VBN, will be matched to the regular expression for the recursive verb phrase to obtain two verb phrases: “passed” and “has passed.” Based on the POS patterns for noun phrases and verb phrases, we created the following three POS patterns for the two different types of topics: entities and events.
a) Entity topic: A noun phrase, which represents an entity such as “email” or “credit card.”
b) Event topic I: A noun phrase plus a verb phrase, which represents an event in the form of a noun phrase followed by a verb phrase, i.e., an action associated with an entity, like “application crashed,” “account closed,” or “previous transaction failed.”
c) Event topic II: A verb phrase plus modifiers plus a noun phrase, which represents an event in the form of a verb phrase followed by a noun phrase. The verb phrase may be separated from the noun phrase by a list of modifiers like pronouns or numbers, for instance “merge my accounts” or “close our old accounts.”
The POS pattern matching module scans the POS tag sequences for each sentence and detects if there is a match to any of the three defined POS patterns for entity and event topics. Each matched subsequence is treated as a candidate topic.
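The matching step can be sketched as a regular-expression scan over tag sequences. The patterns and helper below are illustrative simplifications, not the production regular expressions; the sketch enumerates every subsequence of a tagged phrase whose tags match the noun-phrase or verb-phrase pattern.

```python
import re

# Simplified stand-ins for the production POS patterns: a recursive noun
# phrase (modifiers/nouns ending in a noun) and a run of consecutive verbs.
NOUN_PHRASE = re.compile(r"(?:(?:JJ|NNPS|NNP|NNS|NN)\s)*(?:NNPS|NNP|NNS|NN)")
VERB_PHRASE = re.compile(r"(?:(?:VBD|VBG|VBN|VBP|VBZ|VB)\s?)+")

def candidate_phrases(tagged):
    """Extract candidate topics from a word/TAG sequence by checking every
    subsequence of tags against the phrase patterns."""
    pairs = [token.rsplit("/", 1) for token in tagged.split()]
    found = []
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs) + 1):
            tags = " ".join(tag for _, tag in pairs[i:j])
            if NOUN_PHRASE.fullmatch(tags) or VERB_PHRASE.fullmatch(tags):
                found.append(" ".join(word for word, _ in pairs[i:j]))
    return found

print(candidate_phrases("secondary/JJ account/NN"))
# ['secondary account', 'account']
```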
3) Topic pruning: This step reduces overlap and removes unnecessary words or phrases from the list of candidate topics produced by Step 2. We accomplish this by performing the following tasks.
a) Stemming. Stemming is the process of reducing inflected words to their stems. It is a key technique in information retrieval and text mining tasks. In the topic mining context, stemming of inflected words in the candidate topics may transform three candidate topics of “view profile,” “view profiles,” and “viewed profile” into the same cleaned candidate topic of “view profile.” In our system, we use a Java implementation of the Porter Stemmer. During stemming-related merging of candidate topics, words that appear most frequently among the inflected words (e.g., “view” and “profile”) may be selected for inclusion in the final cleaned candidate topic (e.g., “view profile”).
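A minimal sketch of this merging step, using a simplified suffix stripper in place of the full Porter stemmer (the real algorithm handles many more inflection patterns), might look like this:

```python
# Simplified suffix-stripping stemmer, a stand-in for the Porter stemmer
# (the real algorithm handles many more inflection patterns).
def stem(word):
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def merge_inflections(candidates):
    """Group candidate topics whose per-word stems are identical."""
    groups = {}
    for phrase in candidates:
        key = " ".join(stem(w) for w in phrase.split())
        groups.setdefault(key, []).append(phrase)
    return groups

print(merge_inflections(["view profile", "view profiles", "viewed profile"]))
# {'view profile': ['view profile', 'view profiles', 'viewed profile']}
```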
b) Removing stop words. Cleaning the candidate topics also includes removing stop words. For example, common stop words like articles, prepositions, pronouns, conjunctions, particles, or other function words may be removed from the candidate topics. As a result, candidate topics like “close the account” and “closed his account” may be processed into the same cleaned candidate topic of “close account.” In our system, we use a standard list of stop words.
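A stop-word filter over candidate topics could be sketched as follows; the stop-word set here is a small illustrative subset of a standard list.

```python
# Remove common function words from candidate topics. The stop-word set
# here is a small illustrative subset of a standard list.
STOP_WORDS = {"the", "a", "an", "his", "her", "my", "our", "of", "to"}

def remove_stop_words(topic):
    return " ".join(w for w in topic.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words("close the account"))   # close account
print(remove_stop_words("closed his account"))  # closed account
```

Combined with the stemming step above, both examples collapse to the same cleaned topic, “close account.”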
c) Removing domain-specific stop words. To further facilitate cleaning of the candidate topics, a list of domain-specific stop words that do not add value to the candidate topics may also be removed. For example, domain-specific stop words associated with the use of any social network may include words or phrases such as “additional information,” “contact us,” “original message,” “same problem,” “website,” “other sites,” “clicking the link,” and “com.” In our system, we use a list of 234 domain-specific stop words that are manually identified.
d) Merging synonyms and semantically-related lexical items. Finally, the candidate topics can be refined by merging synonyms or semantically-related lexical items. For example, a domain-specific synonym dictionary can be used to match synonyms such as “email address” and “email account” and to merge the synonyms into a common topic. Similarly, a lexical database such as WordNet can be used to relate or merge semantically-related words such as “link,” “connection,” “association,” “partnership,” and “relationship.” In our system, we use a list of 75 synonym pairs that are manually identified.
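Both pruning strategies can be sketched together: filter out domain-specific stop phrases, then canonicalize through a synonym dictionary. The entries below are illustrative examples, not the production lists of 234 stop words and 75 synonym pairs.

```python
# Illustrative pruning with domain-specific stop phrases and a synonym
# dictionary; the entries are examples, not the production lists.
DOMAIN_STOP_PHRASES = {"contact us", "original message", "same problem"}
SYNONYMS = {"email account": "email address"}

def prune_candidates(candidates):
    kept = (t for t in candidates if t not in DOMAIN_STOP_PHRASES)
    return [SYNONYMS.get(t, t) for t in kept]

print(prune_candidates(["email account", "contact us", "close account"]))
# ['email address', 'close account']
```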
Use of domain-specific stop words and synonyms helps prune the candidate topics. However, these two strategies are optional for a general topic mining system.
4) Topic ranking: Finally, we rank the candidate topics remaining after the previous steps and select the best ones. Once candidate topics are cleaned, we still need a metric to order them so that we can generate a final set of topics. We rank the candidate topics in two steps. First, we calculate the TF-IDF value of each cleaned candidate topic in a document and keep up to five (empirically set) topics with the highest TF-IDF values for each document. Since TF-IDF is designed to extract topics from individual documents in a collection, rather than from the collection as a whole, the output topics from all documents must be combined properly to generate a single list of topics for the whole document set. We use document frequency for this: the top five topics from each document are ranked according to their document frequency in the whole document set, and the topics at the top of the ranked list become the final topics for the document set.
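The two-step ranking can be sketched as follows, assuming each document’s cleaned candidate topics are given as a list (with repeats, so term frequency can be counted):

```python
import math
from collections import Counter

def rank_topics(doc_topics, top_per_doc=5):
    """Two-step ranking: per-document TF-IDF selection, then collection-wide
    ordering by document frequency among the surviving topics."""
    n_docs = len(doc_topics)
    df = Counter()                      # document frequency per topic
    for topics in doc_topics:
        df.update(set(topics))

    survivors = Counter()               # counts docs where a topic made the top list
    for topics in doc_topics:
        tf = Counter(topics)
        tfidf = {t: (tf[t] / len(topics)) * math.log(n_docs / df[t])
                 for t in tf}
        survivors.update(sorted(tfidf, key=tfidf.get, reverse=True)[:top_per_doc])

    return [topic for topic, _ in survivors.most_common()]

docs = [
    ["close account", "close account", "merge accounts"],
    ["close account", "primary account"],
    ["merge accounts", "secondary account"],
]
print(rank_topics(docs))
```

Topics that survive the per-document TF-IDF cut in many documents rise to the top, which is why pre-filtering matters: ranking raw phrases by document frequency alone would surface generic phrases instead.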
The core intuition of this multi-module pipeline is that applying any one module by itself, e.g., TF-IDF calculation without pre-filtering, would produce noisy and inaccurate topics. Our method works well for VOM data in natural language such as forum discussions, group updates, blogs, etc. Topics generated by the system can facilitate understanding and use of information in VOM without manual review of the content. The topics may be used to provide information regarding the themes associated with the source documents. For example, account-related user complaints may include topics such as “primary account,” “merge accounts,” “close account,” “duplicate accounts,” and “secondary account.” Profile-related user complaints may include topics such as “remove connection,” “address book,” “import contacts,” “send invitations,” and “pending invitations.” These topics can be visualized using a wheel, where the inner circle represents entity topics while the outer circle consists of action topics for each given entity topic.
The topics may also be used to classify or group the user complaints for further processing by customer service representatives, to identify sentiments associated with the topics, to facilitate searching of the user complaints, or to generate summaries of content associated with the topics.
Furthermore, topics can be used to develop our trending insights algorithm, which looks for topics that show a noticeable change compared to previous weeks or days. This gives us important signals in social media and community feedback. Lastly, the topics can be used as features for text classification to reduce feature dimensionality and improve efficiency.
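A minimal sketch of such trending detection might compare each topic’s weekly counts and flag large jumps; the ratio and minimum-count thresholds here are illustrative, not production values.

```python
# Flag topics whose weekly count jumps noticeably versus the prior week.
# The min_ratio and min_count thresholds are illustrative, not production
# values.
def trending(prev_counts, curr_counts, min_ratio=2.0, min_count=10):
    return [topic for topic, curr in curr_counts.items()
            if curr >= min_count
            and curr >= min_ratio * max(prev_counts.get(topic, 0), 1)]

print(trending({"close account": 5}, {"close account": 20, "merge accounts": 4}))
# ['close account']
```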
We are grateful for contributions from team members across the company. Special thanks go to Ben Ma, Henry Wu, and Vita Markman for developing and deploying text mining functions in Hadoop + Spark, and to Alexis Zheng, Hu Wang, and Rachelle Morris for evaluating our algorithms and monitoring quality over time.