Voices: a Text Analytics Platform for Understanding Member Feedback

Yongzheng (Tiger) Zhang

Head of ML @ Ontra | ex-[Nextdoor, LinkedIn, eBay]

June 10, 2016

In the era of big data, corporations and businesses are increasingly collecting immense amounts of unstructured data in the form of free text, from customer service conversations to market research surveys. While it is clear that such member feedback, or “Voice of the Member” (VOM), contains valuable information, it is often less clear how to best analyze such data at scale.

It is important to find the main topics/themes in VOM data not only to understand members’ concerns and pain points, but also to derive insights so that better business decisions can be made to improve products and user experiences. Some typical examples are as follows:

In market research via Net Promoter Score (NPS) surveys, we want to understand why members recommend the brand/site to others, i.e. the drivers for lifting the NPS score for the company. A topic such as “build network” in NPS surveys gives us a clue that members like the site as an effective tool for building their social network.
From app reviews, we wish to understand the user experience on apps and seek opportunities to fix problems and improve products. For example, the topic “app crashed” in reviews indicates a potential flaw in the app.
For customer service (CS) emails, the main goal is to find the most frequently reported problems. For instance, the volume of the term “merge accounts” in CS emails helps us understand how many members have multiple profiles/accounts and the severity of the problem.

All of these topics follow the same pattern: an entity plus an associated action.

Text mining (also known as text analytics), the computational study of unstructured text using advanced data mining and natural language processing techniques, can greatly help with the tasks above. Key components of text mining often include but are not limited to: topic mining, text classification, text clustering, and taxonomy construction.

The market space for text analytics is very crowded (see Figure 1), with many vendors and open source tools available. With so many choices, why did we still build our own solution? The main reason is that we want scalability, flexibility, and focus. The ideal solution should be scalable, as we are dealing with a huge volume of data of different kinds from multiple channels. We also want flexibility as we investigate and integrate different text mining functions into our system. Finally, we want to focus on a certain segment of data, for example, data that is relevant to LinkedIn. Other important factors we considered when choosing a text analytics platform were time, development cost, and maintenance cost.

Figure 1. Text analytics vendors and open source tools.

At LinkedIn, we have built Voices, a text analytics platform that provides easy access to member feedback about our website and key products. Voices aggregates unstructured text across both internal data sources (e.g. LinkedIn posts, customer support cases, NPS survey results) and external ones (e.g. social media such as Facebook and Twitter, news, forums, and blogs). Structured member data and unstructured textual data from these channels are ingested into HDFS and passed through a suite of text mining functions. This allows Voices to surface relevant insights along dimensions such as value proposition, product, and sentiment, as well as trending insights, in support of many use cases.

We aggregate internal data sources and purchase external data from vendors, who pull relevant information from publicly-available data on social platforms and online news, blogs, and forums. Additional data attributes (e.g. geography, sentiment, and audience segment) enable deep dives into business domains. Voices also includes reviews for major LinkedIn apps from the Apple App Store and Google Play.

Text Mining in Voices

Text mining is the computational study of unstructured text to understand members’ feedback and gain insights for better business decisions. It would take years for a person to read millions of text documents manually, which is infeasible for any business. Hence, effective and efficient text mining functions are in great demand to deal with enormous volumes of unstructured text.

In Voices, there are three key text mining components, as illustrated in Figure 2.

Relevance Solution
Classification Engine
Topic Mining

Figure 2. Architecture of text mining in Voices.

Relevance Solution

When dealing with huge volumes of unstructured text in social media, it is critical to identify the content that is relevant to LinkedIn and our products and services. This step must be completed before any further analytics can be conducted. In Voices, we take a machine learning approach to the relevance problem: we build models from examples known to be relevant or irrelevant to our business, and then apply the learned model to each new document to predict how relevant it is.

Classification Engine

To implement the relevance solution with machine learning, we have developed a generic text classification framework. It builds a Support Vector Machine (SVM) model from sample documents with known labels from predefined categories (e.g. a list of customer service tickets with known products, or a list of app reviews with sentiment tags). The model can then be used to predict the categories of new text documents. This framework has many other applications, such as sentiment analysis, product classification, and value proposition classification.
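To make the idea concrete, here is a minimal sketch of a linear SVM relevance classifier in plain Python. It is not the Voices implementation: the bag-of-words features, the Pegasos-style sub-gradient training loop, the hyperparameters, and the tiny training set are all assumptions chosen for illustration, and a production system would more likely rely on an established SVM library running on distributed infrastructure.

```python
import random
import re


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_linear_svm(docs, labels, epochs=50, lam=0.01, seed=0):
    """Fit bag-of-words weights by Pegasos-style sub-gradient descent on the
    L2-regularized hinge loss. Labels must be +1 or -1."""
    rng = random.Random(seed)
    examples = [(tokenize(doc), label) for doc, label in zip(docs, labels)]
    weights = {}
    step = 0
    for _ in range(epochs):
        rng.shuffle(examples)
        for tokens, label in examples:
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            margin = label * sum(weights.get(t, 0.0) for t in tokens)
            # Regularization: shrink every weight by 1 - eta * lam = 1 - 1/step.
            shrink = 1.0 - 1.0 / step
            for t in weights:
                weights[t] *= shrink
            # Hinge loss: update only when the example falls inside the margin.
            if margin < 1.0:
                for t in tokens:
                    weights[t] = weights.get(t, 0.0) + eta * label
    return weights


def score(weights, text):
    """Signed decision value; larger means more relevant."""
    return sum(weights.get(t, 0.0) for t in tokenize(text))


def predict(weights, text):
    """Return +1 (relevant) or -1 (irrelevant)."""
    return 1 if score(weights, text) > 0 else -1


# Hypothetical labeled examples: +1 = relevant to LinkedIn, -1 = irrelevant.
DOCS = [
    "the linkedin app crashed when I opened my profile",
    "how do I merge accounts on linkedin",
    "linkedin helps me build my network",
    "best pasta recipe for dinner",
    "chocolate cake recipe for my birthday",
    "easy soup recipe for the weekend",
]
LABELS = [1, 1, 1, -1, -1, -1]

model = train_linear_svm(DOCS, LABELS)
print(predict(model, "linkedin app crashed again"))  # prints 1
print(predict(model, "simple pasta recipe"))         # prints -1
```

The same training routine can serve the other applications mentioned above by changing the labels, for example positive versus negative sentiment. Multi-class problems such as product classification would need one such binary model per category.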

Topic Mining

Other than the text classification engine (and relevance solution as the fundamental application), another key text mining component is topic mining. Topic mining, also known as topic modeling or theme identification, is the technique of extracting the most important concepts and associated actions from unstructured text. Our topic mining system is a pipeline of multiple Natural Language Processing (NLP) modules, including: 1) part-of-speech (POS) tagging; 2) POS pattern matching; 3) topic pruning; and 4) topic ranking. The core intuition of this multi-module pipeline is that applying any one module by itself would produce noisy and inaccurate topics.
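The four stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `LEXICON` stands in for a trained POS tagger, the two POS patterns (verb followed by noun, noun followed by verb) are guesses at what "entities plus associated actions" might look like, pruning is reduced to a minimum document-frequency threshold, and ranking is by document frequency. None of these specifics come from the Voices system itself.

```python
import re
from collections import Counter

# Toy lexicon standing in for a trained POS tagger (illustrative assumption).
LEXICON = {
    "app": "NOUN", "accounts": "NOUN", "network": "NOUN", "profile": "NOUN",
    "crashed": "VERB", "merge": "VERB", "build": "VERB",
}

# Hypothetical POS patterns for "entity plus action" topics.
PATTERNS = {("VERB", "NOUN"), ("NOUN", "VERB")}


def pos_tag(text):
    """Stage 1: tag each token; words missing from the lexicon become OTHER."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(tok, LEXICON.get(tok, "OTHER")) for tok in tokens]


def match_patterns(tagged):
    """Stage 2: keep adjacent word pairs whose tags match a pattern."""
    candidates = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in PATTERNS:
            candidates.append(f"{w1} {w2}")
    return candidates


def mine_topics(docs, min_count=2, top_k=5):
    """Stages 3 and 4: prune rare candidates, then rank by document frequency."""
    counts = Counter()
    for doc in docs:
        # Use a set so each document counts a topic at most once.
        counts.update(set(match_patterns(pos_tag(doc))))
    pruned = {topic: n for topic, n in counts.items() if n >= min_count}
    # Sort by descending frequency, breaking ties alphabetically.
    return sorted(pruned.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical member feedback.
FEEDBACK = [
    "The app crashed after the update",
    "App crashed again when I tried to merge accounts",
    "Please help me merge accounts",
    "I want to merge accounts into one profile",
    "Great way to build network",
    "The app crashed on login",
]

print(mine_topics(FEEDBACK))
# [('app crashed', 3), ('merge accounts', 3)]
```

In this toy run, "build network" is matched by the pattern stage but pruned because it appears in only one document, which shows how the later stages filter the noise that pattern matching alone would let through.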

Our method works well for VOM data in natural language such as forum discussions, group updates, blogs, etc. Topics generated by the system can be used to: 1) facilitate understanding and use of information in VOM without manual review of the content; 2) classify and/or group the user complaints for further processing by customer service representatives; 3) identify sentiments associated with the topics; 4) facilitate searching of the user complaints; 5) generate summaries of content associated with the topics; and 6) use as features for text classification to reduce feature dimension and improve efficiency.

Discussion

While developing the Voices system, we learned several lessons that we can share with the community.

First, as text mining practitioners, we often have to choose between vendor products, open source tools, and in-house solutions. While there is no single answer for all scenarios, it is important to balance the key factors: quality, efficiency, flexibility, scalability, and cost (including both development and maintenance cost).

Second, we need to make tradeoffs between quality and efficiency. For example, Latent Dirichlet Allocation (LDA) is a state-of-the-art topic modeling method, but it is computationally expensive. In practice, there are approaches that produce somewhat lower-quality results but are much more efficient and scalable. In industry applications, a gain in efficiency and scalability without too much loss in quality is often preferred.

Third, whenever possible, we strive to leverage big data infrastructure such as Hadoop and Spark to deliver truly scalable text mining functions.

Last, but not least, visualization is very important for telling stories from the results of text mining. For example, there are many options for displaying topics, such as a word cloud or a topic wheel. A good visualization can quickly and effectively tell a story that supports better decision-making, which in turn helps improve products and user experiences.

Summary

In summary, we have built a scalable text analytics platform with innovative text mining solutions via advanced machine learning and natural language processing techniques. Such a platform allows us to listen to feedback from our community, drive actionable insights for better business decisions, and eventually create impact for our members.

Acknowledgements

Voices is a great team effort. We are grateful for contributions from team members across the company. Special thanks go to the Voices team and partners, including: Alexis Zheng, Andrew Park, Ben Ma, Chi-Yi Kuan, Henry Wu, Hu Wang, Justin Park, Rachel Zhao, Rachelle Morris, Sui Yan, Tiger Zhang, Vita Markman, Weidong Zhang, and Wendy Shi. We would also like to thank our extended management team, Laura Dholakia, Scott Shute, Sunil Manhapra, and Kapil Surlaker, for their constant encouragement and support over the past two years.
