Text analytics on LinkedIn Talent Insights using Apache Pinot
June 16, 2021
LinkedIn Talent Insights (LTI) is a platform that helps organizations understand the external labor market and their internal workforce, and enables the long-term success of their employees. Users of LTI have the flexibility to construct searches using the various facets of the LinkedIn Economic Graph (skills, titles, location, company, etc.). Users can generate customized talent metrics using the values present in LinkedIn’s taxonomy, with each entity in the taxonomy represented by a unique identifier. For example, the title “software engineer” is represented with title ID 9. Historically, a user’s search was constrained by the taxonomy values available. A skill is added to the taxonomy once LinkedIn detects sufficient growth in the number of members adding the skill to their profile; as a result, there’s a delay between when a skill first appears on profiles and when it is added to the taxonomy, which limited users’ ability to search for an emerging skill or technology with LTI.
To solve this problem, we decided to add keyword search functionality to LTI. Keyword search unlocks the ability for a user to search for any text present on LinkedIn profiles. Adding this functionality to LTI empowers users to construct searches using specific keywords and generate a wider variety of talent pool metrics to better understand the talent landscape.
Many LTI users are also users of LinkedIn Recruiter, which already supported keyword search. The two products are integrated to enable recruiters to define their hiring strategy in LTI and then source candidates in Recruiter. However, the two products are built on two entirely different tech stacks. As discussed in a previous blog post, the Apache Pinot OLAP datastore is the foundation for the LTI platform, because it computes all the talent metrics displayed to the user. On the other hand, LinkedIn Recruiter is built on top of Galene (a Lucene-based index), because it has a heavy focus on search relevance/ranking so that a recruiter can find the best candidates to reach out to. This use case is different from LTI, which focuses on returning accurate counts for understanding macro trends across talent pools and companies.
Given that we wanted keyword search functionality for LTI somewhat similar to the one in LinkedIn Recruiter, one initial thought that came to mind was: “Should we leverage Galene for powering LTI’s keyword search functionality?” We decided against that approach, since supporting keyword search through Pinot instead would provide the most seamless integration, because Pinot already powers all the analytics for the LTI platform. In this blog post, we will discuss our journey to add text search support in Pinot and how it unlocked keyword search functionality for LTI. The text search support in Pinot was contributed directly to the Apache Pinot open source project.
Support for text analytics in Pinot
The LTI keyword search feature needs the ability to extend the existing analytical SQL queries on Pinot with term, phrase, and regex searches over gigantic blobs of text representing an aggregated set of keywords from LinkedIn members’ profiles. To give a brief insight into the scale of the data, the average raw text column size is 400GB per Pinot table. Text is stored in a STRING column, and each column value contains up to 2 million characters.
For LTI, we wanted the ability to search for any text on a LinkedIn member’s profile, such as the headline, summary, titles, positions and skills. Below is a small example snippet of a text blob containing the keywords extracted from a LinkedIn member profile.
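An illustrative (entirely made-up) blob of extracted profile keywords might look like:

```
senior software engineer, distributed systems, apache kafka, machine learning,
gpu processing, java, scala, stream processing, data infrastructure, mapreduce
```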
Pinot supports super fast query processing through its indexes on non-BLOB-like columns. Queries with exact match filters on terms run efficiently through a combination of our highly optimized native storage structures, such as dictionary encoding, inverted indexing, and sorted indexing.
An example query using an exact match filter:
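For instance, a query like the following counts members with an exact title match (the table and column names here are illustrative, not the actual LTI schema; recall from above that title ID 9 represents “software engineer”):

```sql
-- Exact match on a dictionary-encoded column: served efficiently by
-- Pinot's dictionary/inverted/sorted indexes, no text search needed
SELECT COUNT(*)
FROM profileData
WHERE titleId = 9
```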
For arbitrary text search queries, Pinot offers a built-in function called REGEXP_LIKE. However, this function can’t leverage any index and relies on a full table scan, which is very inefficient considering LTI’s data volume, performance SLA, and the nature of its queries. LTI also needed the ability to combine terms (single words), phrases (multiple words in the same order), and regexes through logical operators (AND, OR, NOT) to compose a single complex text search Boolean expression. Creating such a query with REGEXP_LIKE (based on Java regex) becomes non-trivial for the end user. For example, the following query finds the number of candidates with “distributed systems” and “apache” in their skill sets. The skills could be present in any order in the column value representing the candidate’s resume text.
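With only REGEXP_LIKE available, such a query might be written as follows (illustrative schema; note that every ordering of the terms has to be spelled out explicitly in the pattern):

```sql
-- Full scan over the text column; both permutations of the two terms
-- must appear in the pattern, which gets unwieldy as terms are added
SELECT COUNT(*)
FROM profileData
WHERE REGEXP_LIKE(skills,
  'distributed systems.*apache|apache.*distributed systems')
```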
You could potentially use groups with lookbehind and lookahead to avoid writing out all permutations, but it still complicates writing such text search queries. Keyword searches on the LTI platform can have up to 1,000 different phrases, terms, and regexes combined in a single search query. So, the REGEXP_LIKE function was not the right choice to get the desired functionality, usability, and performance.
Text index in Pinot
To efficiently handle text search queries, the Pinot team at LinkedIn added support for a text index on STRING type columns. Before we dig into further details, let’s quickly compare a production text search query’s performance with and without the index on a dataset of 500 million rows with a filter selectivity of 100 million. The text search query with the index is faster by multiple orders of magnitude.
Latency of a text search query without index and with index
Apache Lucene, an open source full-text search library, supports the necessary features for text search, so we evaluated Lucene by doing a proof of concept implementation. We did experiments to assess the text search functionality, CPU, I/O, and memory overhead, using the profile keywords text data available at that time. (The entire production-scale profile keywords data wasn’t ready at the time of the POC evaluation.) However, we did get a fair idea of functionality and performance without any red flags, allowing us to conclude that Lucene would meet our needs after some optimizations for production scale performance.
Let’s briefly discuss how text indexes are created and queried in Pinot.
Pinot’s table storage format is columnar. Native index structures for the table are created on a per column, per segment (shard) basis. We decided to stick with this fundamental design for the text index, as evolution and maintenance are easier, allowing the user to enable or disable the text index on a per column basis.
Initially, we considered adding a new data type in Pinot (called the TEXT data type). There were some advantages to doing so with respect to adding independent code, etc. However, that would have meant existing Pinot tables with STRING columns would not be able to add a text index. It would be a long migration path of adding a new column and copying data before we could enable text indexing. We therefore decided to add text indexing support to the STRING column type instead. Also, physically, the TEXT data type is not different from STRING, as the raw data is stored as snappy-compressed UTF-8 in both cases.
Like other indexes, a text index is created as part of Pinot segment creation. Both offline and realtime Pinot segments support the text index feature. For each row in the table, if text indexing is enabled for a column, we take the STRING column’s value and encapsulate it in a document. The document consists of two fields.
Text field: contains the actual column value representing the body of text that should be indexed.
Stored field: contains a monotonically increasing docId counter to reverse map each document indexed in Lucene back to its docId (rowId) in Pinot. This field is stored as-is; it is neither tokenized nor indexed.
As mentioned earlier, the keywords text data is enormous. Across all LTI Pinot tables using this feature, the average size (per table segment) of the keywords column alone is roughly 2.5GB. Pinot uses dictionary encoding for all columns by default, thus building a dictionary for the columns during segment generation. Given the huge size and cardinality of this column in the LTI tables, building the dictionary created significant heap overhead and GC pressure. We therefore decided to disable dictionary encoding on columns with text indexing enabled.
We also used this opportunity to address one limitation in our on-disk segment format for raw (without dictionary encoding) column data. The format didn’t support creating a segment with a raw column size bigger than 2GB. The on-disk format divides the segment file into chunks and packs 1,000 rows (snappy compressed) in each chunk. Since a single text column value can contain up to 2 million characters, the default method of packing 1,000 rows in a chunk resulted in overflow for compression buffer size.
We fixed the chunk size at 1MB and used the length of the longest column value to auto-derive the number of rows we should store in a single chunk. Secondly, each chunk’s starting offset was previously tracked in the file header using a 4-byte offset. Since the total size of text column values across all rows in the segment can be greater than 2GB, we started using an 8-byte offset for the chunk. This change was backward-compatible, allowing the new writer to read the old segment format, and was made configurable so that we don’t necessarily have to switch to the new format for all production clusters.
Plain text is used as input for index generation. An analyzer performs pre-processing steps like lowercasing, breaking text into indexable tokens, etc., on the provided input text. We currently use StandardAnalyzer, which is good enough for standard English alphanumeric text and uses a Unicode text segmentation algorithm to break text into tokens. StandardAnalyzer is also used during query execution to analyze and compile the search expression before searching the text index.
We enhanced the Pinot query parser and planner with a new in-built function TEXT_MATCH() to be used in the WHERE clause of the queries for filtering using a text index.
WHERE TEXT_MATCH(<columnName>, <searchExpression>)
Let’s take an example of profile text stored as a STRING column in a Pinot table with a text index enabled on the column. We can now do different kinds of text analysis on the profile data. For instance: count the number of candidates whose profiles contain “machine learning”, “gpu processing”, and one of “distributed systems” or “apache”:
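With a text index in place, that analysis can be expressed with a single TEXT_MATCH filter (illustrative table and column names; the search expression uses Lucene query syntax, with phrases in double quotes):

```sql
-- One Boolean text search expression replaces an unwieldy regex
SELECT COUNT(*)
FROM profileData
WHERE TEXT_MATCH(profileText,
  '"machine learning" AND "gpu processing" AND ("distributed systems" OR apache)')
```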
Pinot’s execution engine was enhanced with a new filter query operator to execute the TEXT_MATCH clause. The key thing to note is that Pinot’s fundamental building blocks—like columnar format, logical plan, per-segment physical distributed execution plan, and in-memory query execution—remain unchanged. In fact, they provided a stable foundation to add support for this new feature in the engine.
Pinot building blocks
The following diagram shows the flow for the per segment filter execution for TEXT_MATCH() clauses on two different columns with a text index.
Keyword search query execution in Pinot on an offline table segment
Feature implementation and integration in LTI
Given that text search could now be easily used through a new in-built function in the WHERE clause of the Pinot query, it was fairly compatible with all the existing APIs across the LTI backend, middle-tier, and frontend (see figure below). Whenever a user makes a keyword search on the UI, the TEXT_MATCH() filter is added to the Pinot queries that generate the LTI report’s metrics. As a simplified example, the following query counts software engineers who have the phrase “tensor flow” on their LinkedIn profile, for the top 10 companies.
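A sketch of such a query, again with hypothetical table and column names (recall from above that title ID 9 represents “software engineer”):

```sql
-- Combine a standard exact-match filter with a text search filter,
-- then group by company and keep the top 10
SELECT companyId, COUNT(*) AS numEngineers
FROM profileData
WHERE titleId = 9
  AND TEXT_MATCH(profileText, '"tensor flow"')
GROUP BY companyId
ORDER BY numEngineers DESC
LIMIT 10
```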
Example of a user creating a keyword search on the LinkedIn Talent Insights platform
Production scale testing
Performance-wise, we targeted two goals on production scale data and queries:
Firstly, any latency degradation in the existing workload of non-keyword-search queries was not acceptable. Due to the increase in data volume, each Pinot server in the production cluster was memory-mapping more data. So, there was a chance that the performance of existing queries could degrade due to increased paging.
Secondly, the latency SLA for keyword-search based queries should be under 1 second P95.
In the LTI ETL process, we made sure to use the same data and fields that are used by Recruiter’s index generation process. The resulting keywords data was ingested into the new member profile keyword column of each LTI Pinot table on the performance test cluster. We sampled production Pinot queries, which were then modified to include TEXT_MATCH() filter clause(s), and created three categories of workloads to characterize text search queries’ performance in a steady state workload.
No text search query: baseline workload
Mixed workload comprising 50% text search queries: estimated production workload
100% text search queries: worst case scenario of an all-text query workload
We made several optimizations during feature development and production scale testing. These optimizations focused on achieving the latency SLA of the text search queries across different LTI Pinot tables.
Reducing heap overhead: For a search query, Lucene’s default behavior is scoring and ranking, returning the top N hits of the query sorted by score (descending). We didn’t need any of the scoring related features; we were only interested in retrieving all the matched docIds for a given text search query. Our initial experiments revealed that the default search code path in Lucene results in significant heap overhead for our volume of data. To solve this problem, we implemented the Collector interface to provide a simple callback to the search operation. For every matching Lucene docId, Lucene calls our collector callback, which stores the docId in a bitmap that can then be used by our filter execution code for computing AND/OR with other filter clauses.
Pruning stop words: Profile keywords text data had a small percentage of stop words (a, an, the, or, etc.). We pruned the stop words during text analysis at the time of index generation to reduce the text index’s size and improve the query performance. This optimization helped reduce the size of the text index by 20%.
CPU bottleneck with increasing QPS: Similar to any database, Lucene internally assigns a docId to each document indexed. However, this is not guaranteed to be the same as the docId maintained in the Pinot table format, since Lucene divides its index into multiple sub-indexes, and the docId is relative to each sub-index. This is why we associate the Lucene document with the corresponding Pinot docId. This results in a two-pass query execution in the text match filter operator:
The collector callback builds a bitmap of matching Lucene docIds based on the search query.
The filter operator iterates over each Lucene docId to retrieve the corresponding document.
The Pinot docId is then read from the document’s stored field.
Retrieving the entire document from Lucene was a CPU hog and became a major bottleneck during throughput testing. To avoid this, we iterate over the text index once when the Pinot server loads the segment, fetch all <lucene docId, pinot docId> mappings, and write them to a memory-mapped file. We pay the cost of retrieving the entire document just once, when the server loads the text index. The mapping file is later used during query execution by the collector callback to short-circuit the search path and directly construct a bitmap of Pinot docIds without retrieving the entire document.
These optimizations and pruning the stop words gave us a 40–50x improvement in query performance by allowing the latency to scale better with an increase in QPS, as shown in the graph below.
After performance optimizations, latency scales better with increase in QPS
As part of concluding the testing and optimization phase, we also decided to add additional capacity to two production Pinot clusters of LTI tables to further help achieve the latency SLA and a reasonable level of data volume (which impacts paging) and CPU utilization per Pinot server. The graph below shows the final P95 latency numbers (with increasing QPS for all the three categories of workloads described earlier).
Keyword search query P95 latency with increasing QPS for different workloads
Readers are highly encouraged to go through our previous blog post that discussed the feature design, implementation, and optimizations in great depth for offline and real-time text search.
Production rollout and wins
After keyword search was implemented and integrated with Pinot on the LinkedIn Talent Insights application, we followed a ramp strategy that allowed for A/B testing and a safe rollout of the feature while carefully monitoring latency, CPU utilization, error rate, and other metrics. The feature was successfully rolled out to all users of the LTI platform in February 2021. It is currently enabled on four Pinot production clusters of LTI tables, allowing users to do a keyword search on the entire member profile data.
Since the launch of the feature in LTI, over 20% of searches utilize the keyword search facet. The flexibility unlocked by keyword search has significantly expanded the usefulness of LTI for hiring use cases. Users are leveraging the free text capabilities to create broad term searches, analyze talent pools for emerging tools and technologies, and build complex Boolean search queries to isolate highly specific talent pools. Users described the ability to couple standardized search with free text as a “game-changer” for talent analytics.
We are planning to extend the keyword search functionality to new datasets including job post data, which will unlock a whole new set of competitive intelligence use cases for our customers.
This feature would not have been possible without the great collaboration between the LinkedIn Talent Insights and Pinot teams. We would like to thank Shraddha Sahay, Jeremy Lwanga, Kristina Ryan, Manzar Kazi, Nikko Bautista, Juliette Xiong, Gaurav Sisodiya, Huan Chen, Joe Zhou, Farhan Toddywala, Kamel Dabwan, Gulsah Kandemir, Iuliia Kotlenko, Peter Ho, Albert Liu, Shikhar Mathur, Rui Zhang, Tian Lan, Rachael Holmes, Lu Zhang, Andy Li, Kaleb Crawford, Elise Nguyen, Ben Leiner, Alex Pavlakis, Seunghyun Lee, Jack Li, Mayank Shrivastava, Subbu Subramaniam, Dino Occhialini, and Prasanna Ravi, under the leadership of Eric Baldeschwieler, Kapil Surlaker, and Aarathi Vidyasagar. Special thanks to Kishore Gopalakrishna from the Pinot OSS community for collaborating on requirements, design, and code reviews.