Open Source

Solving for the cardinality of set intersection at scale with Pinot and Theta Sketches

Co-authors: Vincent Wang, Siddharth Teotia, Manoj Thakur, and Mayank Shrivastava

As our LinkedIn Marketing Solutions Blog recently noted, companies and marketers “are once again peering ahead, setting their plans for success in a reshaped business environment.” One of the things businesses rely on to do this is insights, such as the estimated reach of an advertising campaign that targets a specific audience with multiple criteria. It’s simple, right? A customer should be able to know how many members they will be able to reach.

However, it can be a complex problem to solve on the backend, especially at speed and scale. In this post, we’ll highlight our efforts to improve the speed and scalability of audience-reach estimation. Through this work, we reduced the data size by 80 percent, which unlocked more frequent data pushes and ultimately more up-to-date data for our customers. We used cardinality approximations of large set intersections with Apache Pinot and Theta Sketches, which allow us to efficiently estimate the unique size of a targeted audience when factoring in the multiple criteria of an advertising campaign.

Set intersection cardinality 

A targeting criterion is defined by selecting specific values of the various member attributes. For example, members who live in the U.S. or Canada, work for LinkedIn, and know Java/C++. This insight is surfaced via the LinkedIn Campaign Manager™ product. The screenshot below shows a case where the customer intends to target such a group of members.


Campaign Manager screenshot: estimated reach of an ad campaign based on targeting criteria  

The reach estimation use-case mentioned above can be translated to finding the unique count of members in the intersection of multiple sets. The targeting criterion here is a logical combination of various conditions, where each condition represents a set of members. For example, a set of members who work for LinkedIn, another set who live in the U.S./Canada, and yet another set of members who are proficient in Java/C++. When we create a targeting criterion by combining these conditions (or sets), we get a set intersection of members that satisfy these conditions. An engineer who works at LinkedIn, lives in Sunnyvale, California, and knows Java/C++ is part of all three sets, but for the purpose of reach estimation, should be counted only once. At LinkedIn, we use Apache Pinot, a real-time distributed OLAP datastore, for this use case.


Venn diagram displaying set intersection of targeting criteria

The set intersection cardinality problem is common, and yet it is challenging to solve for online applications at scale. The difficulty mainly comes from the fact that it requires processing large amounts of data within the tight time budgets that online applications must operate under.

In our initial approach to solve the set intersection cardinality problem, we used the count-distinct aggregation in Apache Pinot. In this approach, each row in the table represents a member (without personally identifiable information like member ID, for privacy reasons), and the columns are predefined attributes (or member dimensions) such as Country, Skills, Industry, etc. The table in Pinot is pre-aggregated to optimize for the number of rows.

Example of the pre-aggregated table, with a column per member attribute and a Count column

For reach estimation, we perform an aggregation to compute the count of members that have matching values for all dimensions. This count is stored in the “Count” column, and the reach-estimation problem for a targeting criterion chosen by an advertiser essentially boils down to the following SQL query:

select sum(count) from table where country IN ('US', 'Canada') and skills IN ('Java', 'C++') and company = 'LinkedIn'

We’ll now dive into the challenges we encountered with this and additional approaches explored, and how we ultimately solved the problem by implementing Theta Sketch based count distinct algorithms within Pinot.

Challenges with existing approach
While we do have an existing solution to the set intersection cardinality problem that uses Pinot (as mentioned in the previous section), the scale at which we operate imposes certain limitations on its effectiveness:

  • Highly dimensional, multi-valued columns offer limited optimization opportunities for reducing record count through pre-aggregations.

  • New multi-value dimensions, combined with organic growth in membership, lead to superlinear growth in data size over time.

  • The entire dataset has to be regenerated from member data and refreshed (overwritten) daily in Pinot, which results in data-push latency that grows linearly with the volume of data. The increasing push latency causes data freshness to deteriorate into an unacceptable range.

Alternative approaches
Given the limited opportunity for drastic data size reduction while maintaining the current implementation, we explored multiple alternatives to address the challenges by using approximation strategies and rethinking the schema of the data itself.

HyperLogLog based Count-Distinct

We explored the HyperLogLog (HLL) based approximate distinct count function provided by Pinot as one potential solution. HyperLogLog is a probabilistic data structure primarily used for fast approximate cardinality estimation with high accuracy and minimal storage overhead.

In this approach, an HLL object can be used to represent all members that have a certain value for a certain dimension (e.g., country=USA). Instead of having a separate row for each unique dimension combination, we have rows of HLL objects to represent all possible values of all member dimensions. While HLL does support set union operations (e.g., “country=USA OR company=LinkedIn”), it does not natively provide support for set-intersection operations (e.g., “country=USA AND company=LinkedIn”), which is a product requirement. This problem could be solved by using the Inclusion-Exclusion Principle with the HyperLogLog data structure. In general, the following formula holds true for all sets:

|A ∩ B| = |A| + |B| - |A ∪ B|

However, breaking down an intersection operation into multiple terms also results in larger error margins since we end up doing multiple approximate operations on HyperLogLog. Upon further review, we were unable to achieve an acceptable margin of error, and thus did not move forward with this approach.
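
To make this concrete, here is a minimal sketch of the inclusion-exclusion approach using the HLL classes from the Apache DataSketches Java library (the data and sketch parameters are illustrative, and method names may differ slightly across library versions). The intersection estimate is derived from three approximate quantities, so their individual errors compound.

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class HllInclusionExclusion {
  public static void main(String[] args) {
    // Illustrative sketches: members with country = USA and members with company = LinkedIn.
    HllSketch usMembers = new HllSketch(12);
    HllSketch linkedinMembers = new HllSketch(12);
    for (long memberId = 0; memberId < 1_000_000; memberId++) {
      usMembers.update(memberId);
      if (memberId % 100 == 0) {
        linkedinMembers.update(memberId); // true intersection size: 10,000
      }
    }

    // HLL supports union natively.
    Union union = new Union(12);
    union.update(usMembers);
    union.update(linkedinMembers);

    // Intersection has to be derived via inclusion-exclusion,
    // |A ∩ B| ≈ |A| + |B| - |A ∪ B|, so three approximation errors stack up.
    double estimate = usMembers.getEstimate() + linkedinMembers.getEstimate()
        - union.getResult().getEstimate();
    System.out.println("Estimated |A ∩ B| = " + estimate);
  }
}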

Theta Sketch based Count-Distinct

Theta Sketch is another probabilistic data structure similar to HLL that can be used to approximate the cardinality of set operations. The main difference between Theta Sketch and HLL is that Theta Sketch supports all common set operations, including intersections.
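
By contrast, a Theta Sketch intersection can be computed directly. Below is a minimal sketch using the Theta classes from the Apache DataSketches Java library (again with illustrative data and default parameters, assuming a recent version of that library):

import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaIntersection {
  public static void main(String[] args) {
    UpdateSketch usMembers = UpdateSketch.builder().build();       // country = USA
    UpdateSketch linkedinMembers = UpdateSketch.builder().build(); // company = LinkedIn
    for (long memberId = 0; memberId < 1_000_000; memberId++) {
      usMembers.update(memberId);
      if (memberId % 100 == 0) {
        linkedinMembers.update(memberId);
      }
    }

    // Intersection is a first-class operation; no inclusion-exclusion is needed.
    Intersection intersection = SetOperation.builder().buildIntersection();
    intersection.intersect(usMembers);
    intersection.intersect(linkedinMembers);
    System.out.println("Estimated |A ∩ B| = " + intersection.getResult().getEstimate());
  }
}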

We were encouraged by the possibilities of Theta Sketches, but needed to further explore two major challenges before moving forward. First, Pinot did not have support for a Theta Sketch based distinctCount aggregation function; and second, we were unsure if we would be able to tune it to achieve the desired level of error margins.

In order to get realistic estimates for storage cost and accuracy, we quickly implemented a functional prototype as follows:

  • Using our production data, we created sketches for each value of each dimension and stored them in a key-value store.

  • We implemented a driver that would query these sketches, perform given set operations on them, and compute the cardinality of the resultant set.

  • We then ran our production queries using this driver, and compared the results to the accurate results returned by Pinot with the existing solution.

With this prototype, we were able to establish realistic estimates for both accuracy and storage cost and were ready to move forward with this approach.
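
For illustration, the core of the driver described above can be written in a few lines against the DataSketches Java API; the key-value map, the "dimension=value" key format, and the class name below are our own simplification, not the production implementation:

import java.util.Map;
import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.Sketches;
import org.apache.datasketches.theta.Union;

public class ReachEstimateDriver {
  private final Map<String, byte[]> kvStore; // "dimension=value" -> serialized Theta Sketch

  public ReachEstimateDriver(Map<String, byte[]> kvStore) {
    this.kvStore = kvStore;
  }

  private Sketch load(String key) {
    return Sketches.wrapSketch(Memory.wrap(kvStore.get(key)));
  }

  // Estimates |(US ∪ Canada) ∩ (Java ∪ C++) ∩ LinkedIn| for the example criterion.
  public double estimateReach() {
    Union countries = SetOperation.builder().buildUnion();
    countries.union(load("country=US"));
    countries.union(load("country=Canada"));

    Union skills = SetOperation.builder().buildUnion();
    skills.union(load("skills=Java"));
    skills.union(load("skills=C++"));

    Intersection intersection = SetOperation.builder().buildIntersection();
    intersection.intersect(countries.getResult());
    intersection.intersect(skills.getResult());
    intersection.intersect(load("company=LinkedIn"));
    return intersection.getResult().getEstimate();
  }
}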

Schema modification
The schema of the existing Pinot table had a column for each of the member dimensions, which, as we discussed, was limiting any pre-aggregations. Instead, we modified the schema so that each row represents the serialized sketch of a particular value of a particular dimension. Each row contains only three columns: Dimension Name, Dimension Value, and Serialized Sketch. The serialized sketches can be stored in Pinot as BYTES blobs. The modified schema now scales with the number of distinct dimension values, which is a significant reduction from the previous schema.

Existing schema versus new schema
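
As a rough illustration of how the Serialized Sketch value for one (Dimension Name, Dimension Value) row could be produced offline, here is a simplified sketch using the DataSketches Java library; the nominalEntries value shown is illustrative, not our production setting:

import org.apache.datasketches.theta.UpdateSketch;

public class SketchRowBuilder {
  // Builds the BYTES payload for one row, e.g., dimension "country", value "USA".
  public static byte[] buildSerializedSketch(long[] memberIdsWithValue) {
    UpdateSketch sketch = UpdateSketch.builder()
        .setNominalEntries(1 << 14) // accuracy vs. storage trade-off, tuned per use case
        .build();
    for (long memberId : memberIdsWithValue) {
      sketch.update(memberId);
    }
    // Compact to an immutable, ordered form before storing it in the BYTES column.
    return sketch.compact().toByteArray();
  }
}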

With this new schema, we were able to reduce our data size by nearly 88%, from close to 1 TB down to 120 GB. This approach also had the added benefit of enabling more frequent data refreshes into Pinot, thereby drastically improving data freshness for our customers. In addition, the new schema does not need to be altered by the introduction of new member dimensions.

The new schema required us to change our queries as well. For example, the following original query:

SELECT sum(count) FROM Table WHERE country = 'USA' AND company = 'LinkedIn'

had to be restructured by changing the where-clause in the query from a conjunction to a disjunction, to select all the necessary sketches for the operation:

WHERE
(Dimension Name = 'country' and Dimension Value = 'USA')
OR
(Dimension Name = 'company' and Dimension Value = 'LinkedIn')

While this solved the problem of selecting the right set of rows for the query, it introduced a new problem. Once we had all the sketches that matched the individual predicates, there was no syntax to specify the set operation to perform on these individual sketches. We addressed this by enhancing the query execution engine within Pinot.

Query execution enhancements
The new schema required us to modify the where-clause in the query to ensure all rows of interest are selected. This also meant that we needed a way to specify the final set-expression to be evaluated. Also, as can be expected, this final set-expression could only be evaluated at the Pinot-broker side. This is because performing set-intersection on the server side can lead to incorrect results, as individual servers do not have a global view of the data. Details on query execution can be found in the Apache Pinot documentation.

The query execution stack had inherent assumptions around aggregation functions taking single arguments. Our first order of business was to enhance the stack to eliminate this assumption. This then allowed us to specify the final set-expression to evaluate as an argument in the aggregation function, as in the query below:

SELECT
distinctCountThetaSketch(sketch, "(Dimension Name = 'country' and Dimension Value = 'USA') AND (Dimension Name = 'company' and Dimension Value = 'LinkedIn')")
FROM Table
WHERE
(Dimension Name = 'country' and Dimension Value = 'USA')
OR (Dimension Name = 'company' and Dimension Value = 'LinkedIn')

We implemented this enhancement in a generic way that allows Pinot to evaluate any post-aggregation expression. This opened the door for users to compute post-aggregation expressions such as:

select sum(metric1) / sum(metric2) from Table…

Using this syntax, the query execution would now look like:

  • Pinot-servers compute a union of sketches for individual predicates over the data they host. For our example, each server would compute two unionized sketches, one for country = 'USA' and another for company = 'LinkedIn', and return them to the broker.

  • The Pinot-broker would then evaluate the specified set-expression to compute the resultant sketch and its estimated cardinality (illustrated in the sketch below).
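
Conceptually, the broker-side step amounts to the following (a simplified illustration written against the DataSketches Java API, not Pinot's actual internal code): merge the partial, per-server sketches for each predicate, then apply the set-expression to the merged sketches.

import java.util.List;
import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.Union;

public class BrokerSideEvaluation {
  // Each inner list holds the per-server sketches for one predicate, e.g.,
  // predicateSketches.get(0) = sketches for country = 'USA' from every server.
  public static double evaluateAndOfPredicates(List<List<Sketch>> predicateSketches) {
    Intersection intersection = SetOperation.builder().buildIntersection();
    for (List<Sketch> perServerSketches : predicateSketches) {
      // Merge the partial results from all servers for this predicate...
      Union union = SetOperation.builder().buildUnion();
      for (Sketch sketch : perServerSketches) {
        union.union(sketch);
      }
      // ...then fold the merged sketch into the final set-expression (AND).
      intersection.intersect(union.getResult());
    }
    return intersection.getResult().getEstimate();
  }
}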

If interested in exploring further, the specific pull-requests for implementing these enhancements are available on GitHub: 5316, 5339, 5259, and 5275.

Accuracy improvements

While our initial prototype showed encouraging accuracy results, we validated it again once we had an end-to-end implementation complete. As also seen during the prototype, we confirmed that when performing intersections of two sets where the intersection cardinality is extremely small relative to the cardinality of either set, the approximation tends to have a higher margin of error, leaning towards the lower end (zero). For example, consider the following scenario:

|A| = 1,000,000,000, |B| = 300, |A ∩ B| = 10

The above scenario shows the cardinality of two sets and their intersection. Because the intersection cardinality is much smaller than that of either set, Theta Sketch will perform poorly in terms of accuracy. To improve our accuracy in such cases, we made the following changes.

We broke down the larger set into subsets with smaller cardinality and performed intersection with these subsets. The following demonstrates the “sharding” strategy of reducing error margins:

A = A₁ ∪ A₂ ∪ … ∪ Aₙ (disjoint shards), so |A ∩ B| = |A₁ ∩ B| + |A₂ ∩ B| + … + |Aₙ ∩ B|

The error margin of intersecting each shard of set A with B tends to be much smaller, because the size difference between the two operands is considerably less.
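
A simplified version of this sharding idea is sketched below; the hash-based shard assignment, shard count, and class name are illustrative, not our production implementation:

import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.UpdateSketch;

public class ShardedIntersectionEstimate {
  public static double estimate(long[] largeSetA, long[] smallSetB, int numShards) {
    // Partition the large set A into disjoint shards by hashing member IDs.
    UpdateSketch[] shardsOfA = new UpdateSketch[numShards];
    for (int i = 0; i < numShards; i++) {
      shardsOfA[i] = UpdateSketch.builder().build();
    }
    for (long memberId : largeSetA) {
      shardsOfA[Math.floorMod(Long.hashCode(memberId), numShards)].update(memberId);
    }

    UpdateSketch sketchB = UpdateSketch.builder().build();
    for (long memberId : smallSetB) {
      sketchB.update(memberId);
    }

    // |A ∩ B| = sum over shards of |A_i ∩ B|; each term intersects operands of
    // more comparable size, which keeps the relative error of each estimate lower.
    double total = 0.0;
    for (UpdateSketch shard : shardsOfA) {
      Intersection intersection = SetOperation.builder().buildIntersection();
      intersection.intersect(shard);
      intersection.intersect(sketchB);
      total += intersection.getResult().getEstimate();
    }
    return total;
  }
}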

We also used the nominalEntries parameter of the Theta Sketch library, which provides a way to improve accuracy at the cost of storage, and tuned it to suit our accuracy requirements.

We then performed accuracy analysis shadowing our production traffic. On top of the significant reduction in Pinot push time, our goal was also to keep the error rate below 20%, which is an acceptable rate given legal requirements to mask the exact count from advertisers. The following graph displays the 95th percentile error rate of our Theta Sketch approximations over two months of production traffic. During this period, we tuned the sketch shard size and the nominalEntries parameter to bring down the error rate from 32% to <20% without a significant increase in overall sketch data size.


95th Percentile error rate of Theta Sketch approximation over time
* Error rate = |(approximate - exact)/exact|

Read-time latency improvements

While the read-time latency of the first version of the Theta Sketch based implementation allowed us to run analysis using production data, we needed to improve it to meet our production SLAs. Once we had error rates under control, we switched gears to improve the latency of the Theta Sketch based aggregation function. The 95th percentile latency we observed from this implementation when running production queries sequentially was around 4 seconds. With a series of optimizations and performance improvements within Pinot’s implementation, as well as dynamic query rewrites on the client side, we were able to bring the 95th percentile read latency down to 800ms at a read QPS of 20, using a single replica of 3 Pinot-servers and 1 Pinot-broker.

The graph below shows our journey of how we managed to improve performance with various techniques.


P95 latency decrease over time due to Pinot optimizations

Conclusion

We successfully used Theta Sketch based set intersection cardinality estimation in Apache Pinot to solve the audience-reach estimation problem in production. The new solution alleviated the existing problem of data staleness by reducing data size (by approximately 80%) and capping data size growth from superlinear to sublinear. It is now in production and has improved the data freshness SLA for matched audiences by 40-50%.

Acknowledgements

Building this use case was a collaborative effort that required many cross-functional partners. We would like to extend our special thanks to our partners on the Ads Audience team (Bill Kuang, Song Zhao, Xinruo Jing) and the Pinot team who helped with this feature.

We would also like to acknowledge the Pinot team members at LinkedIn and the broader open source community, including Jack Li, Sajjad Moradi, Seunghyun Lee, Subbu Subramaniam, Kishore Gopalakrishna, and Jackie Jiang, for their contributions and valuable inputs in the design and development of the Theta Sketches based distinctCount in Pinot. 

And finally, we would also like to thank the LinkedIn Pinot SRE team for operating Pinot at LinkedIn scale, and the LinkedIn leadership, Jing Wang, Sanjay Dubey, Shraddha Sahay, Eric Baldeschwieler, Kapil Surlaker, and Igor Perisic, for their guidance and continued support.