Star-tree index: Powering fast aggregations on Pinot
June 14, 2019
Pinot is an open source, scalable, distributed OLAP data store that recently entered the Apache Incubator. Developed at LinkedIn, it powers a wide variety of production use cases that require real-time, low-latency analytics.
One of the biggest challenges in Pinot is maintaining tight SLAs on latency and throughput over large data sets. Existing indexing techniques, such as the sorted index and the inverted index, accelerate the document search to improve query latencies. However, their performance still scales linearly with the number of documents that must be processed to compute the results. On the other hand, pre-aggregating the results reduces the number of documents to be processed, thus enforcing a constant upper bound on query latencies. Pre-aggregation, however, either limits query support (by restricting the dimension combinations) or explodes the storage space (when many dimension combinations are materialized).
For aggregation and group-by queries, we introduced the Star-Tree index, which uses pre-aggregated documents in a smart way that achieves low query latencies while using storage space efficiently.
Consider the following data set as an example to discuss the existing approaches (the data set has been de-duped and pre-sorted by country):
In the sorted index approach, data is sorted on a primary key (Country in the example) that is likely to appear as a filter in most queries.
Below is the sorted index for column Country:
Consider the query: SELECT SUM(Impressions) FROM Table WHERE Country = ‘USA’
With a sorted index on column Country, the time complexity of searching the documents for the given primary key value ‘USA’ is reduced from a linear scan Θ(n) to a constant-time value lookup Θ(1). Additionally, sorting provides spatial locality for documents, which significantly reduces disk fetches and thus improves latency.
While this is a good improvement over linear scanning, there are still a few issues with this approach:
Sorting can only be applied to the primary key, which means only queries filtering on that one column can benefit from the sorted index.
While search time is reduced from Θ(n) to Θ(1), the aggregation cost is still a function of the number of matching documents (in this example, 3) that must be processed to answer the query.
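The sorted index lookup can be sketched as follows. Because the column is sorted, each value maps to one contiguous document-ID range, so the lookup itself is a constant-time dictionary access; the aggregation still scans every document in the range. The data values here are hypothetical, chosen only to be consistent with the totals quoted later in the post:

```python
# Hypothetical Impressions column, one entry per document (7 documents total).
impressions = [300, 200, 150, 350, 600, 400, 200]

# Sorted index on Country: value -> (startDocId, endDocId), built once
# while scanning the sorted column. Ranges are contiguous because the
# data is sorted on this column.
sorted_index = {"CA": (0, 1), "MX": (2, 3), "USA": (4, 6)}

def sum_impressions(country):
    start, end = sorted_index[country]      # Θ(1) value lookup
    # The aggregation cost is still linear in the size of the range.
    return sum(impressions[start:end + 1])
```

For `'USA'` the lookup returns the range (4, 6), and the sum over those three documents yields 1200, matching the example query.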
With an inverted index, we maintain, for each value of a given column, a list of the document IDs in which that value appears.
Below are the inverted indexes for columns Browser and Locale for our example data set:
Consider the query: SELECT SUM(Impressions) FROM Table WHERE Browser = ‘Firefox’
With an inverted index on column Browser, similar to the sorted index, the document search becomes a simple value lookup for the key ‘Firefox’, which directly yields the matching documents [1, 5, 6].
An inverted index brings document search time down to constant time Θ(1) for arbitrary columns. However, it cannot leverage spatial locality, and, as with the sorted index, the aggregation cost is still a function of the query’s selectivity (in the example, 3 documents).
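An inverted index lookup can be sketched in the same style. The posting list for ‘Firefox’ matches the documents [1, 5, 6] from the example; the Impressions values are hypothetical:

```python
# Hypothetical Impressions column, one entry per document.
impressions = [300, 200, 150, 350, 600, 400, 200]

# Inverted index on Browser: value -> list of document IDs where it appears.
inverted_index = {
    "Firefox": [1, 5, 6],      # matches the example's documents [1, 5, 6]
    "Chrome": [0, 2, 3, 4],
}

def sum_impressions(browser):
    doc_ids = inverted_index[browser]            # Θ(1) lookup of matching docs
    # The aggregation must still visit every matching document, and the
    # documents may be scattered (no spatial locality).
    return sum(impressions[d] for d in doc_ids)
```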
With pre-aggregation, we pre-compute the answers to a given set of queries up front.
In the example below, we have pre-aggregated the total Impressions for each Country:
Consider the query: SELECT SUM(Impressions) FROM Table WHERE Country = ‘USA’
With pre-aggregation, the query can be solved by just a value lookup, and we can directly get the final result of 1200 without extra aggregation cost.
However, the pre-aggregation in the example can only solve queries with a predicate on Country. Answering queries with multiple predicates (consider the query: SELECT SUM(Impressions) FROM Table WHERE Country = ‘USA’ AND Browser = ‘Firefox’ AND Locale = ‘en’) requires pre-aggregating the various combinations of different dimensions, which makes the storage space grow exponentially with the number of dimensions.
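The explosion can be made concrete with a back-of-the-envelope count: to answer an arbitrary query, we would need one pre-aggregated entry per combination of dimension values, across every subset of dimensions. The cardinalities below are hypothetical:

```python
from itertools import combinations
from math import prod

# Hypothetical per-column cardinalities.
cardinality = {"Country": 200, "Browser": 50, "Locale": 100}

# Count the pre-aggregated entries needed to cover every subset of
# dimensions: each subset contributes the product of its cardinalities.
total_entries = 0
dims = list(cardinality)
for r in range(1, len(dims) + 1):
    for subset in combinations(dims, r):
        total_entries += prod(cardinality[d] for d in subset)

print(total_entries)  # dominated by the full combination, 200 * 50 * 100
```

With just these three modest columns, the full combination alone contributes a million entries; each added dimension multiplies the count by its cardinality.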
The graph below shows the performance gain and space cost for the different techniques. On the left side, we have indexing techniques that improve search time with limited increase in space, but do not guarantee a hard upper bound on query latencies because of the aggregation cost. On the right side, we have pre-aggregation techniques that offer hard upper bound on query latencies, but suffer from exponential explosion of storage space.
We propose the Star-Tree data structure inspired by the star-cubing paper (Xin, Han, Li, & Wah, 2003) that offers a configurable trade-off between space and latency and allows us to achieve a hard upper bound for query latencies for a given use case.
In the following sections, we will first define the Star-Tree data structure, followed by a Star-Tree example on the sample data set. Then, we will discuss how Star-Tree is utilized within Pinot to achieve low latencies with high throughput.
Star-Tree is a tree data structure that has the following properties:
Root Node (Orange): Single root node from which the rest of the tree can be traversed.
Leaf Node (Blue): A leaf node can contain, at most, T documents, where T is configurable.
Non-leaf Node (Green): Nodes with more than T documents are further split into children nodes.
Star-Node (Yellow): Non-leaf nodes can also have a special child node called the Star-Node. This node contains the aggregated documents after removing the dimension on which the data was split for this level. Star-Node can be either a leaf or non-leaf node.
Dimensions Split Order ([D1, D2]): Nodes at a given level in the tree are split into children nodes on all values of a particular dimension. The dimensions split order is an ordered list of dimensions that is used to determine the dimension to split for a given level in the tree.
Function Column Pairs: The pre-aggregations to perform when generating the tree.
Max Leaf Records: The threshold T to determine whether to further split each node. This threshold is used to tune the level of pre-aggregations performed. With a larger threshold, the index size will be smaller, while more documents will need to be processed to answer the query.
The properties stored in each node are as follows:
Dimension: The dimension on which the node is split.
Value: The value of the dimension that the node represents.
Start/End Document ID: The range of documents this node points to.
Aggregated Document ID: One single document which is the aggregated result of all documents pointed to by this node.
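The per-node properties above can be sketched as a small record type. This is an illustrative Python sketch, not Pinot's actual (Java) implementation:

```python
from dataclasses import dataclass, field

@dataclass
class StarTreeNode:
    dimension: str        # dimension this node's parent level was split on
    value: object         # dimension value this node represents ("*" for a Star-Node)
    start_doc_id: int     # first Star-Tree document covered by this node
    end_doc_id: int       # last covered document, exclusive (illustrative convention)
    aggregated_doc_id: int            # single pre-aggregated document for this range
    children: dict = field(default_factory=dict)  # child value -> StarTreeNode

    def is_leaf(self):
        return not self.children
```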
Star-Tree index is generated in the following steps:
The data is first projected per the dimensionsSplitOrder: only the dimensions in the split order are retained, while the others are dropped. For each unique combination of retained dimensions, the metrics are aggregated per the configuration. The aggregated documents are written to a file and serve as the initial Star-Tree documents (kept separate from the original documents).
Sort the Star-Tree documents based on the dimensionsSplitOrder: primarily on the first dimension in the list, then on each subsequent dimension in order. Each node in the tree points to a range within the sorted documents.
The tree structure can be created recursively (starting at the root node) as follows:
If a node has more than T records, it is split into multiple children nodes, one for each value of the dimension in the split order that corresponds to the current level of the tree.
A Star-Node can be created (per configuration) for the current node by dropping the dimension being split on and aggregating the metrics across rows whose remaining dimension values are identical. These aggregated documents are appended to the end of the Star-Tree documents.
If there is only one value for the current dimension, the Star-Node is not created, because the documents under it would be identical to those under the single child node.
The above step is repeated recursively until there are no more nodes left to split.
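The recursive construction can be sketched as follows. The documents are assumed to already be pre-aggregated per unique dimension combination and sorted by the split order; the function names and the dict-based node layout are illustrative, not Pinot's implementation:

```python
T = 1  # maxLeafRecords threshold (hypothetical value)

def aggregate_without(docs, dim):
    """Drop `dim` and merge metrics across rows that become identical."""
    merged = {}
    for d in docs:
        key = tuple(sorted((k, v) for k, v in d.items()
                           if k not in (dim, "Impressions")))
        row = merged.setdefault(key, dict(d, **{dim: "*", "Impressions": 0}))
        row["Impressions"] += d["Impressions"]
    return list(merged.values())

def build(docs, split_order, level=0):
    node = {"range": docs, "children": {}}
    if len(docs) <= T or level == len(split_order):
        return node  # leaf: few enough records, or no dimension left to split on
    dim = split_order[level]
    values = sorted({d[dim] for d in docs})
    # One child per value of the current dimension.
    for v in values:
        group = [d for d in docs if d[dim] == v]
        node["children"][v] = build(group, split_order, level + 1)
    # Star-Node: drop the split dimension and re-aggregate, unless the
    # dimension has a single value (the Star-Node would be identical).
    if len(values) > 1:
        node["children"]["*"] = build(aggregate_without(docs, dim),
                                      split_order, level + 1)
    return node
```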
Multiple Star-Trees can be generated with different configurations (dimensionsSplitOrder, aggregations, T). The query executor picks one whose configuration can solve the query (discussed in the Query Execution section).
Aggregation is configured as a pair of the aggregation function and the column to apply the aggregation.
All aggregation functions with bounded-sized intermediate results are supported:
COUNT: Intermediate result Long is bounded
MIN: Intermediate result Double is bounded
MAX: Intermediate result Double is bounded
SUM: Intermediate result Double is bounded
AVG: Intermediate result Pair<Double, Long> is bounded
MINMAXRANGE: Intermediate result Pair<Double, Double> is bounded
DISTINCTCOUNTHLL: Intermediate result HyperLogLog is bounded
PERCENTILEEST: Intermediate result QDigest is bounded
PERCENTILETDIGEST: Intermediate result TDigest is bounded
Some aggregation functions have unbounded-sized intermediate results; they are not supported, in order to prevent storage space explosion. However, their results can be approximated with the supported functions above.
DISTINCTCOUNT: Intermediate result Set is unbounded
PERCENTILE: Intermediate result List is unbounded
For query execution, the idea is to first check the metadata to determine whether the query can be solved with the Star-Tree documents. If so, traverse the Star-Tree to identify the documents that satisfy all the predicates. Then, after applying any remaining predicates that were missed while traversing the tree, apply the aggregation/group-by to the qualified documents.
In order to solve an aggregation/group-by query with a Star-Tree, all the columns in the filter and group-by clauses must be materialized (configured in dimensionsSplitOrder), and all the aggregations must be pre-aggregated (configured in functionColumnPairs). The query executor picks the first Star-Tree that meets these requirements, or falls back to normal aggregation if none qualifies.
Traverse the tree
The algorithm to traverse the tree can be described with the following diagram:
Apply remaining predicates
Some dimensions might not be split because of the leaf records threshold (the upper limit on the number of records to be processed for each branch of the tree). In that case, the remaining predicates on the unsplit dimensions are applied after traversing the tree, just as in normal query execution, except that the pre-aggregated records are used.
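The traversal logic can be sketched as follows: at each level, follow the child matching the predicate on that level's dimension, every non-star child if the dimension is grouped on, and the Star-Node otherwise (falling back to all children when no Star-Node exists). The dict-based node layout is illustrative, not Pinot's implementation:

```python
def traverse(node, split_order, predicates, group_by, level=0):
    """Return the nodes whose document ranges must be processed."""
    if not node["children"] or level == len(split_order):
        return [node]                    # leaf: caller processes its doc range
    dim = split_order[level]
    children = node["children"]
    if dim in predicates:                # predicate: follow the matching value only
        child = children.get(predicates[dim])
        return (traverse(child, split_order, predicates, group_by, level + 1)
                if child else [])
    if dim in group_by:                  # group-by: follow all non-star children
        return [n for v, c in children.items() if v != "*"
                for n in traverse(c, split_order, predicates, group_by, level + 1)]
    if "*" in children:                  # otherwise: the Star-Node covers all docs
        return traverse(children["*"], split_order, predicates, group_by, level + 1)
    return [n for c in children.values() # no Star-Node: fall back to all children
            for n in traverse(c, split_order, predicates, group_by, level + 1)]
```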
Use this example data set and the following configurations as an example:
Dimensions Split Order: [Country, Browser, Locale]
Function Column Pairs: [SUM(Impressions)]
Max Leaf Records: 1 (We put 1 here so that all of the dimension combinations are pre-aggregated for clarity)
The values in the parentheses are the aggregated sum of Impressions for all the documents under the node.
SELECT SUM(Impressions) FROM Table
Because there is no predicate or group-by on any dimension, select the Star-Node for all dimensions (document 26). Instead of aggregating all seven documents, as we would without the Star-Tree, we get the aggregation result 2200 by processing only one document.
SELECT SUM(Impressions) FROM Table WHERE Country = ‘USA’
Because there is only a predicate on Country, select the node with value ‘USA’ for Country and the Star-Node for Browser and Locale (document 14). Instead of filtering and aggregating three documents, as we would without the Star-Tree, we get the aggregation result 1200 by processing only one document.
SELECT SUM(Impressions) FROM Table WHERE Locale = ‘en’
Similar to the last query, select the Star-Node for Country and Browser, and the node with value ‘en’ for Locale (document 23). Again, by processing only one document, we get the aggregation result 1500.
SELECT SUM(Impressions) FROM Table GROUP BY Browser
Because there is a group-by clause on Browser, select the Star-Node for Country and Locale, and all nodes except the Star-Node for Browser (documents 15, 19, and 22). For Country ‘*’ and Browser ‘Chrome’, since there is no Star-Node for Locale, select all child nodes instead. To get the group-by result, we need to process only one document per group.
We benchmarked the performance gains of the Star-Tree index against the inverted index on 48 million records generated with the TPC-H tools. The results were as follows:
Despite such large improvements in both latency and throughput, the Star-Tree index costs only about 12% extra storage space compared to data with no index, and 6% extra compared to data with an inverted index.
For more detailed benchmark results, please refer to the Performance section of this paper.
Most indexing techniques accelerate the document search, but fail to limit the number of documents that need to be processed. Performance can therefore be poor for queries with low selectivity (too many documents get selected). On the other hand, pre-aggregation can bound the query latency, but may lead to storage space explosion.
The Star-Tree index gets the best of both worlds. By combining pre-aggregated documents with a tree-structured index, the Star-Tree accelerates the document search and bounds the number of documents selected, without exploding the storage space.
Star-Tree index vastly improves the latency and throughput for queries with low selectivity, allowing the same hardware to handle more queries, thus significantly reducing the cost-to-serve.
We would like to thank all members of the Pinot team for their relentless efforts to make Pinot better: Dino Occhialini, Jean-Francois Im, Jennifer Dai, Jialiang Li, John Gutmann, Kishore Gopalakrishna, Mayank Shrivastava, Neha Pawar, Seunghyun Lee, Subbu Subramaniam, Sunitha Beeram, Walter Huf, and our engineering manager Shraddha Sahay and SRE manager Prasanna Ravi. Also we would like to thank Ravi Aringunram, Eric Baldeschwieler, Kapil Surlaker, and Igor Perisic for their leadership and continued support.