Introducing Apache Pinot 0.3.0
April 27, 2020
Built at LinkedIn, Pinot is an open source, distributed, and scalable OLAP data store that we use as our de facto near-real-time analytics service. We’ve previously discussed how and why we built Pinot to power a wide spectrum of use cases, ranging from internal business intelligence dashboards that analyze highly-dimensional data to “Who Viewed My Profile,” which delivers member-facing analytics at high throughput and low latency.
In this blog, we’re excited to share our 0.3.0 update of Pinot, bringing broader support and ease of use to the community. Pinot first entered Apache incubation in late 2018; while it had been open sourced for a few years prior to that, this is when we saw Pinot gain traction. Over time, we saw adoption of Pinot grow steadily as more and more companies found value in its capabilities to provide fast analytics on fresh data for a variety of use cases. Today, the Pinot GitHub repository has 2.5k stars and 100+ contributors, while our Slack community has close to 250 members. Uber was an early adopter, and continues to use Pinot to power analytics for a variety of use cases including UberEats Restaurant Manager. Pinot is also used by Microsoft, Weibo, Factual, and many other companies to power both internal and external analytics use cases.
Areas of improvement
Working closely with some of the power users of Pinot helped us realize what it takes to run Pinot, especially in cases where the ecosystem was different from ours. One of the most important goals of the Pinot 0.3.0 release was to improve Pinot’s ease of use and extensibility. To achieve this objective, we identified the following four areas of focus:
Limited extensibility
Our big data analytics ecosystem is built on technologies such as Kafka, Hadoop, Avro, and Apache ORC. To be compatible with other stacks, version 0.1.x of Pinot included ways to plug in streams other than Kafka, but the implementation was still closely tied to Hadoop and Avro. While both are widely adopted, we recognized that certain use cases would have equally good or even better alternatives to Hadoop or Avro. Having a compile-time dependency on these technologies made it hard to integrate with other systems (e.g., S3, GCS, ADLS) and to ingest data in different formats (e.g., Parquet, ORC, Thrift).
Lack of cloud-native support
The foundational development work on Pinot predates the majority of LinkedIn’s public cloud integrations. Naturally, some of the tooling around Pinot wasn’t built to embrace cloud-native technologies such as blob stores, containers, Docker, and Kubernetes. This made it harder for the community to deploy and operate Pinot in the cloud.
Limited SQL support
One of the reasons users like Pinot is its query execution speed. At LinkedIn, Pinot handles over 120K queries per second while ensuring millisecond latency, pushing the limits of OLAP scale. To guarantee that latency SLA, we limited support to a subset of SQL syntax and deviated from standard SQL semantics. For instance, we changed the GROUP BY behavior to order results on multiple metrics in a single query, and features such as joins and expressions in filters were not supported. With SQL being the popular choice for analytics, these deviations of the Pinot Query Language (PQL) from SQL syntax and semantics made it difficult for users to interact with Pinot.
Developer-centric documentation
Better documentation was one of the most common pieces of feedback that we received from users. While Pinot did have ample documentation, it was developer-centric and not as friendly to users who wanted to try Pinot out. Pinot was built to power internal data analytics products, such as Who Viewed My Profile, Talent Analytics, Company Analytics, and many more, while being easy to operate. At LinkedIn, we continue to operate Pinot as a service for all verticals and have invested heavily in making Pinot highly available and operable. Complex operations such as adding nodes, provisioning new tables, making config and schema changes, or rebalancing a workload can be performed without any downtime. The issue was that users did not know right away that these features necessary to operate Pinot at scale were already built, and the lack of documentation for these operations became a common pain point.
New in Apache Pinot 0.3.0
Once we identified these priority areas of improvement, it was time to tackle them one by one to move towards a better Pinot for the community.
Introducing a plug-in architecture
Creating a plug-in architecture was not a quick fix: we had to completely overhaul Pinot’s code layout (modules and their dependencies). Over time, Pinot’s core module (pinot-core) had become a behemoth that had swallowed tons of dependencies, ranging from external systems such as S3, ADLS, Hadoop, and Spark, to data formats such as Avro and Parquet. It was important that the layout was simplified to make it easy for future contributors to add support for further system integrations. The first order of business was to abstract these interfaces out from the core module and provide them as pluggable implementations. The graph below shows just how complex the inter-module dependencies we originally had were.
[Figure: Pinot module dependency graph, before and after the plug-in refactoring]
Introducing full SQL support on Pinot
In Pinot 0.3.0, we did two things to make Pinot’s query language richer and more accessible:
- PQL to SQL: We moved from custom PQL to Calcite SQL. Apache Calcite is a popular open source framework for building databases and data management systems. It includes a SQL parser, an API for building expressions in relational algebra, and a query-planning engine. We have leveraged the Calcite SQL parser to parse queries in SQL format. However, Pinot continues to support only a subset of SQL; for instance, joins and nested queries are not supported. This is a design choice in Pinot to focus on providing fast analytics on a single table.
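As an illustration, here is how a typical top-N aggregation looks in the old PQL syntax versus the new Calcite-based SQL (the table and column names below are hypothetical):

```sql
-- PQL (pre-0.3.0): custom TOP clause instead of standard ORDER BY / LIMIT
SELECT COUNT(*) FROM profileViews
WHERE country = 'US'
GROUP BY browserType TOP 10

-- SQL (0.3.0, Calcite parser): standard GROUP BY semantics,
-- with explicit ordering and limit on the grouped results
SELECT browserType, COUNT(*)
FROM profileViews
WHERE country = 'US'
GROUP BY browserType
ORDER BY COUNT(*) DESC
LIMIT 10
```

Beyond the syntax, the SQL endpoint also returns group-by results as ordinary rows (group columns plus aggregations), matching what standard SQL clients expect.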
- Presto-Pinot connector: Pinot natively focuses on providing fast analytics on a single table, but users can now get full SQL functionality with the Presto-Pinot connector. We owe this to our partners at Uber, who were early users of Pinot and became big contributors to the project, continuously helping push the boundaries of Pinot in innovative ways. Support for joins is one of Pinot’s most frequently requested features. To achieve this while maintaining Pinot’s query execution speed for single-table queries, the Analytics Infra team at Uber built the Presto-Pinot connector, which lets users perform joins on data in Pinot. Unlike other Presto connectors, the Pinot connector has many optimizations built in to get the best of both Presto and Pinot, including predicate pushdown and aggregation/GROUP BY pushdown. The Presto-Pinot connector thus delivers fast analytics on single tables as well as richer analytics with support for joins and nested queries.
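For example, a Presto query like the following (the catalog, schema, table, and column names are illustrative) joins a Pinot table against a dimension table in another catalog; Presto pushes the filter, and where possible the aggregation, down into Pinot and performs the join itself:

```sql
SELECT r.restaurantName, SUM(o.orderTotal) AS revenue
FROM pinot.default.orders o
JOIN hive.dims.restaurants r
  ON o.restaurantId = r.restaurantId
WHERE o.orderDate >= DATE '2020-04-01'
GROUP BY r.restaurantName
ORDER BY revenue DESC
LIMIT 10
```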
Deep storage support
Prior versions of Pinot required shared storage (for example, NFS) to store a copy of data across controller nodes. We created the PinotFS abstraction to be able to plug other deep storage systems (e.g., GCS, ADLS) into Pinot as well, and provided implementations for Hadoop (pinot-hdfs), Azure Data Lake (pinot-adls), and NFS. As a result of our efforts to make it easy for contributors to create new implementations, we merged an open source contribution that adds support for Google Cloud Storage (pinot-gcs). S3 can be accessed via the pinot-hdfs plugin, but we will also be adding a new pinot-s3 plugin based on the native Amazon S3 APIs.
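As a sketch, swapping HDFS in as deep storage is a matter of configuration: the PinotFS implementation is registered per URI scheme. The property names and values below follow the pinot-hdfs plugin conventions but may differ across versions, and the HDFS path is a placeholder:

```
# Controller config: keep the segment copies on HDFS instead of NFS
controller.data.dir=hdfs://namenode:9000/pinot/controller
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.segment.fetcher.protocols.list=file,http,hdfs
```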
Enabling cloud-based deployment
One of the common pain points of the community was the absence of a standard cloud-based deployment model for Pinot. To address this, we collaborated with the open source community to add support for Kubernetes to help system administrators deploy Pinot across multiple cloud providers. We also added documentation with a step-by-step guide for getting the Pinot cluster up and running quickly on Kubernetes using Helm. Pinot clusters can now be spun up with just a few commands. Broader support for the cloud is also important for our internal use of Pinot as we look to build the next version of our infrastructure on Azure.
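As a rough sketch (the chart location, namespace, and release name are illustrative and may differ between Pinot and Helm versions), spinning up a cluster looks like this:

```shell
# Fetch the Helm chart shipped in the Pinot repository
git clone https://github.com/apache/incubator-pinot.git
cd incubator-pinot/kubernetes/helm

# Install Pinot (controller, brokers, servers, and Zookeeper) into its own namespace
kubectl create namespace pinot-quickstart
helm install pinot . -n pinot-quickstart
```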
Another common pain point we wanted to tackle was the lack of concise documentation on getting started. We created a new set of documentation and videos, specifically targeting first-time users, for a smoother onboarding and operations experience.
With Pinot 0.3.0, we now have the ability to run Pinot on any cloud, build extensions with ease, and fully support SQL via the Presto-Pinot connector. If you are interested in learning more about Pinot, consider becoming a member of the open source community by joining our Slack channel or subscribing to our mailing list. You can also sign up for our upcoming virtual Pinot meetup on May 5.
We are excited to see how developers and companies are using Apache Pinot to build highly scalable, low-latency, real-time analytics applications. Feel free to ping us on Twitter (@ApachePinot) or Slack with your stories and feedback.
Finally, here is a list of resources that you might find useful if you’re interested in starting your journey with Apache Pinot.
- Docs: http://docs.pinot.apache.org
- Download: http://pinot.apache.org/download
- Getting Started: https://docs.pinot.apache.org/getting-started
We would like to thank all Pinot committers for their relentless efforts to make Pinot better, Jennifer Dai, Jackie Jiang, Jialiang Li, Kishore Gopalakrishna, Neha Pawar, Seunghyun Lee, Siddharth Teotia, Subbu Subramaniam, and Xiang Fu; and contributors from the open source community, Alex Filipchik, Chetan UK, Devesh Agrawal, Elon Azoulay, Haibo Wang, Ting Chen, Venki Korukanti, and Zhenxia Luo for their vital contributions to Pinot’s growth.
We would also like to thank the LinkedIn Pinot SRE team for operating Pinot at LinkedIn scale, and the LinkedIn leadership, Shraddha Sahay, Eric Baldeschwieler, Kapil Surlaker, and Igor Perisic, for their guidance and continued support.