Dali Views: Functions as a Service for Big Data
November 9, 2017
Big challenges in the big data ecosystem
At LinkedIn, we have a number of challenges managing data in our complex data ecosystem. Changes to our infrastructure are often necessary to make progress, but they are difficult to accomplish without an expensive, large-scale, coordinated effort. Analytics processing systems are constantly improved to work more efficiently with different storage formats. Presto, for example, is typically a lot more efficient when running on columnar storage formats, as opposed to row-based formats. When a new system like Presto shows up, it is incredibly difficult to introduce a new storage format without disrupting existing applications, as they are often accessing raw data directly. Similarly, when other physical characteristics of data change, such as location or partitioning strategy, applications break and the war rooms begin. Finally, at LinkedIn’s scale, we are constantly hitting the limits of Hadoop scalability. We are thus forced to federate data across multiple clusters for scaling purposes. Assumptions about data availability in a particular cluster can often break applications in unexpected ways.
There is general awareness about the infrastructure-induced challenges in the big data community. What is often overlooked is that there are similar (if not more challenging) problems associated with dataset changes. Both schema changes and semantic changes can have widespread downstream impact. For example, when a field representing granularity of time is modified from seconds to milliseconds, a number of downstream processes are impacted because of assumptions in the code that are no longer true. These changes place a steep tax on the productivity of the team owning the dataset, as well as on teams responding to these changes. This effort involved in making such changes invariably overshadows the benefits, and as a result, necessary changes tend to get put off, slowing innovation. Refactoring code is an essential part of software craftsmanship. The art of refactoring data is just as vital as managing firmware or hardware upgrades, but is often overlooked.
What is Dali?
From these experiences, we have realized that challenges within the data ecosystem broadly fall into two categories: insulation from infrastructure changes, and data agility. There have been many attempts to tame the chaos associated with infrastructure changes. The most popular theme in this vein is the creation of a layer of abstraction separating physical details from logical concerns. In the big data space, a few successful incarnations of this idea can be seen in Apache Hive, Hive’s HCatalog component, and Twitter’s Data Abstraction Layer (DAL). Dali (Data Access at LinkedIn) draws inspiration from these efforts, yet it provides a unique point in the design space by addressing both types of challenges defined above.
At its core, Dali provides the ability to define and evolve a dataset. This is not unlike efforts like HCatalog, where a logical abstraction (e.g., a table with a well-defined schema) is used to define and access physical data in a principled manner through a catalog. Dali’s key distinction is that a dataset need not be restricted to just physical datasets. One can extend the concept to include virtual datasets, and treat physical and virtual datasets interchangeably. A virtual dataset is nothing but a view that allows transformations on physical datasets to be expressed and executed. In other words, a virtual dataset is the ability to express datasets as code. Depending on the dataset’s usage, it is very cost-effective for the infrastructure to move between physical and virtual datasets improving infrastructure utilization. Heavily-used datasets can be materialized, and less heavily-used datasets can stay virtual until they cross a usage threshold. As we will see in the following sections, there are several benefits to creating virtual datasets from a data agility perspective as well.
Dali at its core is the following components: a catalog to define and evolve physical and virtual datasets, a record-oriented dataset layer that enables applications to read datasets in any application environment, and a collection of development tools allowing data to be effectively treated like code by integrating with LinkedIn’s existing software management stack.
What are Dali Views (AKA virtual datasets)?
Why Dali Views?
Let us home in further on the need for virtual datasets with two simple examples. At LinkedIn, Profile is a popular dataset that is deeply nested, making it inconvenient to use with data processing layers like Pig and Hive (that have historically not worked well with nested data). Not solving this problem in a systematic manner results in a proliferation of unmanaged copies of vastly similar code floating around wherever the Profile dataset is accessed. A change to the schematic or semantic nature of the Profile dataset will trigger a whole range of unpredictable behavior across the entire data ecosystem. To solve this problem, we needed to provide both flattened and nested views of the data. We could go about this in two ways: (a) materialize two copies of the data, or (b) provide a virtual dataset that performs the necessary flattening function that is accessible to all data processing frameworks at run time. The latter has the benefit of reducing the storage costs associated with generating a materialized copy from the nested one.
Here is another motivating example that we have run into repeatedly at LinkedIn. Every new upgrade of our mobile or desktop application invariably introduces new tracking events. We used to track all page views (irrespective of source) in the same Kafka topic/HDFS dataset. As a result, PageViewEvents had a complicated schema, owing to the fact that it had to account for all kinds of page views. We decided to change this to having one topic per kind of page view. For example, when we launched a new version of our mobile app, we decided to add a new dataset version called VoyagerPageView. Consumers who need to read all PageViews could now use a DaliView that unions both PageView and VoyagerPageView. In future, as new PageView event types are added, the view is updated to include the new event, with no changes required from consumers. When new event types are added, we just push new versions of the view. Consumers can continue to consume from the view while changes occur behind the scenes. Examples like these illustrate that dataset management requires principled solutions that go beyond data abstraction layers that have been built in the past.
Dali Views: Under the hood
So, how does it all work? Dali Views work on the philosophy of treating datasets just like code. At a high level, engineers write code that defines a view, which is deployed programmatically. An engineer defines a Dali View in a Git repository in a file containing a SQL statement. Dali Views are expressed as SQL statements using SELECT-PROJECT queries. User Defined Functions (UDFs) and other dependencies essential to the view are explicitly noted in a dependency file. From this point on, the view is treated just like any other software artifact. An engineer submits a review request, gets an approval by the owners of the view repository, and publishes the view to Artifactory. Deployed Dali Views are thus immutable and versioned, as every change results in a new version being published to Artifactory, and multiple versions of the same View can happily coexist. Publication of the View to Artifactory further triggers an automated process that deploys the View definition to Dali Catalog instances running in each cluster. Processing frameworks like Hive, Spark SQL, Pig, and Samza are now able to use the Dali Record Reader APIs to consume data from Views using the definition of the View stored in the metastore. Today, this is accomplished by embedding a SQL transformation engine in the record reader APIs of all these processing frameworks. Currently that engine is Hive, which we chose a couple of years ago based on a variety of practical concerns. However, we view this as an implementation detail and fully expect to replace it with a different engine at some point in the future.
Benefits and limitations of Dali Views
Code reuse: Logic which would otherwise appear as boilerplate code at the beginning of thousands of scripts is now consolidated, centralized, and manageable.
Decoupling the producer-consumer dependency chain: Dali Views are versioned. This is critical because it decouples the pace of evolution between producers and consumers, replacing what would be an all-or-nothing and expensive atomic push upgrade with an incremental pull upgrade initiated by the consumer. Consumers are free to inspect what changes have occured to the data, and choose to upgrade at a pace they are comfortable with.
Explicit ownership and contracts: Since Views are treated like any other software artifact, ownership can be clearly expressed through the ACLs on our source code repositories. Further, we can associate the dataset with semantic constraints (e.g., this field should always be positive) and these constraints help form the contract that the producer of the dataset is willing to support. Consumers can request changes and extensions to this contract through standard software engineering processes, like pull requests. Managing Views as software artifacts also allows us to leverage existing systems at LinkedIn for discovering artifacts and tracking and traversing the dependencies that exist between artifacts.
Infrastructure optimizations: In addition to the above listed benefits to data agility, there is a significant benefit from an infrastructure usage perspective as well. Because Dali Views are expressed in SQL, all standard optimizations available to SQL engines are freely available to Dali Views. For example, we can leverage standard SQL engine optimizations like query rewrite mechanisms, predicate pushdown, partition pruning, and column pruning. In addition, these optimizations can be global optimizations across several Dali Views, enhancing the overall utilization of the compute environment.
Dali Views today can describe filtering, column projections, and map-side joins. Currently, aggregations and big table joins are not supported, though we envision eventually extending support to include these operations once we add support for materialized views. We are also exploring leveraging Apache Calcite to provide richer operators in Dali Views (some details on this are in the next section).
What’s next for Dali Views?
At LinkedIn, Dali Views played a pivotal role in recent application upgrades of both mobile and desktop versions of the LinkedIn application. Usually, a product launch of this complexity will result in countless hours of engineering time associated with changes required to process new metrics. Dali Views made the process significantly simpler this time because only a View definition was changed, and none of the downstream scripts needed to be touched even once. We now have over 300 Views deployed in production.
While the proof of concept for Dali Views started at one end of the data pipeline, we believe that these patterns can be generalized and applied more broadly to our data ecosystem. For example, we extended Dali Views to work on top of streaming systems (Samza at LinkedIn). This gives consumers the ability to use the same datasets in both batch and stream systems. We are also looking to support this ability to consume the same virtual dataset in even more contexts (e.g., for online lookups), and from a variety of underlying serving systems. Furthermore, this could provide a unified namespace across all execution engines and environments.
Finally, we are making significant architectural changes to the mechanisms for reading records from a virtual dataset in different data processing frameworks. As noted earlier, today user logic in Dali datasets is executed by embedding a SQL transformation engine in the record reader APIs of all these processing frameworks. However, we have been very impressed with the progress of Apache Calcite and have begun using Calcite’s relational algebra as a platform-independent intermediate representation in a number of places within LinkedIn. We will talk more about these efforts in later blog post, but with regard to Dali, we are looking at representing the user logic in Dali in this same IR and translating into native representations in each platform (e.g., Presto SQL, Spark Dataframe, etc.). This has multiple benefits: it removes performance bottlenecks and query optimization barriers associated with using two different execution engines and allows us to support more complex views (with joins, aggregations, etc.). We will continue using Dali readers to hide changes in the physical representation of data.
We hope to have convinced you that infrastructure and data agility problems require a principled abstraction. In Dali, we provide one promising point in the design space by integrating standard practises from software engineering with data management concepts from the database and distributed systems communities. A single blog post can hardly capture years of work. However, we are truly excited to finally unveil the motivation and vision of the Dali project. In the months ahead, we will continue be dive deeper into other components like the Dali Readers, Dali tooling, and Dali for stream processing. Stay tuned!
Dali is a labor of love of its development team. A big shoutout to Anthony Hsu, Ratandeep Ratti, Fangshi Li, Rohan Agarwal, Sunitha Beeram, Walaa Eldin Moustafa, and Chris Chen. Complicated infrastructure projects like Dali require significant and sustained commitment from management. Suja Viswesan, Shrikanth Shankar, Kapil Surlakar, and Igor Perisic: thank you for your unyielding support and guidance. Dali has benefitted from fruitful (and often passionate) discussions with several key LinkedIn engineers. Maneesh Varshney, William Vaughan, and Shirshanka Das: thank you for the thoughtful and productive feedback throughout this journey.