Super Tables: The road to building reliable and discoverable data products
September 28, 2022
Many companies, including LinkedIn, have experienced exponential data growth since adopting Apache Hadoop a decade ago. With a proliferation of self-service data authoring tools and publishing platforms, different teams have created and shared datasets to address business needs quickly. While self-service tools and platforms were a scalable and agile way for teams to unlock data value, they introduced multiple issues: 1) multiple similar datasets often led to inconsistent results and wasted resources, 2) a lack of standards for data quality and reliability made it hard to find a trustworthy dataset among the long list of potential matches, and 3) complex and unnecessary dependencies among datasets made maintenance difficult.
In this post, we present a unique approach to solving these problems via Super Tables. Super Tables is a company initiative for defining, building, and sharing high quality data products with formal ownership and commitments. In the following sections, we present what a Super Table is, why we need a Super Table, and its design principles. We also provide a brief introduction to two Super Tables in production, and the lessons learned during this journey.
What’s a Super Table?
Super Tables (ST) are pre-computed, denormalized, and consistently consolidated attributes and insights of entities or events that are optimized for common and efficient analytic use cases. STs have well-defined service level agreements (SLAs) and simplify data discovery and downstream data processing. Furthermore, STs come with enterprise-grade commitments needed for critical use cases. These commitments include high data quality and availability (i.e., users can depend on the STs), disaster recovery, proper documentation, maintenance, and governance. Since the introduction of the Super Table concept, two Super Tables have been built and are in production: JOBS and Ad Event. We will provide more details in a later section.
Our work in Super Tables fits right into Data Mesh initiative here at LinkedIn:
- Data-as-a-product is one of the major principles, and the Super Tables aim at the development of high quality data products with formal ownership and commitments.
- The entity/event encapsulated by each Super Table is owned/governed by a domain (possibly a virtual team) – a key concept in Data Mesh.
- Standardizing entities/events and the establishment of SLAs and contract commitment improve the interoperability of data products across different domains.
Why do we need Super Tables? What are the benefits?
At LinkedIn, we face challenges in multiple dimensions in the current data ecosystem, many stemming from the explosive growth in data acquisition and dataset authoring. Here are some of the most critical and common ones:
- Discoverability: Data needed for analytics is discovered and consumed from a number of Source-of-Truth (SOT) datasets as well as directly from their upstream sources (Apache Kafka topics, online databases, external and derived datasets). With the proliferation of many similar datasets (often owned by different teams), finding the right dataset becomes much harder.
- Reliability: With democratization of data authoring, data is duplicated in many similar datasets that have loosely defined and varying SLAs, data quality and freshness commitments. Computationally expensive joins are performed across these datasets in multiple downstream use cases. Some of these datasets have shared ownership across teams where each team may own a different part of the computation logic. This results in inefficient understanding and troubleshooting of any data issues.
- Change Management: Changes to the datasets or their sources are usually made without consulting with or even notifying all downstream consumers, occasionally resulting in breaking changes being deployed. This impacts business continuity and results in wasted effort to troubleshoot and fix issues.
Figure 1: Evolving the current data ecosystem to leverage Super Tables
The Super Tables initiative aims to mitigate these challenges, as illustrated in Figure 1. Our goal is to build only a small number of enterprise-grade Super Tables, perhaps only a few per domain, that are highly leveraged by many downstream users. By consolidating hundreds of SOTs into these few Super Tables, a large number of SOTs will eventually be deprecated and retired.
Improves Discoverability
- Having one or two Super Tables for a business entity or event (i.e., in a domain) makes it easier to explore and consume data for analytics. It minimizes data duplication and reduces the time to find the right data.
Strengthens Reliability & Usability
- Super Tables are materialized with precomputed joins of related business entities or events to consolidate data that is commonly needed for downstream analytics. This obviates the need to perform such joins in many downstream use cases leading to simpler and more performant downstream flows and better resource utilization.
- Super Tables provide and publish SLA commitments on availability and supportability with proactive data quality checks (structural, semantics and variance) and a programmatic way of querying the current and historical data quality check results.
Improves Change Management
- Super Tables have well-defined governance policies for change management and communication and committed change velocity (e.g., monthly deployment of changes). Upstream data producers and downstream consumers (cross teams/domains) are both involved in the governance of changes to the STs.
Super Tables design principles
In this section, we highlight several important design principles for Super Tables. We want to emphasize that most of these principles are applicable to any dataset, not just Super Tables.
Choosing the Right Data Sources
Building a data product requires a good understanding of the domain and the available data sources. It is important to establish the source of truth and document that information. Choosing the right sources promotes data consistency. For example, if a source is continuously refreshed via streaming, the same metric calculation may return different results depending on when it runs; consumers may not realize this and can be surprised by the outcome. Once sources are identified, we need to look at upstream availability and business requirements so that the ST's SLA can be established. One may argue that we should consolidate as many datasets into a single ST as possible. However, adding a data source to the ST increases the resources needed to materialize the ST and can jeopardize its SLA commitment. A good understanding of how the extra data source will be leveraged downstream (e.g., it is needed to compute a critical metric) is therefore warranted.
Schema Design and Evolution
Field naming conventions and field groupings are established so that users can easily understand the meaning of each field, and frozen fields (immutable values) are identified. For example, a job poster may switch companies, but the hiring company for the job posting shouldn't change; it is therefore important to include the hiring company information (immutable) rather than the job poster's current company, so that future flow executions do not change the field values. By default, any schema change (compatible or incompatible) in a data source does not affect the ST. For example, if a new column is added to a source, the column will not appear in the ST by default. Similarly, if an existing source column is deleted, its value is nullified and the ST team is notified. Schema changes in an ST are decoupled from its sources so that no upstream change can accidentally break the ST flow. Planned changes are documented and communicated to consumers through a distribution list.
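The schema-decoupling rule above can be sketched as an explicit projection: the ST declares its own columns and never inherits the source schema. This is an illustrative sketch only; the schema, field names, and collection of missing columns are assumptions, not LinkedIn's actual implementation.

```python
# The ST declares its columns explicitly; it never inherits the source schema.
# (Hypothetical schema for illustration.)
ST_SCHEMA = ["job_id", "hiring_company_id", "job_title"]

def project_to_st(source_record: dict, missing: set) -> dict:
    """Map a source record onto the ST schema.

    - A column newly added upstream is simply ignored (never selected).
    - A column deleted upstream is nullified, and its name is collected
      so the ST team can be notified.
    """
    row = {}
    for col in ST_SCHEMA:
        if col in source_record:
            row[col] = source_record[col]
        else:
            row[col] = None       # deleted upstream -> nullified in the ST
            missing.add(col)      # remember it for the notification
    return row

missing_cols = set()
# Upstream added "recruiter_note" and dropped "job_title".
source = {"job_id": 42, "hiring_company_id": 7, "recruiter_note": "new!"}
st_row = project_to_st(source, missing_cols)
# st_row -> {"job_id": 42, "hiring_company_id": 7, "job_title": None}
# missing_cols -> {"job_title"}, which would trigger an alert to the ST team
```

Because the projection only ever selects declared columns, a compatible upstream change (a new column) is invisible to the ST, and an incompatible one (a dropped column) degrades to NULLs plus a notification instead of a broken flow.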
Retention Policy
The retention policy of the ST is well established and published so that downstream consumers are fully aware of the policy and any future changes. Likewise, the retention policies of its data sources are tracked and monitored.
Establishing Upstream SLA Commitment
To meet the ST's own SLA commitment, the SLA commitments of all its data sources must be established and agreed on. Any changes are communicated so that proper actions can be taken.
Data Quality Checks
Data quality checks on all data sources, as well as on the ST itself, must be put in place. For example, the ST's primary key column should not contain NULLs or duplicates. Data profiles on fields are performed to identify potential outliers. If a data source has serious data quality issues, the ST flow may be halted until the issue is resolved.
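The primary-key checks described above can be sketched as follows. This is a minimal in-memory illustration with hypothetical helper and column names; a production flow would run equivalent checks on the cluster over the full dataset.

```python
from collections import Counter

def check_primary_key(rows, pk="job_id"):
    """Return a list of violations: NULL keys and duplicate keys."""
    violations = []
    keys = [r.get(pk) for r in rows]
    null_count = sum(1 for k in keys if k is None)
    if null_count:
        violations.append(f"{null_count} NULL value(s) in {pk}")
    # Count only non-NULL keys when looking for duplicates.
    dupes = [k for k, n in Counter(k for k in keys if k is not None).items() if n > 1]
    if dupes:
        violations.append(f"duplicate {pk} value(s): {sorted(dupes)}")
    return violations

rows = [{"job_id": 1}, {"job_id": 2}, {"job_id": 2}, {"job_id": None}]
issues = check_primary_key(rows)
# Two violations here: one NULL key, and job_id 2 duplicated.
# If a source fails badly enough, the ST flow would halt until it is fixed.
```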
Documentation
It is very important to have both dataset-level and column-level documentation available so that users can determine whether the ST can be leveraged. Very often, users want to understand the sources and how the ST and its fields are derived.
High Availability / Disaster Recovery
An ST aims to reach 99+% availability. For a daily ST flow, that translates to approximately one SLA miss per quarter. To improve availability, STs can be materialized in multiple clusters. With an active-active configuration, the ST flows execute independently and consistency is guaranteed across two clusters (the JOBS flows run on two production clusters). In the case of an SLA miss or a disaster, data can be copied from the mirror dataset on the other cluster.
Monitoring
ST flows must be monitored closely for any deviation from the norm:
- Flow performance: tracking the flow's runtime trend helps prevent SLA misses
- Data quality: the sources must adhere to established quality standards, and data quality metrics are shown on dashboards
- Cross cluster: the datasets in multiple clusters must be compared to detect deviations
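The cross-cluster comparison above can be sketched by computing a cheap fingerprint (row count plus a per-column checksum) of the same ST partition on each cluster and comparing the two. This is an illustrative sketch under assumed names; a real deployment would compute the fingerprints on each cluster and ship only the small summaries for comparison.

```python
import hashlib

def fingerprint(rows, columns):
    """Row count plus an order-insensitive checksum per column."""
    fp = {"row_count": len(rows)}
    for col in columns:
        digest = 0
        for r in rows:
            h = hashlib.md5(repr(r.get(col)).encode()).hexdigest()
            digest ^= int(h, 16)  # XOR makes the checksum order-insensitive
            # (note: pairs of identical values cancel out under XOR; a
            # production check would use a stronger aggregate)
        fp[col] = digest
    return fp

def deviates(cluster_a_rows, cluster_b_rows, columns):
    """True if the two clusters' copies of the partition differ."""
    return fingerprint(cluster_a_rows, columns) != fingerprint(cluster_b_rows, columns)

a = [{"job_id": 1, "title": "SWE"}, {"job_id": 2, "title": "PM"}]
b = [{"job_id": 2, "title": "PM"}, {"job_id": 1, "title": "SWE"}]  # same data, different order
# deviates(a, b, ["job_id", "title"]) is False; a mismatch would page the ST team
```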
Governance
A governance body (comprising teams from upstream and downstream) is established to ensure that the ST design principles are followed, that the dataflow operation meets its SLA and preserves data quality, and that changes are communicated to consumers. For example, if a downstream user wants to include a new field from a source, the user must submit the request to the governance body for evaluation and recommendation. In the past, a very large field might have been included automatically, blowing up the final dataset size and jeopardizing the delivery SLA; now the governance body weighs all the tradeoffs in its final recommendation. A minimum monthly release cadence is established for accepting change requests so that the ST has the agility to serve business needs.
A brief introduction to two Super Tables
The first Super Table is JOBS. Before the JOBS ST was built, there were a dozen tables and views accumulated over the years; each joined one or more job posting datasets with different dimensional tables for various analytics, such as job poster information and standardized/normalized job attributes (location, titles, industries, skills, etc.). These tables/views may be created in different formats (such as Apache Avro or Apache ORC) or on different clusters, and may have different arrival frequencies and times (SLAs). Adding more complexity, there are many different kinds of jobs, some of which may be deleted or archived, and various job listing and posting types. Choosing the right dataset for a particular use case requires the complicated task of understanding and analyzing various data sources, joining the right dimensional datasets, and in some cases repeating the same join redundantly. As such, the learning curve is steep.
The JOBS ST combines data elements from 57+ critical data sources owned by multiple LinkedIn teams across different organizations. Totaling 158 columns, the JOBS ST precomputes and combines the information most needed for job-related analyses and insights, with the intent that it can be leveraged extensively for the most critical use cases. The JOBS ST has a daily morning SLA, and the daily flow runs on two different production clusters to provide high availability. Data quality of all data sources, and of JOBS itself, is enforced and monitored continuously.
Leveraging the JOBS ST is easier and more efficient than using its original source datasets. In fact, many downstream flows are simplified to just scanning the JOBS ST, or joining it with one other dimensional table, making the logic more efficient. With the availability of JOBS, the existing dozen job-related tables/views will be deprecated and their consumers migrated to JOBS.
The second Super Table is Ad Events. Before the Ad Events ST was built, there were seven different advertisement (ads) related tables, including ad impressions, clicks, and video views. They have many fields in common, such as campaign and advertiser information. Downstream consumers frequently need to join multiple ads tables with the campaign and advertiser dimension tables to get insights on ads revenue, performance, etc. The duplicated fields and frequent downstream joins added unnecessary storage and computation costs.
Upon analyzing the commonly joined tables and downstream metrics, the Ad Events ST was created with 150+ columns to provide precomputed information ready for ads insights analysis and reporting.
Lessons learned
From designing and implementing the Super Tables, we have learned many valuable lessons. First, understanding the use cases and usage patterns is crucial in determining the main benefits of a Super Table. Before building a new Super Table, it's important to look around and see what similar tables are already in place. Sometimes it's better to strengthen an existing table than to build one from scratch. When weighing these two options, factors to consider include the quality, coverage, support, and usage of the existing tables.
Next, we learned that while building Super Tables, it's important to identify the semantic logic and its owners, to ensure that it is correct, and to understand how it will evolve over time. Data transformation includes structural logic (e.g., table joins, deduplication, and data type conversion) and semantic logic (e.g., a method of predicting the likelihood of a future purchase based on browsing history). Semantic logic is usually owned by a specific team with deep domain knowledge. Without proper communication and collaboration, semantic logic would become poorly maintained and outdated. A better solution is to separate semantic logic into a different layer (such as Dali Views built on top of the ST) managed by the team with the domain knowledge.
With Super Table SLAs, the stricter the SLA, the less tolerant you can be of issues. This means implementing mechanisms like defensive data loading, safe data transformation, ironclad alerting and escalation, and runtime data quality checks. With softer SLAs, you can tolerate a failure, then triage and resolve the issue; with strict SLAs, you sometimes cannot tolerate a single failure.
Another learning was that, in an age with an emphasis on privacy, where data inaccuracy can have disastrous cascading effects and lowering costs is at the forefront, reducing redundant datasets is the closest thing to a panacea there is. Often, it is better to align a Super Table to fulfill a distinct use case and ensure all requirements are met, instead of tolerating duplication for the sake of speed. Likewise, consolidating multiple similar datasets (which target different use cases) into a single Super Table is critical.
Lastly, after the release of an ST, one of the critical tasks is to migrate users of the legacy SOTs to the new ST. The sooner the migration is complete, the more resources can be freed up for other tasks. We have learned that it is imperative to provide awareness and support to the downstream users who need to perform the migrations. To that end, we have created a migration guide that outlines all the impacted SOTs, including detailed column-to-column mappings and suggested validations to ensure data correctness and quality. The release of the JOBS ST has significantly improved the lives of both the owners of jobs data sources and the downstream consumers. Before the ST, knowledge of the jobs business logic was scattered across several data sources owned by different teams, and none of them had the full picture, making it extremely difficult for downstream consumers to figure out which data source was right for their use case. Time-consuming communication among different teams was usually required, and it could easily lead to misuse of data. Since the release of the ST, we have formed a governance body involving the various stakeholders, which manages the evolution and maintenance of the ST. For instance, when ST consumers requested a new field, the governance body discussed the best design and implementation approach, which involved gathering raw data from the sources and implementing the transformation logic in the ST. The monthly release cadence allows development agility, and JOBS data is now integrated with easier access and much less confusion.
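The column-to-column validations suggested in such a migration guide can be sketched as a join on the key columns followed by a per-column comparison. The mapping, table shapes, and helper below are illustrative assumptions, not the actual migration tooling.

```python
# Hypothetical excerpt of a migration guide's mapping: legacy column -> ST column.
COLUMN_MAPPING = {
    "posting_id": "job_id",
    "company":    "hiring_company_id",
}

def validate_migration(legacy_rows, st_rows, mapping, key=("posting_id", "job_id")):
    """Return mismatches between legacy and ST rows joined on the key columns."""
    legacy_key, st_key = key
    st_by_key = {r[st_key]: r for r in st_rows}
    mismatches = []
    for row in legacy_rows:
        st_row = st_by_key.get(row[legacy_key])
        if st_row is None:
            mismatches.append((row[legacy_key], "missing in ST"))
            continue
        # Compare every mapped column pair for this key.
        for old_col, new_col in mapping.items():
            if row[old_col] != st_row[new_col]:
                mismatches.append((row[legacy_key], f"{old_col} != {new_col}"))
    return mismatches

legacy = [{"posting_id": 1, "company": 7}]
st     = [{"job_id": 1, "hiring_company_id": 7}]
result = validate_migration(legacy, st, COLUMN_MAPPING)
# An empty result means this sample migrated cleanly.
```

Running such a check on a sample of keys before cutting a downstream flow over to the ST gives consumers concrete evidence that the mapped columns agree.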
Conclusions and future work
Democratization of data authoring brings agility to analytics throughout the LinkedIn community. However, it introduces challenges such as discoverability, redundancy, and inconsistency. We have launched two Super Tables at LinkedIn that address these challenges, and we are in the process of identifying and building more. Both STs have simplified data discovery by providing the "go-to" tables. They have also simplified downstream logic and hence saved computation resources. The created value is amplified by the high-leverage nature of these tables.
The following table summarizes the benefits of building and leveraging Super Tables.
To scale the Super Table initiative across the entire LinkedIn community, we have developed a franchising model with a cookbook - any domain team can easily follow the cookbook to build a Super Table, which is essentially a high quality data product.
We would like to thank our early partners (especially Tejwinder Singh, Abe Cabangbang, and Steve Na) who provided us with valuable and timely feedback. Jimmy Hong introduced the concept of Super Tables to the team. Jimmy Hong, Sofus Macskassy, Zheng Shao, and Ketan Duvedi were instrumental and supportive throughout the entire project.