Taking Charge of Tables: Introducing OpenHouse for Big Data Management

Sumedh Sakdeo

Creator openhousedb, ex-Head of Data Lyft Self Driving Division

July 19, 2023

Co-Authors: Sumedh Sakdeo, Lei Sun, Sushant Raikar, Stanislav Pak, and Abhishek Nath

Introduction

At LinkedIn, we build and operate an open source data lakehouse deployment to power Analytics and Machine Learning workloads. Leveraging data to drive decisions allows us to serve our members with better job insights, and connect the world’s professionals with each other.

Open source data lakehouse deployments are built on the foundations of compute engines (like Apache Spark, Trino, Apache Flink), distributed storage (HDFS, cloud blob stores), and metadata catalogs / table formats (like Apache Iceberg, Delta, Hudi, Apache Hive Metastore). End-users create relational entities in the form of Tables over structured or semi-structured data using compute engines, with the metadata for a Table stored in a catalog, and data stored in distributed storage.

While functional, our current setup for managing tables is fragmented. The individual building blocks of compute engines, distributed storage, and metadata catalogs operate independently as part of an overall data plane, and there is currently no system in open source that unifies them through a single control plane. As a result, data scientists, data engineers, and product engineers have to juggle multiple systems and manage tables individually. This adds toil: the extra complexity and potential inconsistencies distract developers from their core product focus. A unified control plane is crucial for simplifying lakehouse management, organizing data for optimal query performance, instituting governance, and managing metadata declaratively, all of which improve the developer experience. What developers are asking for is a way to declaratively specify table definitions and policies using an API such as SQL, and have the lakehouse take care of the rest.

To provide an experience designed to reduce toil for product engineering and take charge of tables, we built and deployed OpenHouse, a control plane that allows our developers to interface with managed tables in our open source data lakehouse.

In this blog post, we will discuss the guiding principles outlined for OpenHouse and the northstar UX when interfacing with OpenHouse tables. We’ll also introduce OpenHouse’s control plane, specifics of the deployed system at LinkedIn including our managed Iceberg lakehouse, and the impact and roadmap for future development of OpenHouse, including a path to open source.

OpenHouse for Big Data Management

When building OpenHouse, we followed these four guiding principles to ensure that data platform teams and big data users could self-serve the creation of fully managed, publicly shareable, and governed tables in open source lakehouse deployments.

Tables (not files/blobs) are the only API abstraction for end-users. All accesses to table data must go through a table interface; no direct read/write is permitted to files or blobs on distributed storage for tabular data.
Tables are stored in a protected storage namespace that the control plane has full control over. Having full control allows the control plane to be opinionated about management aspects such as data organization, transactional semantics, security, high availability, disaster recovery, and quotas.
Tables are governed as per agreed upon company standards. This allows organizations to enforce constraints on data models, compliance annotations, and other metadata.
Tables are maintained regularly. This includes optimizing performance by adjusting sorting, partitioning, and clustering strategies based on query statistics, as well as garbage collecting expired versions.

Figure 1: Northstar UX

Figure 1 shows the northstar user experience OpenHouse is building towards. This flow allows users to create a table, manipulate table metadata, load data, and share the table with a single chain of API calls, without losing their train of thought. In this user experience, most of the API calls can be made using standard SQL or DataFrame syntax.

-- create table in openhouse
CREATE TABLE openhouse.db.table (id bigint COMMENT 'unique id', data string);


-- manipulate table metadata
ALTER TABLE openhouse.db.table_partitioned SET POLICY ( RETENTION=30d );
ALTER TABLE openhouse.db.table ALTER COLUMN measurement TYPE double;
ALTER TABLE openhouse.db.table SET TBLPROPERTIES ('key1' = 'value1');


-- manipulate table data
INSERT INTO openhouse.db.table VALUES (1, 'a');


-- share table
ALTER TABLE openhouse.db.table_partitioned SET POLICY ( SHARING=true );
GRANT SELECT ON TABLE openhouse.db.table TO user;
      

Control Plane for Tables

The core of OpenHouse's control plane is a RESTful Table Service that provides secure and scalable table provisioning and declarative metadata management. Furthermore, it can be configured to automatically orchestrate data services that keep tables in a user-configured (e.g., retention, replication), optimal (e.g., storage compaction, sorting, clustering), and compliant (e.g., GDPR purge) state. Figure 2 shows how OpenHouse fits into broader open source lakehouse deployments.

Figure 2: OpenHouse Control Plane

The table service acts as a central metadata repository (i.e., a catalog). It exposes standard catalog APIs that allow users to perform CRUD operations on managed OpenHouse tables. In many ways, it can be seen as an evolution of the Hive Metastore, with these additional capabilities:

Table service offers declarative table management APIs, i.e., a client only needs to provide the desired state for a managed table. The table service works with data services to guarantee that the observed state of the table is reconciled to the desired state.
Table service provides a way to securely share tables, with built-in role-based access control for table operations. Additionally, it abstracts away all the underlying FileSystem and BlobStore permissioning schemes from the end-user.
Table service acts as a gateway to enforce data quality constraints, governance rules, and data modeling standards.
Table service is opinionated about how tables are laid out in an HDFS namespace or Blob Store bucket and how quotas are managed.
Core table service APIs are designed to allow support for multiple table formats, specifically, Iceberg, Delta, and Hudi. Any format specific features are implemented as API extensions, without impacting the core Table APIs.
Table service is built to be horizontally scalable, to prevent noisy neighbors, and to provide granular observability into table access patterns.
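
The declarative model in the list above can be illustrated with a small sketch. This is not OpenHouse's implementation: the `TableState` fields and the action names are hypothetical, chosen only to show how a desired state might be compared with an observed state to decide what work the data services need to do.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TableState:
    """Hypothetical subset of the configuration a client might declare."""
    retention_days: Optional[int] = None
    replicated: bool = False


def plan_actions(desired: TableState, observed: TableState) -> List[str]:
    """Return the (hypothetical) actions that move `observed` toward `desired`."""
    actions = []
    if desired.retention_days != observed.retention_days:
        actions.append(f"set_retention:{desired.retention_days}")
    if desired.replicated != observed.replicated:
        actions.append(
            "enable_replication" if desired.replicated else "disable_replication"
        )
    return actions


# The client states only what it wants; the plan is derived from the difference.
desired = TableState(retention_days=30, replicated=True)
observed = TableState()
plan = plan_actions(desired, observed)
```

When the observed state already matches the desired state, the plan is empty, so a reconciliation loop built this way can be run repeatedly without side effects.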

Data services are a set of table maintenance jobs that keep the underlying storage in a healthy state. These include a wide variety of built-in compaction jobs that optimize table storage to reduce load on the data storage system and optimize user queries, purger jobs that keep the tables in a compliant state, and cross-cluster replication jobs for disaster recovery. The framework itself is extensible to run custom jobs.

Deployed system at LinkedIn

Figure 3: Deployed System

Figure 3 shows system components of OpenHouse deployed at LinkedIn. Each component is numbered and its purpose is as follows:

1. Table service: This is a RESTful web service that exposes tables as REST resources. This service is deployed on a Kubernetes cluster with a fronting Envoy Network Proxy.
2. REST clients: A variety of applications use REST clients to call into table service (#1). Clients include but are not limited to compliance apps, replication apps, data discovery apps like DataHub, infrastructure-as-code (IaC) tools such as Terraform providers, and data quality checkers. Some of the apps that work on all the tables in OpenHouse are assigned higher privileges.
3. Metastore Catalog: Spark, Trino, and Flink engines are a special flavor of REST clients. An OpenHouse-specific metastore catalog implementation allows engines to integrate with OpenHouse tables.
4. House database service: This is an internal service that stores table service and data service metadata. This service exposes a key-value interface that is designed to use a NoSQL DB for scale and cost optimization. However, the deployed system is currently backed by a MySQL instance, for ease of development and deployment.
5. Managed namespace: This is a managed HDFS namespace where tables are persisted in Iceberg table format. Table service is responsible for setting up the table directory structure with appropriate FileSystem permissioning. OpenHouse has a novel HDFS permissioning scheme that makes it possible for any ETL flow to publish directly and securely to Iceberg tables in a managed HDFS namespace.
6. Data services: This is a set of data services that reconcile the user- or system-declared configuration with the configuration observed by the system. This includes use cases such as retention, restatement, and Iceberg-specific maintenance. Each maintenance activity is scheduled as a Spark job per table. A Kubernetes CronJob runs on a schedule to trigger each maintenance activity. All the bookkeeping of jobs is done in the House database service using a jobs metadata table, for ease of debugging and monitoring.
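
The scheduling and bookkeeping flow described for data services can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed code: `JobsTable` is an in-memory stand-in for the jobs metadata table, `submit` is a hypothetical callable standing in for launching a Spark job, and the job id format and state names are invented for the example.

```python
from datetime import datetime, timezone


class JobsTable:
    """In-memory stand-in for a jobs metadata table behind a key-value interface."""

    def __init__(self):
        self._rows = {}

    def put(self, job_id, row):
        self._rows[job_id] = dict(row)

    def get(self, job_id):
        return self._rows[job_id]

    def by_state(self, state):
        return sorted(j for j, r in self._rows.items() if r["state"] == state)


def trigger_maintenance(activity, tables, jobs, now, submit):
    """One scheduled tick: run one job per table and record every outcome."""
    for table in tables:
        job_id = f"{activity}:{table}:{now:%Y%m%dT%H%M%S}"
        row = {"table": table, "activity": activity, "state": "QUEUED"}
        jobs.put(job_id, row)
        try:
            submit(activity, table)  # hypothetical: would launch a Spark job
            row["state"] = "SUCCEEDED"
        except Exception as exc:
            # A failure on one table is recorded and does not stop the others.
            row["state"] = "FAILED"
            row["error"] = str(exc)
        jobs.put(job_id, row)


def fake_submit(activity, table):
    """Test double that fails for one table to show how errors are recorded."""
    if table == "db.bad":
        raise RuntimeError("boom")


jobs = JobsTable()
tick = datetime(2023, 7, 19, tzinfo=timezone.utc)
trigger_maintenance("retention", ["db.good", "db.bad"], jobs, tick, fake_submit)
```

Keeping one row per job and per table is what makes the bookkeeping useful for debugging: a failed run can be traced to a specific table and activity without inspecting the scheduler itself.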

Architecturally, OpenHouse is built to run in any cloud environment, using blob stores, managed compute, and cloud databases. Both the table service and the data services are packaged as containers, which should make them easy to deploy in diverse environments. We are working on Terraform recipes that would automate deployment of the entire stack in minutes.

Managed Iceberg Lakehouse

At LinkedIn, OpenHouse tables are persisted on HDFS in Iceberg table format. Compared to Hive table format, Iceberg allows us to improve the reliability of tables on HDFS by providing features like incremental data processing, snapshot isolation, ACID transactions, and reproducible data flows through time travel queries.

Building a functional, scalable, and easy-to-use lakehouse architecture with Iceberg as the table format required us to make new foundational investments. We invested in various data services that can work with the Iceberg table format.

To keep the tables optimal, we automated orchestration of Iceberg maintenance jobs such as snapshot expiration, orphan file deletion, quarantine zones for deleted files, and manifest compaction.
To keep tables compliant, we have built data services that can delete data based on user requested purging and time partition expiration.
To provide data disaster recovery, we have built a data service that can replicate Iceberg snapshots efficiently across data centers.
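
To make two of the maintenance activities above concrete, here is a minimal sketch of the selection logic behind time partition expiration and snapshot expiration. The function names, parameters, and the `min_keep` safeguard are assumptions made for illustration; they are not OpenHouse or Iceberg APIs, although Iceberg's own snapshot expiration offers comparable age and retain-last controls.

```python
from datetime import date, datetime, timedelta


def expired_partitions(partition_dates, retention_days, today):
    """Return date partitions that fall strictly before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


def snapshots_to_expire(snapshots, max_age, now, min_keep=1):
    """Pick snapshot ids older than `max_age`, always keeping the newest `min_keep`.

    `snapshots` is a list of (snapshot_id, committed_at) pairs.
    """
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    return [sid for sid, ts in newest_first[min_keep:] if now - ts > max_age]


# A 30-day retention policy evaluated on July 19, 2023.
partitions = [date(2023, 6, 1), date(2023, 6, 19), date(2023, 7, 18)]
to_drop = expired_partitions(partitions, 30, date(2023, 7, 19))

snapshots = [
    (1, datetime(2023, 7, 10)),
    (2, datetime(2023, 7, 18)),
    (3, datetime(2023, 7, 19, 11)),
]
to_expire = snapshots_to_expire(
    snapshots, timedelta(days=3), datetime(2023, 7, 19, 12)
)
```

Always retaining the newest snapshot, regardless of age, keeps a rarely written table readable even after every snapshot has passed the age threshold.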

Finally, all our data services can be triggered almost instantaneously as Iceberg snapshots are committed.

Impact

In LinkedIn’s data lakes, two distinct categories of tables have emerged: centrally managed tables and self-managed tables. Centrally managed tables offer public sharing capabilities and robust table management support, including compaction and replication. On the other hand, self-managed tables are private to end-users and lack consistent management practices. Surprisingly, 65% of tables fall under the self-managed category, indicating a need for a more streamlined approach.

Our centrally managed platform imposes a laborious onboarding process that depends on human intervention and takes significant time. It takes 2 to 3 weeks to onboard tables, and the ingestion is eventually consistent, creating operational complexities for both Site Reliability Engineers (SREs) and end-users.

With OpenHouse, end-users can self-serve creation of centrally managed, publicly shareable, and compliant tables in seconds. By eliminating the friction and operational complexities of traditional onboarding processes, OpenHouse empowers end-users to collaborate effectively while ensuring granular table sharing and adherence to compliance requirements, thereby transforming the way data lakes are operated.

Roadmap

OpenHouse has been deployed since late 2022 and serves a portion of our production traffic from LinkedIn’s GoToMarket systems that support LinkedIn Sales and Marketing. Our data engineers and data scientists who use dbt to create ETL flows were among the first to utilize this new system. Over the coming quarters, we will ramp production to serve the entirety of LinkedIn's data lakehouse tables. We expect to share more details in future posts, as well as any further plans to open source this technology early next year.

Acknowledgements

Big thanks to team members who have relentlessly shipped incremental milestones and delivered customer impact for this multi-year initiative: Lei Sun (founding engineer), Sushant Raikar, Stanislav Pak, Abhishek Nath, Malini Venkatachari, Rohit Kumar, Levi Jiang, Manisha Kamal, Swathi Koundinya, Vishal Saxena, and Naveen Selvaraj.

Over a year and a half ago, OpenHouse was incubated under the leadership of Sumitha Poornachandran, and she remains our unwavering pillar of support. Also, huge thanks to our executive leadership, Renu Tewari, Kartik Paramasivam, and Raghu Hiremagalur, for their continuous support and for believing in OpenHouse.

Many thanks to Eric Baldeschwieler, Owen O’Malley, Sriram Rao, Vasanth Rajamani, and Kapil Surlaker for their thought leadership, which helped shape the product value proposition. We are also grateful to the peer reviewers of this blog: Erik Krogen, Daniel Meredith, and Diego Buthay.

Finally, OpenHouse is a product of many passionate discussions with leads across LinkedIn: Walaa Eldin Moustafa, Bhupendra Jain, Ratandeep Ratti, Kip Kohn, Issac Buenros and Maneesh Varshney.

Topics: Data, Data Management, Infrastructure