Incremental Data Capture for Oracle Databases at LinkedIn: Then and Now

Saurabh Goyal

🚀 20yrs in Building Scalable Platforms for the Future of Tech - [Reliability|Data |Engineering Excellence]

November 22, 2017

Co-authors: Saurabh Goyal and Janardh Bantupalli

In our previous blog post introducing Brooklin, we outlined the reasons why we created our own framework for near real-time incremental data capture from production. This framework feeds data to our larger data ingestion pipeline for the hundreds of nearline applications processing data that are distributed across multiple stores like Espresso, Oracle, MySQL, HDFS, Kafka, and more. In this blog post, we’ll show how we use Brooklin for data capture from Oracle databases and show how this approach has evolved over time.

The fundamental requirements for any data-driven organization for designing, developing, and maintaining an efficient data ingestion pipeline from online datastores to downstream consumers like HDFS include:

Data reliability
Defined service level objectives (SLOs)
High availability for disaster recovery (DR)
Operational stability
Proactive and reactive monitoring

LinkedIn’s new version of incremental data capture, as part of the overall data pipeline, incorporates all of these key requirements. It also facilitates near-real time data ETL from user-facing Oracle databases to downstream consumers (such as HDFS and data warehouse) many site-facing applications (again consuming production data from HDFS after analytics), and internal applications and ad-hoc user queries that consume production Oracle data through HDFS.

At LinkedIn, our “Version 1.0” incremental data capture pipeline from Oracle databases consisted of batch processing powered by interval-based SQLs, and feed data to Lumos pipelines. Designed more than seven years ago, it is now at the end of its life cycle. The new incremental data capture process from Oracle databases is near real-time, and has very minimal performance overhead on user-facing source databases, is reliable, has a plugin for data auditing and data replay, and has high availability for BCP. It feeds data events to an Oracle-Brooklin pipeline.

Incremental data capture in “Version 1.0”

Incremental data capture using SQL queries

Incremental capture using batch SQL queries (Version 1.0)

Incremental database transaction capture in Version 1.0 is achieved by using a combination of SQL SELECT statements with appropriate upper and lower bounds on a modified timestamp table column; queries are executed on read-only copies of production databases. The data extract process runs multiple times per day for tables, as per defined configuration, and feeds the changes to a Lumos pipeline. The last batch of extracts are run post-midnight in order to obtain data through midnight on all tables. However, this last batch is one of the bottlenecks for reliable data delivery, and often contributes to unstable SLAs.

Our Version 1.0 pipeline is an in-house developed framework written in Perl. here are its features:

Well-defined lower and upper boundaries on timestamps for each incremental extract with overlap of 10 minutes to avoid any gaps in source transactions.
Table-level granularity for setting up extracts. A single extract executes all tables in sequence and creates corresponding files.
Each extract creates table-level meta file (table structure) and dmp file (data for defined time period).
Retention for meta and dumps files is configurable; it’s set to 5 days, currently.
Since these are SQL-based pulls, they provide the flexibility to exclude columns with sensitive information (security requirement) and/or columns not required by downstream consumers.
Extract-related metadata stored in MySQL provides the capability to set up monitoring for an extract’s state and any failures.
Operational data that is gathered, like number of rows and time taken per table for each extract, is used for analysis and auditing purposes.

Despite the fact that this approach provided us with a very useful ingestion pipeline, we found that there were several missing features in this framework, which motivated us to design a new approach.

Real-time ingestion: With extensive usage of data on HDFS for analytics and serving back to the LinkedIn website, we needed more aggressive SLAs and near real-time ingestion.
High availability: Oracle production databases running on multi-colo (active-active) have a single DR site. Data-capture extracts are scheduled on the DR site to avoid any significant performance overhead in production; this results in a single point of failure scenario for extracts in the instance of any hardware or corruption issues. The new ingestion framework, by contrast, leverages transaction logs for data capture. It runs on active production systems with minimal performance impact, so it offers the ability to run in multiple data centers, thereby providing high availability.

Data capture “Version 2.0”: Near real-time incremental data capture with quality check audit

Data-capture-Version-2-Near real-time incremental data-capture-with-quality-check-audit

Components and details of Version 2.0:

1. Oracle databases as data sources: Our Oracle databases, of course!

2. Oracle GoldenGate (OGG): OGG is used for real-time data integration and replication. It’s a proprietary and licensed software by Oracle. We successfully implemented multi-colo (active-active) for Oracle databases using OGG by replicating data across the data centers.

The components of OGG are listed below:

Extract: Runs on the source database, captures DDLs/DMLs from redo logs, and writes to a trail file, typically on the source system
Pump: Reads trail files generated by Extract and sends the data operations over the network to a remote trail, typically on the target system; adds storage flexibility and isolates the primary extract process from TCP/IP activity; can perform data filtering, mapping, and conversion, or it can be configured in pass-through mode; it’s an optional component in OGG
Replicat: Process runs on the target system, reads the trail files on that system, then reconstructs the DML or DDL operations, and applies them to the target database; uses dynamic SQL to compile a SQL statement once, and then execute it many times with different bind variables

3. OGG Big Data adapter plugin: This is a Java plugin on top of OGG Replicat to stream real-time transactional data from Oracle databases to various Big Data systems like Streaming Big Data (Kafka, Flume), Batch Big Data (Hadoop/Hive), NoSQL, and Search (ElasticSearch). It’s a proprietary and licensed software by Oracle. OGG Big Data Handler for Kafka can be used for streaming real-time change-capture data from OGG trails directly to Kafka topics and can build fault-tolerant, highly reliable, and extensible real-time analytical applications. It allows users to take transactions data directly out of a database in real-time, and stream them into a Kafka environment. In turn, Kafka feeds the data to any downstream consumers, like enterprise data warehouses, or Hadoop or NoSQL environments. This facilitates ETL in real time on a stream—rather than as a batch process.

4. Kafka: Apache Kafka is a high throughput publish-subscribe messaging system. It is distributed, partitioned, replicated, fault tolerant, and also has stronger ordering of message sequence when compared to the conventional messaging systems.

5. Near real-time data audit:

Our new framework performs near-real time data validation and discrepancy detection across colos through a system that was designed and developed in-house. It continuously monitors data quality in all prod colos by leveraging incremental data from the Big Data adapter. It also reports any data discrepancies caused by replication issues that improves overall data quality.

6. Data and schema propagation:

We designed and developed an independent framework to propagate Oracle schema changes (DDLs) to Kafka. It allows near-synchronous propagation of schema changes through integration with the Oracle release framework. It also makes APIs available for on-demand invocation by downstream (e.g., Brooklin) consumers.

Reliability: Operability, scalability, metrics

The building blocks for our data capture “Version 2.0” discussed in detail above build several key capabilities to ensure the smooth running and operation of our data capture system.

Monitoring and auto-remediation: We can now capture stats/metrics and provide an alerting mechanism for any activity above expected thresholds, through our in-house tools. For instance, we can capture data for the transactions per second for each process and every component, wherever it's applicable. We are now capturing metrics for lag generated due to process failure or extra load at minute-level granularity, and generating alerts when latency exceeds the configured threshold. Stats and metrics collection is plugged in with in-house tools like Auto-Alerts, InGraphs, and Iris for efficient monitoring and alerting.

Proactive monitoring/detection of problematic load patterns: Collected metrics like (insert/update/deletes) and operations per second give us insight to understand change patterns for any table, and alerting can be configured for instances when we have deviation to allow us to take preemptive action.

Monitoring and detection of data consistency issues: Our data capture framework is now also supported by an audit process to monitor, detect, and alert for any data inconsistency issues. Data fixes that are made to fix inconsistency flow automatically to downstream systems, making data consistent everywhere.

Auto-recovery: Every component in this framework is designed and configured with auto-recovery features, which provides resilience to the system and avoids human intervention as much as possible.

Scalability through parallel processing: OGG Big Data Adapters have the capability of parallelism at table-level granularity through key-based hashing. This helps in scaling tables with very high operations and achieving real-time data replication between processes.

Conclusion

Incremental Data Capture Version 1.0 was designed 7 years ago at LinkedIn by the DBA team, and while it served us well for a long time, it ultimately works better for a single data center environment. For companies whose data processing systems are not time-bound and produce few gigabytes of transactional data per day, this version would work well and be a cost-effective solution. However, if you have terabytes of data being produced every day and your downstream data processing system depends on near real-time data and needs reliability/scalability as data grows, the Data Capture Version 2.0 system would fit better. LinkedIn has adapted the Version 2.0 system for its critical databases and soon will be rolling it out everywhere.

Acknowledgements

We would like to take time to thank the LinkedIn Database-SRE team for continuous support; kudos and special shout outs to the members of core team (Agila Devi and Srivathsan Vijaya Raghavan) for designing, developing, and implementing Data Capture Version 2.0 at LinkedIn.

Finally, thanks to LinkedIn’s Data leadership for having faith in us and their continuous guidance; thanks for all of your support.

Topics: Data Streaming/Processing Data Management Distributed Systems