Heterogeneous data sources and sinks are a reality in all modern enterprises. Ingesting and moving relevant data is an operational necessity for gaining insights that provide deep competitive advantage. At LinkedIn, we historically operated over 15 kinds of flows to load, transform, and transport data consumed by numerous groups in our organization. This resulted in significant data quality, metadata management, development, and operational challenges.
The challenges of implementing and operating numerous custom-designed pipelines were the genesis for building Gobblin, which addresses them by:
- Focusing on generality.
- Providing a set of data source connectors that can be extended as required.
- Providing end-to-end metrics that enable continuous data validation.
Since open sourcing Gobblin in 2015, we have replaced the vast majority of our custom flows with Gobblin flows. One example is retiring Camus, a dedicated Kafka ingestion flow, in favor of Gobblin-Kafka. With this change, Gobblin now ingests the largest volume of data at LinkedIn.
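A Gobblin ingestion job like the Kafka flow mentioned above is driven by a job configuration file. The sketch below is illustrative only: the property names follow Gobblin's documented Kafka example, but the values (broker address, topic name, output format) are placeholders, not LinkedIn's actual settings.

```properties
# Hypothetical Gobblin-Kafka job config sketch; values are placeholders.
job.name=KafkaIngestExample
job.group=GobblinKafka

# Pull records from Kafka using Gobblin's simple Kafka source.
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=org.apache.gobblin.extract.kafka
kafka.brokers=localhost:9092
topic.whitelist=events_topic

# Write each topic's records out as files and publish them on success.
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```

The same source/writer/publisher structure applies to other connectors, which is what makes a single framework able to subsume many previously separate flows.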
Over the years, we continued to enhance Gobblin by adding the following features:
- Distcp in Gobblin: Dataset-aware, distcp-like functionality with features such as flow isolation and prioritization across datasets.
- Enhanced configuration management that allows properties to be overridden per dataset or per cluster, giving flexibility in ingestion and data movement.
- Exactly-once semantics: Enhanced Gobblin to provide exactly-once delivery in addition to the previously supported at-least-once semantics.
- Integration with resource managers like YARN for continuous data processing.
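Exactly-once delivery in pipelines of this kind is commonly built on staging output, publishing it atomically, and advancing a committed watermark so re-delivered batches become no-ops. The class below is a generic illustration of that pattern under those assumptions, not Gobblin's actual API; all names in it are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;

// Generic sketch of staging-then-atomic-publish with a watermark,
// the pattern exactly-once delivery builds on. Not Gobblin's API.
public class ExactlyOncePublish {
    private final Path stagingDir;
    private final Path outputDir;
    private final Path watermarkFile; // records the last committed batch offset

    public ExactlyOncePublish(Path stagingDir, Path outputDir, Path watermarkFile) {
        this.stagingDir = stagingDir;
        this.outputDir = outputDir;
        this.watermarkFile = watermarkFile;
    }

    /** Returns true if the batch was published, false if it was already committed. */
    public boolean publish(long batchOffset, String fileName, byte[] data) throws IOException {
        // Batches at or below the committed watermark were already published,
        // so a retried delivery is skipped rather than duplicated.
        if (readWatermark().orElse(-1L) >= batchOffset) {
            return false;
        }
        Files.createDirectories(stagingDir);
        Files.createDirectories(outputDir);

        // 1. Write to a staging location; a crash here leaves no visible output.
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, data);

        // 2. Atomically move the finished file into the output directory.
        Files.move(staged, outputDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);

        // 3. Advance the watermark only after the data is visible.
        Files.write(watermarkFile, Long.toString(batchOffset).getBytes());
        return true;
    }

    private Optional<Long> readWatermark() throws IOException {
        if (!Files.exists(watermarkFile)) {
            return Optional.empty();
        }
        return Optional.of(Long.parseLong(new String(Files.readAllBytes(watermarkFile)).trim()));
    }
}
```

If the process fails between steps 2 and 3, the batch is re-delivered and the file is overwritten with identical content, which is why the publish step must be idempotent as well as atomic.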
We continue to see robust open source engagement and adoption of Gobblin, and in 2017 Gobblin was accepted into the Apache Software Foundation's Incubator.