Heterogeneous data sources and sinks are a reality in all modern enterprises. Ingesting and moving relevant data is an operational necessity for gaining insights that provide deep competitive advantage. At LinkedIn, we historically operated over 15 flows to load, transform, and transport data that was then consumed by numerous groups in our organization. This resulted in significant data quality, metadata management, development, and operational challenges. The limitations of implementing and operating numerous custom-designed pipelines were the genesis for building Gobblin.
Gobblin solves this problem by:
- Focusing on generality
- Providing an extensible set of data sources that can be grown as required (see the sketch after this list)
- Providing end-to-end metrics that enable continuous data validation
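To make the first two points concrete, the sketch below shows the general shape of a pluggable source: the framework plans work units and pulls records through a per-unit extractor, so supporting a new system means implementing only these two pieces. The interface and class names here are illustrative simplifications, not Gobblin's actual API.

```java
import java.util.Iterator;
import java.util.List;

/** Pulls records of type D for a single unit of work. */
interface RecordExtractor<D> extends Iterator<D>, AutoCloseable { }

/** A pluggable data source: plans work units and opens an extractor per unit. */
interface DataSource<D> {
    List<String> planWorkUnits();                  // e.g. one unit per topic, table, or partition
    RecordExtractor<D> openExtractor(String unit); // pull records for one work unit
}

/** Hypothetical connector for a fictional REST endpoint. */
class RestSource implements DataSource<String> {
    @Override public List<String> planWorkUnits() {
        return List.of("orders", "users");          // fictional dataset names
    }
    @Override public RecordExtractor<String> openExtractor(String unit) {
        Iterator<String> records = List.of(unit + ":record-1", unit + ":record-2").iterator();
        return new RecordExtractor<String>() {
            @Override public boolean hasNext() { return records.hasNext(); }
            @Override public String next() { return records.next(); }
            @Override public void close() { /* release any connections held for this unit */ }
        };
    }
}

/** Drives the connector the way a generic ingestion framework would. */
public class ConnectorDemo {
    public static void main(String[] args) throws Exception {
        DataSource<String> source = new RestSource();
        for (String unit : source.planWorkUnits()) {
            try (RecordExtractor<String> extractor = source.openExtractor(unit)) {
                while (extractor.hasNext()) {
                    System.out.println(extractor.next()); // a real flow would validate and write here
                }
            }
        }
    }
}
```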
Since it was open-sourced in February 2015, we have replaced the vast majority of our custom flows with Gobblin flows. Most recently, we retired Camus, our dedicated flow for ingesting Kafka data, in favor of Gobblin-Kafka. With this change, Gobblin now ingests the largest volume of data within LinkedIn.
Over the last year, we have enhanced Gobblin to add the following features:
- Distcp in Gobblin: enhanced distcp-like functionality that is dataset aware and provides features such as flow isolation and prioritization across datasets.
- Enhanced configuration management: properties can be specified in a dataset- or cluster-overridable manner, giving flexibility in ingestion and data movement (a sketch follows this list).
- Exactly-once semantics: Gobblin now provides exactly-once delivery in addition to the previously supported at-least-once semantics.
- Integration with systems such as YARN for continuous data processing.
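As a rough illustration of the configuration item, dataset- or cluster-level overrides can be thought of as layered properties where the most specific scope wins. The keys and resolution order below are hypothetical and only meant to convey the idea, not Gobblin's actual configuration system.

```java
import java.util.Properties;

/** Illustrative layered configuration: dataset overrides cluster, which overrides global defaults.
 *  The keys and precedence shown are hypothetical, not Gobblin's actual config system. */
public class LayeredConfigDemo {

    /** Resolve a key by scope, most specific first: dataset > cluster > global. */
    static String resolve(String key, Properties global, Properties cluster, Properties dataset) {
        return dataset.getProperty(key, cluster.getProperty(key, global.getProperty(key)));
    }

    public static void main(String[] args) {
        Properties global = new Properties();
        global.setProperty("writer.format", "avro");   // default for every flow
        global.setProperty("copy.priority", "normal");

        Properties cluster = new Properties();
        cluster.setProperty("writer.format", "orc");   // this cluster prefers a different format

        Properties dataset = new Properties();
        dataset.setProperty("copy.priority", "high");  // this dataset is latency-sensitive

        System.out.println(resolve("writer.format", global, cluster, dataset)); // prints "orc"
        System.out.println(resolve("copy.priority", global, cluster, dataset)); // prints "high"
    }
}
```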
We continue to see robust open-source engagement with and adoption of Gobblin, and we are committed to nurturing it.
In 2017, Gobblin was accepted by the Apache Software Foundation as an Incubator project.