Gobblin Enters Apache Incubation

Abhishek Tiwari

Sr. Manager, Software Engineering at LinkedIn

January 17, 2018

Gobblin is a distributed data integration framework that simplifies common aspects of big data integration, such as ingestion, replication, organization, and lifecycle management, for both streaming and batch ecosystems.

Gobblin has been gobbling big data with ease in the open source world since December 2014. Over the years, Gobblin has evolved at a tremendous rate. It has grown from primarily being an ingestion framework for offline data that ran in MapReduce mode on Hadoop into a robust ecosystem whose capabilities span execution environments, data velocities, scale, connectors, and other ecosystem enhancements.

LinkedIn has a rich history of open source projects that go on to become part of the Apache Software Foundation, including Kafka, Samza, and Helix. Following this trend, we believe that Gobblin is now ready to join the ranks of Apache projects. So, we proposed that Gobblin be incubated under the Apache umbrella. We are very happy to share that our proposal was unanimously accepted by the Incubator Project Management Committee (PMC), and in February 2017 we started the incubation process. Since then, we’ve undergone the required internal processes and contributed the code, officially embarking on the Apache journey.

Why Apache?

The Apache Software Foundation (ASF) is one of the most influential open source organizations. Apache projects power more than 200 million websites (half the internet) and form the technical backbone of some of the most valuable companies on the planet. While Gobblin has already experienced widespread adoption by companies such as LinkedIn, Apple, and PayPal, and by research organizations like CERN and Sandia National Laboratories, we believe that becoming an Apache project will ensure self-sustenance and durability, allowing the growing community to nurture it in “The Apache Way.”

What’s next for Gobblin?

Through its internal and external community contributions, Gobblin has evolved significantly since our last blog post. In summary, here are the new exciting enhancements to Gobblin:

Multiple execution modes: Gobblin can now run in Embedded, CLI, Standalone, MapReduce, and Cluster (bare metal, AWS, and YARN) modes.
Stream and batch processing support: Gobblin’s core engine now supports batch (finite) as well as stream (infinite) data processing. In batch mode, we already work with standalone, cluster, MapReduce, Hive, and Dali, and plan to add support for Spark this year. Similarly, we will extend our native streaming capabilities with systems such as Samza and Brooklin at LinkedIn this year.
Global throttling: Gobblin now supports global throttling of resources (e.g., API quotas) in any Gobblin execution mode. This is a generic piece of infrastructure that can be used by any distributed system.
Gobblin-as-a-Service: This aims to build a PaaS (Platform-as-a-Service) for data management that encapsulates and unifies heterogeneous data movement and processing deployments (Gobblin or non-Gobblin) behind a service.

Going forward, we are committed to making major strides in the development of Gobblin, and we intend to help the community evolve and adapt to “The Apache Way.”

Since entering Apache Incubation early last year, we have already seen good progress on this front. Apache Gobblin community members have proposed, built, and started to spearhead a few critical developments within the Gobblin ecosystem. These include:

Kafka 10 support
State store enhancements
AWS mode enhancements and auto-scalability
Proposed Mesos support
Proposed enhancement to Gobblin-as-a-Service
Various new connectors
Admin UI stability and enhancements

We further invite everyone to check out Gobblin and contribute to its Apache journey.

Shout out to the community

Gobblin draws its strength from its community. So while we celebrate the Apache Incubation, we would also like to take a moment to give a quick shout out to the key members of the community outside of LinkedIn (in no particular order):

Lorand Bendig: Lorand works at Swisscom, which is a power user of Gobblin. He has been a key contributor to many enhancements around JDBC connectors, Kafka 10 support, Graphite/InfluxDB reporters, and the compaction application.
Joel Baranick: Another long-time contributor to Gobblin! His team runs Gobblin on AWS in elastic cluster mode. While being an avid user, Joel has also made major contributions to Gobblin on Yarn, Gobblin on AWS, Admin UI, and the core framework.
Tamas Nemeth: Tamas works at Prezi, and has done some remarkable work around supporting eventually consistent file systems. His team makes use of Gobblin for interesting use cases, including ingesting data from Kafka, compacting data on S3 via intermediary steps on EMR/HDFS, leveraging Hive registration to register data with the Hive metastore, and shipping logs out to AWS S3.

We would also like to thank Shankar, Tilak Patidar, Jinhyuk Chang, Chen Guo, Eric Ogren, Clemens Valiente, Andrew Hollenbach, and Akshay Nanavati, as well as 50+ other contributors for their very valuable contributions.

There are various ways to get involved in the Apache Gobblin community, including contributing features, evangelizing ideas, or simply updating the documentation. Please join our user or dev mailing lists. You can also find us on our Gitter channel.

We are hosting a Big Data Meetup on Jan. 25 at LinkedIn’s offices in San Francisco with speakers from LinkedIn, Prezi, and other Bay Area companies to talk about the exciting new developments and challenges in the data management and data integration space. Please join us there!

Acknowledgements

The Gobblin team within LinkedIn has worked relentlessly over the years to bring Gobblin to the state it is in today. The team comprises Vasanth Rajamani, Hung Tran, Abhishek Tiwari, Issac Buenrostro, Sudarshan Vasudevan, Arjun Bora, Jack Moseley, Kuai Yu, Lei Sun, Zhixiong Chen, Aditya Sharma, and Raul Agepati. Gobblin has a rich alumni legacy as well, including Chavdar Botev, Pradhan Cadabam, Ying Dai, Ziyang Liu, Sahil Takiar, Min Tu, Yinan Li, Henry Cai, Kenneth Goodhope, Narasimha Reddy, and Lin Qiao.

We would also like to extend our special thanks to Shrikanth Shankar, Suja Viswesan, Kapil Surlaker, Shirshanka Das, and Igor Perisic for their leadership, support, and encouragement.

Topics: Open Source, Data Streaming/Processing, Data