
Gobblin Gobbles Camus, Looks Towards the Future

We shared Gobblin with the open source community a year ago. Since then, we’ve seen increasing interest and adoption among engineers, researchers and analysts in using Gobblin to integrate data from a variety of sources into Hadoop. In previous blog posts, publications, and talks, we’ve described our motivations for building a unified ingestion framework that is extensible, modular, scalable, fault-tolerant and provides state management and data quality as built-in features. In a blog post from last year, we mentioned that we were in the process of migrating the Camus pipeline to Gobblin. Today, we are happy to announce that we have completed the migration and fully retired Camus.

We built Camus many years ago as a special-purpose pipeline to ingest Kafka data into Hadoop. For a long time, Camus was responsible for landing critical tracking datasets on HDFS. In turn, these datasets power highly important metrics, A/B tests, other tools, and analytics pipelines. Over time, however, we had more and more data sources outside of Kafka that needed to be ingested into Hadoop. Because Camus was built specifically for Kafka ingestion, we kept building additional ingestion pipelines, all of which had similar goals but different implementations, leading to significant overhead for operation, maintenance, and development of new features. This led to the birth of Gobblin, with a focus on unifying stream and batch ingestion for data.

Camus played a critical role in our design and implementation of Gobblin's Kafka adapter. Our operational experience with Camus provided us with a lot of valuable insights about how this type of data pipeline should be built and what common pitfalls should be avoided. As a result, we were able to build Kafka ingestion capability into Gobblin much faster and improve upon past mistakes.

That being said, we are very excited about the migration out of Camus. So much so that the Gobblin and Camus teams got together for a sugary celebration when it happened, because as good as Camus is, it became limiting as our needs evolved.

Gobblin is a generic data ingestion framework for publishing and managing datasets on HDFS, which can be configured for a wide variety of use cases. For more details about what Gobblin is, our motivation for building it, and the role it plays and the value it provides in LinkedIn’s data analytics infrastructure, take a look at the previous blog posts we've written.

There are many advantages to migrating from Camus to Gobblin, several of which we outline below.

Gobblin is a generic data ingestion pipeline that supports many data sources, including Kafka, relational databases, REST APIs, FTP/SFTP servers, and filers, among others. New data sources can easily be added at any time. We’ve found it much easier to operate one single tool that ingests data from a wide swath of sources than a separate tool for each source. It can get cumbersome and confusing when different tools do similar things with very different logic, class hierarchies, config properties, log messages, deployment processes, etc. Now we have one less data pipeline to operate and support.

We take extensibility very seriously in Gobblin. Every module in Gobblin is carefully designed with the right level of abstraction and clear responsibilities for each interface and class to make them easily extensible. For example, in Gobblin it is much easier to add a new column projection converter, change the partitioning scheme for the ingested data, or make the rollup run hourly instead of daily.

Today, Gobblin ingests thousands of topics, tens of billions of records, and many terabytes of data per day from our Kafka clusters. The performance of Gobblin in MapReduce mode beats Camus by a healthy margin due to an improved algorithm we built to better load balance mapper workloads. Additionally, we're currently working on a new continuous ingestion mode, which will further improve the performance of Gobblin.
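To give a feel for what balancing mapper workloads involves, here is a minimal sketch of one generic technique: greedy "largest first" assignment, where each partition goes to whichever mapper currently has the least work. This is an illustration under assumptions, not Gobblin's actual algorithm, which this post does not describe. The function names, the partition names, and the idea of using an estimated size per partition are all hypothetical.

```python
import heapq


def assign_partitions(partition_sizes, num_mappers):
    """Assign partitions to mappers, largest first, always picking the
    currently least-loaded mapper (a generic greedy heuristic).

    partition_sizes: dict of hypothetical partition name -> estimated size.
    Returns a list with one list of partition names per mapper.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")

    # Min-heap of (current_load, mapper_index): the least-loaded mapper is on top.
    heap = [(0, i) for i in range(num_mappers)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_mappers)]

    # Largest partitions first; ties broken by name so the result is deterministic.
    ordered = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        load, idx = heapq.heappop(heap)
        assignments[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return assignments


def mapper_loads(partition_sizes, assignments):
    """Total estimated size assigned to each mapper."""
    return [sum(partition_sizes[p] for p in group) for group in assignments]


if __name__ == "__main__":
    # Made-up sizes, purely for illustration.
    sizes = {"a-0": 90, "b-0": 50, "b-1": 40, "c-0": 10, "c-1": 10}
    plan = assign_partitions(sizes, num_mappers=2)
    print(plan, mapper_loads(sizes, plan))
```

The point of balancing is that a batch job finishes only when its slowest mapper does, so evening out the per-mapper load shortens the whole run. A real implementation would also need a way to estimate partition sizes before pulling, which this sketch simply takes as input.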

Our latest release, Gobblin 0.6.2, has many features that are useful for data ingestion, including Kafka ingestion, metrics and monitoring, exactly-once semantics, pluggable converters and quality checkers, handling of late data for rollup, dataset retention management, and Hive registration.

Camus was a great tool in its own right, and its designers and developers deserve kudos for developing something that could get us this far. Camus will stay open source, but we are no longer maintaining the repo, and we encourage other companies to look to Gobblin for their needs. We put several tips together for how you can migrate from Camus to Gobblin. With the complete retirement of Camus behind us, now we're focusing full steam ahead on upgrading Gobblin ahead of our next major release.

We will soon introduce many new features in Gobblin, which are important steps in evolving Gobblin from a data ingestion tool to a data lifecycle management platform. As Gobblin ingests more data and more types of data, it is starting to provide several value-added features that help us organize our data better, optimize data formats on disk, and manage global replication and retention across multiple clusters. Gobblin’s appetite is set to further increase, and our big data is about to become a lot tastier. We will discuss several of these new features in more detail in our next blog post. Stay tuned!

Special thanks to the Camus team including Gaurav Gupta, Felix GV, Henry Cai, Ken Goodhope, and Chris Riccomini for their significant contributions to Camus.

The Gobblin team is currently led by Vasanth Rajamani and comprises Chavdar Botev, Issac Buenrostro, Pradhan Cadabam, Ying Dai, Ziyang Liu, Sahil Takiar, Abhishek Tiwari, and Min Tu.

Special thanks to Shrikanth Shankar, Kapil Surlaker, Shirshanka Das, and Igor Perisic for their leadership, support and encouragement.