The Present and Future of Apache Hadoop: A Community Meetup at LinkedIn

Erik Krogen

Offline Compute Infra @ LinkedIn (ask me about Trino & Spark!)

February 21, 2019

On January 30, Hadoop developers gathered at LinkedIn’s offices in Mountain View to share their latest work, with presentations by engineers from community members like Microsoft, Cloudera, Uber, and of course, LinkedIn. The Hadoop community is the lifeblood of the project, and the strong community spread across numerous companies and countries is what has kept it relevant for well over a decade. As we say here at LinkedIn, relationships matter. Open source projects like Hadoop are a perfect example of how the bonds among an otherwise loosely-connected set of developers can be used to the mutual benefit of all parties involved. Having such a large and diverse community means that meetups, both formal and informal, are essential to keeping everyone on the same page and avoiding duplicate or conflicting work. The community response to our meetup exceeded expectations, with attendees from over 20 different companies and universities, and a few flying in from halfway across the country!

LinkedIn relies heavily on Hadoop for our offline data infrastructure needs (see our many previous blog posts on the subject), and we continue to deepen our involvement in the community at large—our team now has four members of the Hadoop Project Management Committee and two additional committers, both of whom obtained their committership through their work on the project while at LinkedIn. Thus, when the community started discussing another meetup, we were eager to volunteer to host. We appreciate the opportunity to be able to give back to the community in ways beyond our development contributions, and were very happy to be able to strengthen bonds with old friends and create connections with many new ones.

TonY: TensorFlow on YARN and beyond

The day started with LinkedIn's very own Jonathan Hung and Anthony Hsu discussing TensorFlow on YARN, or TonY, our home-grown and recently open-sourced solution for distributed deep learning via TensorFlow on top of YARN. They discussed its architecture and implementation, as well as future goals, such as support for additional runtimes like PyTorch. You can view their slides here and a recording of their presentation here.

Hadoop encryption: KMS and column-level encryption

The TonY folks were followed by Wei-Chiu Chuang of Cloudera, who spoke to us about HDFS encryption. As the breadth of Hadoop expands and new data protection rules like GDPR are put in place, there is increasing pressure for Hadoop administrators to encrypt their data. This is placing more demand on the Key Management Server (KMS), a component of Hadoop that stores and serves encryption keys. Wei-Chiu spoke to us about how to deal with this growing load, and about the state of encryption in the Hadoop ecosystem in general. You can view his slides here and a recording of his presentation here.

Wei-Chiu split his presentation slot with Xinli Shang of Uber, who spoke to us about schema-controlled column-level access control via encryption. Many data records contain fields which do not need to be encrypted, alongside some which hold more sensitive information and do need to be encrypted. Xinli spoke to us on how to use Apache Parquet, a common big data file format, in combination with the Hadoop KMS to encrypt only the columns that contain sensitive information. In addition to reducing encryption overhead, this provides an easy way to control which users and applications have access to sensitive data by controlling access to the encryption keys. You can view his slides here and a recording of his presentation here.
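To make the idea of access control through keys concrete, here is a minimal Python sketch. It is not Uber's implementation and uses neither Parquet nor the real Hadoop KMS: `MockKms`, `toy_cipher`, the schema layout, and the user names are all hypothetical, and the XOR-based cipher is an insecure stand-in for a real one. It only illustrates that a reader sees a sensitive column if, and only if, the key service hands over that column's key.

```python
import hashlib


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Insecure stand-in for a real cipher: XOR with a hash-derived pad.

    Applying it twice with the same key returns the original bytes.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class MockKms:
    """Hypothetical key service: returns a key only to authorized users."""

    def __init__(self, keys, acls):
        self._keys = keys  # key name -> key bytes
        self._acls = acls  # key name -> set of authorized users

    def get_key(self, user, key_name):
        if user not in self._acls.get(key_name, set()):
            raise PermissionError(f"{user} may not use key {key_name}")
        return self._keys[key_name]


def encrypt_record(record, schema, kms, user):
    """Encrypt only the columns that the schema maps to a key name."""
    stored = {}
    for column, value in record.items():
        raw = value.encode("utf-8")
        key_name = schema.get(column)
        if key_name:
            raw = toy_cipher(kms.get_key(user, key_name), raw)
        stored[column] = raw
    return stored


def read_record(stored, schema, kms, user):
    """Return the columns this user can read; skip ones whose key is denied."""
    visible = {}
    for column, raw in stored.items():
        key_name = schema.get(column)
        if key_name:
            try:
                raw = toy_cipher(kms.get_key(user, key_name), raw)
            except PermissionError:
                continue
        visible[column] = raw.decode("utf-8")
    return visible
```

With a schema such as `{"trip_id": None, "city": None, "rider_email": "pii-key"}`, a user listed in the ACL for `pii-key` reads all three columns, while any other user gets back only `trip_id` and `city`.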

HDFS scalability and consistent reads from standby nodes

This presentation was given jointly by members of the big data teams at LinkedIn and Uber.

Chen Liang of LinkedIn opened by telling us about a feature recently introduced into HDFS, the storage system for Hadoop, that allows for increased scalability. In HDFS, a single metadata server, known as the NameNode, serves all of the client requests for information about and updates to the file system. To keep the system running with high availability, there are replicas of the NameNode known as Standby NameNodes, which maintain an up-to-date view of the file system, ready to take over in case of failures. The new feature Chen told us about enables a client to read metadata from a Standby NameNode, which greatly increases the number of read requests an HDFS cluster can serve.
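One way to picture how reads from a standby can stay consistent is for each client to remember the ID of the latest transaction it has seen, and for the standby to answer only once it has applied at least that many edits. The Python sketch below is a simplified model under that assumption, not the HDFS code. The `Cluster` and `Client` classes and the fallback to the active node are illustrative choices; a real implementation might instead make the read wait until the standby catches up.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy metadata server: a namespace plus the ID of the last applied edit."""
    namespace: dict = field(default_factory=dict)
    applied_txid: int = 0

    def apply_edit(self, txid, path, value):
        self.namespace[path] = value
        self.applied_txid = txid


class Cluster:
    """One active NameNode and one standby that replays edits with a lag."""

    def __init__(self):
        self.active = NameNode()
        self.standby = NameNode()
        self._pending = []  # edits the standby has not applied yet

    def write(self, path, value):
        txid = self.active.applied_txid + 1
        self.active.apply_edit(txid, path, value)
        self._pending.append((txid, path, value))
        return txid

    def standby_catch_up(self):
        for edit in self._pending:
            self.standby.apply_edit(*edit)
        self._pending.clear()


class Client:
    def __init__(self, cluster):
        self.cluster = cluster
        self.last_seen_txid = 0  # newest state this client has observed

    def write(self, path, value):
        self.last_seen_txid = self.cluster.write(path, value)

    def read(self, path):
        standby = self.cluster.standby
        # Serve from the standby only if it is at least as fresh as
        # everything this client has already seen.
        if standby.applied_txid >= self.last_seen_txid:
            return standby.namespace.get(path), "standby"
        return self.cluster.active.namespace.get(path), "active"
```

In this model a client that has just written a file reads it back from the active node until `standby_catch_up()` runs, after which the same read is served by the standby. The client never observes a state older than one it has already seen.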

Next, Chao Sun of Uber spoke to us about Uber's experience running this feature in production. He described the issues they had to fix along the way and the huge reductions in request latency they saw even as they served a 20 percent increase in traffic.

Finally, Konstantin Shvachko of LinkedIn spoke to us about long-term plans for the scalability of HDFS. Beyond the new read-from-standby feature Chen discussed, Konstantin told us about plans to partition the file system within the NameNode's memory to allow for greater concurrency of update operations, and described some of his initial work exploring this direction. He closed by sharing the long-term goal of a fully distributed and horizontally-scalable NameNode.

You can view their slides here and a recording of their presentation here.

HDFS router-based federation and storage tiering

Next up, Ekanth Sethuramalingam and CR Hota of Uber teamed up to tell us about Uber's HDFS scaling efforts, centered around router-based federation and storage tiering. Router-based federation describes a feature in which multiple HDFS clusters are hidden behind a single "router" that directs client requests to different clusters depending on which data they are trying to access. CR told us about how Uber has used this to increase their scale, while simplifying configuration management. Ekanth then walked us through a new storage technique enabled by the use of router-based federation that arranges storage in tiers of "hot" and "warm" data to be stored in different clusters. "Hot," or very frequently accessed, data is stored on fairly normal server-class machines, while "warm," or less frequently accessed, data is stored on very dense machines that have a disproportionately high amount of storage capacity in comparison to their compute and memory. This allows for lower-cost storage of larger volumes of data. You can view their slides here and a recording of their presentation here.
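The routing idea can be sketched in a few lines of Python. This is an illustrative model only, not Uber's setup or the HDFS router code: the `Router` class, the mount table contents, and the cluster names are hypothetical. It assumes the router picks a cluster by the longest matching path prefix, which is also a natural way to send a "warm" directory tree to a dense-storage cluster.

```python
class Router:
    """Toy router: maps a path to a cluster by longest matching mount prefix."""

    def __init__(self, mount_table):
        # Check longer prefixes first so the most specific mount wins.
        self._mounts = sorted(
            mount_table.items(), key=lambda kv: len(kv[0]), reverse=True
        )

    def resolve(self, path):
        for prefix, cluster in self._mounts:
            # Match whole path components only, so "/data/archive"
            # does not capture "/data/archived".
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                return cluster
        raise LookupError(f"no mount point covers {path}")


# Hypothetical mount table: recent data on a "hot" cluster,
# archived data on a dense "warm" cluster.
router = Router({
    "/": "hot-cluster",
    "/user": "hot-cluster-2",
    "/data/archive": "warm-cluster",
})
```

Clients talk only to the router, so moving a directory from the hot tier to the warm tier becomes a change to the mount table rather than a change to every client's configuration.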

Overview of Ozone

After a lunch break, we heard from Anu Engineer of Cloudera, who spoke to us about Ozone. This is a new storage project within Hadoop that is closely related to HDFS, except that it is an object store, with semantics similar to those of Amazon's S3 rather than those of a file system. This allows for enormous scalability gains: Anu told us that the first generally available release of Ozone will support at least 10 billion objects! He gave us a breakdown of where Ozone has been, and where it's going. You can view his slides here and a recording of his presentation here.

Dynamometer and a case study in NameNode GC

Next up was a presentation by none other than yours truly regarding Dynamometer, a system we built and open sourced for testing the scale and performance of HDFS. I talked about its goals and architecture before going into a discussion of how we were able to use it to tune the garbage collection characteristics of our NameNode. I went into detail about various garbage collection tuning parameters, and what we discovered about the applicability of various garbage collection algorithms to the unique challenges of the NameNode. You can view my slides here and a recording of my presentation here.

Hadoop on Azure

Next up, Microsoft's Íñigo Goiri presented on running Hadoop on Azure, Microsoft's cloud computing platform. He spoke to us about some of Azure’s features that make running Hadoop easy, such as integration with their storage systems, an internal DNS registry, and more. He also shared with us an exciting feature that allows for low-cost Hadoop clusters by utilizing spare Azure capacity, and explained the changes made to Hadoop to allow the system to fully leverage this capability. You can view his slides here and a recording of his presentation here.

Mounting remote stores in HDFS

Last but certainly not least, Microsoft’s Virajith Jalaparti and Ashvin Agrawal discussed the evolution of the "provided storage" feature in HDFS, which allows HDFS clients to transparently access external storage systems (such as Azure Data Lake Storage or Amazon S3). They described a mechanism whereby the NameNode would "mount" an external store as part of its own namespace, and clients would be able to access the data as if it resided on HDFS itself. The DataNodes, which normally store the data in HDFS, would transparently fetch the data from the remote store and serve it back to the client. They were even brave enough to give us a live demo! You can view their slides here and a recording of their presentation here.
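The mount-and-fetch flow described above can be modeled roughly as follows. This Python sketch is a loose illustration under stated assumptions, not the HDFS implementation: `RemoteStore`, `ToyNameNode`, `ToyDataNode`, and the `("local" | "provided", reference)` location format are all made up for the example.

```python
class RemoteStore:
    """Stand-in for an external object store."""

    def __init__(self, objects):
        self._objects = objects  # object key -> bytes

    def get(self, key):
        return self._objects[key]


class ToyNameNode:
    """Keeps the namespace: path -> (kind, reference)."""

    def __init__(self):
        self.files = {}

    def add_local(self, path, block_id):
        self.files[path] = ("local", block_id)

    def mount(self, mount_point, remote_keys):
        # Expose each remote object as a path under the mount point.
        base = mount_point.rstrip("/")
        for key in remote_keys:
            self.files[f"{base}/{key}"] = ("provided", key)

    def locate(self, path):
        return self.files[path]


class ToyDataNode:
    """Serves local blocks directly and fetches provided ones on demand."""

    def __init__(self, remote):
        self.local_blocks = {}
        self._remote = remote

    def read(self, location):
        kind, ref = location
        if kind == "local":
            return self.local_blocks[ref]
        return self._remote.get(ref)


def client_read(namenode, datanode, path):
    """The client code path is the same for local and mounted data."""
    return datanode.read(namenode.locate(path))
```

The point of the sketch is that `client_read` never needs to know whether the bytes came from a local block or from the remote store, which mirrors the transparency the speakers described.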

Breakout sessions

Following all of our planned presentations, we held informal "birds of a feather" discussions about topics pertinent to the Hadoop community at large.

One session discussed the management of Hadoop releases, in particular the 2.X release series as opposed to the 3.X release series. Major version upgrades in Hadoop can be painful, and many large operators are wary of upgrading from Hadoop 2 to 3. There is some support in the community for a "bridge" release, or a final release on the Hadoop 2 release line before taking the plunge on a major version upgrade.

Another session discussed Java versioning. Previously, the stance of the Hadoop community was that Java version upgrades would always be accompanied by a Hadoop major version upgrade; for example, Hadoop 2 supports Java 7 and above, while Hadoop 3 only supports Java 8 and above. However, given the changes in Oracle's release and support roadmap to a much more rapid release cycle, the Hadoop community must adapt its policies. We discussed the likely need to drop support for older Java versions in minor, rather than major, releases of Hadoop.

Another major topic of discussion was the future of Ozone. There were deep dives into various portions of Ozone's architecture, and in-depth discussions of how various frameworks such as Apache Spark, Apache Impala, and Presto would work on top of Ozone. Finally, there were discussions of its release timelines, and how erasure coding functionality, a recent addition to HDFS, could be supported in Ozone as well.

Acknowledgments

All of us here at LinkedIn were thrilled to be a part of the engaged community present at this meetup. Thanks to all of our speakers and participants for making this a fun and fruitful event. We're greatly looking forward to the next one!

This meetup couldn't have happened without the support of our amazing events staff here at LinkedIn. I owe great thanks to our media technician, Francisco Zamora, and to the rest of the catering and event services professionals who helped us out!

Topics: Culture, Open Source, Data