Open Sourcing TonY: Native Support of TensorFlow on Hadoop
September 12, 2018
LinkedIn heavily relies on artificial intelligence to deliver content and create economic opportunities for its 575+ million members. Following rapid recent advances in deep learning, our AI engineers have started adopting deep neural networks in LinkedIn’s relevance-driven products, including feeds and smart replies. Many of these use cases are built on TensorFlow, a popular deep learning framework developed by Google.
In the beginning, our internal TensorFlow users ran the framework on small and unmanaged “bare metal” clusters. But we quickly realized the need to connect TensorFlow to the massive compute and storage power of our Hadoop-based big data platform. With hundreds of petabytes of data stored on our Hadoop clusters that could be leveraged for deep learning, we needed a scalable way to process all of this information. Fortunately, TensorFlow supports distributed training, a useful technique for processing large datasets. However, orchestrating distributed TensorFlow is not a trivial task and not something that all data scientists and relevance engineers have the expertise, or desire, to do—particularly since it must be done manually. We wanted a flexible and sustainable way to bridge the gap between the analytic powers of distributed TensorFlow and the scaling powers of Hadoop.
Open sourcing TonY
To meet our needs, and because we know there are many others interested in running distributed machine learning who are also running large Hadoop deployments, we have built TensorFlow on YARN (TonY), which we are open sourcing today. Please check out the TonY project on GitHub for details on how to use it. Contributions and suggestions from the community are welcome!
In the rest of this blog post, we will cover the internal details of TonY, the features we have implemented and leveraged to scale distributed TensorFlow on Hadoop, and experimental results.
Existing solutions
In our initial investigation into running distributed TensorFlow on Hadoop, we found a few existing solutions. However, we ultimately determined that none met our particular requirements, leading to our decision to build TonY.
TensorFlowOnSpark is an open source solution that enables you to run TensorFlow on the Apache Spark computing engine. We were able to onboard a couple of our internal deep learning applications on this framework, but ran into a few issues, most notably a lack of both GPU scheduling and heterogeneous container scheduling. Also, any scheduling and application lifecycle enhancements we wanted to make in the future would have to be done in Spark, which is much more difficult than making the change in a self-contained YARN application.
TensorFlowOnYARN is another open source solution that runs as a separate library. Unfortunately, fault tolerance support and usability in this project did not fit our needs. Furthermore, this project is no longer maintained.
For these reasons, we decided to build TonY to give us complete control over the resources in our Hadoop clusters. Also, since TonY runs directly on YARN as a lightweight dependency, we can easily evolve it with both the lower-level part of the stack (YARN) and the higher-level part of the stack (TensorFlow).
How does TonY work?
Similar to how MapReduce provides the engine for running Pig/Hive scripts on Hadoop, and Spark provides the engine for running Scala code that uses Spark APIs, TonY aims to provide the same first-class support for running TensorFlow jobs on Hadoop by handling tasks such as resource negotiation and container environment setup.
Running TensorFlow on TonY on YARN
There are three main components to TonY: Client, ApplicationMaster, and TaskExecutor. This is the end-to-end process of running a TonY job:
1. The user submits TensorFlow model training code, submission arguments, and their Python virtual environment (containing the TensorFlow dependency) to the Client.
2. The Client sets up the ApplicationMaster (AM) and submits it to the YARN cluster.
3. The AM negotiates resources with YARN’s Resource Manager based on the user’s resource requirements (number of parameter servers and workers, memory, and GPUs).
4. Once the AM receives allocations, it spawns TaskExecutors on the allocated nodes.
5. TaskExecutors launch the user’s training code and wait for its completion.
6. While the user’s training code runs, TaskExecutors and the AM exchange periodic heartbeats to check liveness.
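The heartbeat check in the last step can be sketched in a few lines of Python. This is an illustrative model only: TonY itself is implemented in Java, and the class, method names, and 300-second timeout here are invented for the example.

```python
import time

# Hypothetical sketch of the AM-side liveness check; TonY's real
# implementation is in Java and these names are illustrative.
HEARTBEAT_TIMEOUT_SECS = 300  # assumed timeout, not TonY's actual default

class ApplicationMaster:
    def __init__(self, timeout=HEARTBEAT_TIMEOUT_SECS):
        self.timeout = timeout
        self.last_heartbeat = {}  # task name -> timestamp of last heartbeat

    def record_heartbeat(self, task_name, now=None):
        """Called whenever a TaskExecutor heartbeats to the AM."""
        self.last_heartbeat[task_name] = now if now is not None else time.time()

    def dead_tasks(self, now=None):
        """Tasks whose heartbeat has not been seen within the timeout."""
        now = now if now is not None else time.time()
        return [name for name, ts in self.last_heartbeat.items()
                if now - ts > self.timeout]

am = ApplicationMaster(timeout=300)
am.record_heartbeat("worker-0", now=0)
am.record_heartbeat("ps-0", now=250)
print(am.dead_tasks(now=400))  # → ['worker-0']: it missed its window
```

When a task shows up in the dead list, the AM can release its container and, as described under fault tolerance below, restart the job from the last checkpoint.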
Architecture of TonY
In addition to supporting the baseline functionality of running distributed TensorFlow jobs on Hadoop, TonY also implements various features to improve the experience of running large-scale training:
GPU scheduling. Recently, Hadoop has added native support for GPU scheduling and isolation. For users, this means they can be sure that once they receive their container allocations from Hadoop, they can reliably acquire the number of GPUs they request. TonY is also aware of GPU resources, so it is able to leverage Hadoop’s API for requesting GPU resources from the cluster.
Fine-grained resource requests. Since TonY supports requesting different entities (e.g., parameter servers and workers) as separate components, the user can make different resource requests per type. For example, your parameter servers and workers likely have different memory requirements. Or, you probably want to run training on GPUs or some other specialized hardware, but using CPUs on parameter servers is sufficient. For the user, this means more control over their application’s resource requirements, and for cluster admins, this helps avoid resource wastage of expensive hardware.
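For illustration, a per-component resource request might look like the following TonY XML configuration, giving workers more memory and a GPU each while keeping parameter servers CPU-only. The property names follow the pattern in TonY's documentation, but treat the exact keys and values as a sketch rather than a definitive reference:

```xml
<configuration>
  <!-- Workers do the heavy lifting: more memory plus a GPU each. -->
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>8g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <!-- Parameter servers are CPU-only and need less memory. -->
  <property>
    <name>tony.ps.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
</configuration>
```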
TensorBoard support. TensorBoard is a tool to make it easier to understand, debug, and optimize TensorFlow programs. Since the TensorBoard process is launched by one of the workers at a location unknown to the application on job startup, normally we would not be able to see TensorBoard from the Hadoop UI. We recently contributed code to YARN to allow us to redirect the Hadoop application’s tracking URL to point to TensorBoard, so that TensorBoard can be viewed with a single click.
Fault tolerance. TensorFlow training can take several hours or days, using a large number of machines. Therefore, a long-running TensorFlow job is more vulnerable to transient errors or preemption than short-lived jobs. TensorFlow contains fault tolerance APIs to save checkpoints to HDFS and restore training status from previously-saved checkpoints. TonY facilitates the process by providing a resilient distributed infrastructure to recover from node failures. If a worker fails to heartbeat to the AM or times out, TonY will restart the application and resume training from previous checkpoints.
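The checkpoint-and-resume pattern behind this can be illustrated with a small, self-contained Python sketch. A plain JSON file stands in for a TensorFlow checkpoint on HDFS, and all names here are hypothetical rather than TonY's actual code:

```python
import json, os, tempfile

# Illustrative sketch of checkpoint/resume; in a real job, TensorFlow
# writes checkpoints to HDFS and TonY restarts the failed application.

def load_step(ckpt_path):
    """Return the last checkpointed step, or 0 if no checkpoint exists."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            return json.load(f)["step"]
    return 0

def train(ckpt_path, total_steps, fail_at=None):
    """Run (or resume) a training loop, checkpointing every 10 steps."""
    step = load_step(ckpt_path)  # resume from the last checkpoint, if any
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        step += 1
        if step % 10 == 0:
            with open(ckpt_path, "w") as f:
                json.dump({"step": step}, f)
    return step

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_steps=100, fail_at=57)  # first attempt dies mid-run
except RuntimeError:
    pass
print(load_step(ckpt))               # → 50: last checkpoint before failure
print(train(ckpt, total_steps=100))  # → 100: restarted run finishes
```

The key point is that the restarted run picks up from the most recent checkpoint rather than step zero, so a transient failure costs at most one checkpoint interval of work.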
Experiments and results
We ran the Inception v3 model on TonY with one to eight workers, with one GPU per worker (also one execution using CPU training on eight workers), using asynchronous training. This model is a well-known deep neural network for ImageNet, a dataset containing millions of images used for training image classification models. As in the Inception v3 distributed training example, we measured time to reach 100,000 steps with a batch size of 32. The results are below:
These results are with 40G RAM / 1 CPU per worker, Tesla K80 GPUs, on RHEL 6.6, and TensorFlow 1.9. The final top-5 error rate after reaching 100,000 steps for 8 workers with GPU training was 26.3%.
Since TonY sits in the layer that orchestrates distributed TensorFlow and does not interfere with the actual execution of the TensorFlow job, we expect it to add no overhead. Indeed, we see that for GPU training, speed scales roughly linearly with the number of workers. We also see about a four-times speedup when running GPU training over CPU training, which is expected given the complexity and depth of the model.
TonY draws inspiration from Yahoo’s TensorFlowOnSpark, Intel’s TensorFlowOnYARN project, and this pull request in the TensorFlow ecosystem repo. A big shout-out to all our committers: Anthony Hsu, Arun Suresh, Chen Liang, Jonathan Hung, Keqiu Hu, and Zhe Zhang. TonY is only possible with sustained commitment from management. Zhe Zhang, Suja Viswesan, Vasanth Rajamani, and Kapil Surlaker: thank you for your unyielding support. We also want to thank our Grid SRE friends who helped us set up the deep learning cluster: Thomas Moll and Tu Tran. Folks from our Machine Learning & Algorithms team: Alex Bain, Bee-chung Chen, Daniel Galvez, Florian Raudies, Mingzhou Zhou, Wei Lu, Xuhong Zhang, Yen-Jung Chang, and Yiming Ma, thank you for providing extremely valuable user feedback. Folks from the Hadoop community: Wangda Tan, Yanbo Liang, and Zian Chen, thank you for offering great help. Lastly, a shout-out to our logo designer, Clyde Higaki, who drew the awesome TonY logo.