TonY joins LF AI & Data Foundation
July 15, 2021
“We’re thrilled to welcome TonY into incubation in LF AI & Data. The project offers functionalities that are not currently available by any of our hosted projects, especially its capability to act as a native connector to run machine learning jobs reliably and flexibly,” said Dr. Ibrahim Haddad, Executive Director of LF AI & Data. “We look forward to working with the project community to grow its community of users and developers under an open governance framework and with the support and enablement of the various services offered by our Foundation.”
Released by LinkedIn under the open source BSD-2 license in September 2018, TonY makes it easy for AI engineers to train distributed deep learning models on Hadoop. For the past few years, TonY has powered all of LinkedIn’s production deep learning jobs, which help create more relevant content for our 700+ million members. Externally, TonY has been adopted by companies such as iQiyi and integrated with Google Cloud.
We are also excited to announce that a peer LF AI & Data project, Horovod, is now supported in TonY, thanks to a contribution from Junfan Zhang of iQiyi. Read on for more details about this integration.
Horovod on TonY
Horovod is a popular distributed deep learning training framework for TensorFlow, PyTorch, and Apache MXNet. It makes it easy to take a single-GPU TensorFlow program and scale it across many GPUs to train faster. Horovod also achieves significantly better GPU utilization. Uber has used Horovod to support self-driving vehicles, fraud detection, and trip forecasting. It is also used by Alibaba, Amazon, LinkedIn, and NVIDIA.
Apache Hadoop YARN is Hadoop’s resource management framework, powering most workloads on Hadoop, such as MapReduce and Spark. Running Horovod on YARN saves you from having to set up a Kubernetes cluster dedicated to Horovod training jobs, and improves overall cluster utilization. We have extended TonY to support running Horovod on YARN.
A Horovod training job is composed of two roles: worker and driver. The driver is a lightweight process with no GPU requirement, and it is responsible for starting the rendezvous server. To start the rendezvous server, the driver needs to know the addresses of all workers before training begins. The workers then retrieve the information they need from the rendezvous server and start training.
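The driver/worker split can be illustrated with a minimal Python sketch. It is not Horovod's implementation: the `Slot` record, `assign_slots`, and `RendezvousTable` are hypothetical names made up for this example, and the rendezvous server is reduced to an in-memory lookup table. It only shows why the driver needs every worker address up front: ranks cannot be handed out until the full list of workers is known.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """Hypothetical record describing one worker's place in the job."""
    host: str
    rank: int        # global index across all workers
    local_rank: int  # index among the workers on the same host
    size: int        # total number of workers in the job


def assign_slots(worker_hosts):
    """Driver-side step: turn the full list of worker hosts into slots."""
    seen_per_host = defaultdict(int)
    slots = []
    for rank, host in enumerate(worker_hosts):
        slots.append(Slot(host, rank, seen_per_host[host], len(worker_hosts)))
        seen_per_host[host] += 1
    return slots


class RendezvousTable:
    """In-memory stand-in for the rendezvous server (an assumption)."""

    def __init__(self, slots):
        self._by_rank = {slot.rank: slot for slot in slots}

    def lookup(self, rank):
        # A worker asks for its own slot before it starts training.
        return self._by_rank[rank]


# Two workers share host-a, one runs on host-b.
table = RendezvousTable(assign_slots(["host-a", "host-a", "host-b"]))
print(table.lookup(2))
```

In a real job the lookup would go over the network, but the ordering constraint is the same: the table can only be built once every worker has reported where it runs.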
The TonY Horovod runtime works as follows:
1. The Horovod workers and driver are localized to TonY's task executors.
2. TonY starts all the task executors. Each one reserves an address (port) for its driver or worker and reports it to the Application Master.
3. The task executor responsible for the driver retrieves all the address information from the Application Master and starts the driver with it.
4. While the driver starts, its task executor monitors the driver's state, collects the assigned slot info, and reports it to the Application Master.
5. Once the Application Master has received the slot info report, the other task executors start all the workers.
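The ordering of this handshake can be sketched in plain Python. This is a simplified simulation under stated assumptions, not TonY code: `ApplicationMaster`, `TaskExecutor`, `run_job`, and the `driver:0` / `worker:N` task ids are all hypothetical names, everything runs in one process, and the "slot info" is just a task-to-rank mapping. The only real mechanism shown is reserving a free port by binding a socket to port 0.

```python
import socket


class ApplicationMaster:
    """Hypothetical in-memory stand-in for the Application Master."""

    def __init__(self):
        self.addresses = {}    # task id -> (host, port)
        self.slot_info = None  # filled in by the driver's task executor

    def register(self, task_id, address):
        self.addresses[task_id] = address

    def worker_addresses(self):
        return {task: addr for task, addr in self.addresses.items()
                if task.startswith("worker")}


class TaskExecutor:
    """Hypothetical task executor that reserves a port and reports it."""

    def __init__(self, task_id, am):
        self.task_id = task_id
        self.am = am
        self._sock = None

    def reserve_and_report(self):
        # Binding to port 0 lets the OS pick a free port for this task.
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._sock.bind(("127.0.0.1", 0))
        self.am.register(self.task_id, self._sock.getsockname())

    def release(self):
        self._sock.close()


def run_job(num_workers):
    am = ApplicationMaster()
    executors = [TaskExecutor("driver:0", am)]
    executors += [TaskExecutor(f"worker:{i}", am) for i in range(num_workers)]
    try:
        # Every executor reserves an address and reports it.
        for executor in executors:
            executor.reserve_and_report()
        # The driver's executor reads the worker addresses, "starts" the
        # driver, and reports the resulting slot info back.
        workers = am.worker_addresses()
        am.slot_info = {task: rank for rank, task in enumerate(workers)}
        # Workers are started only once the slot info has been reported.
        started = [task for task in workers if am.slot_info is not None]
    finally:
        for executor in executors:
            executor.release()
    return am, started


am, started = run_job(2)
print(started)
```

The point of the sketch is the dependency chain: no driver without all addresses, and no workers without the driver's slot info.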
To stay consistent with the native Horovod runner, TonY introduces a built-in Horovod driver that takes care of building the rendezvous server and is started automatically on a TonY task executor. With the built-in driver, users only need to specify the number of workers and their resources; TonY handles the rest. You can find an example of running Horovod on TonY here.