Avro2TF: An open source feature transformation engine for TensorFlow
April 4, 2019
Co-authors: Xuhong Zhang, Chenya Zhang, and Yiming Ma
Today, we are announcing a new open source project called Avro2TF. This project provides a scalable Spark-based mechanism to efficiently convert data into a format that can be readily consumed by TensorFlow. With this technology, developers can improve productivity by focusing on building models rather than converting data.
Deep learning data pipelines at LinkedIn
At LinkedIn, deep learning has been successfully applied across multiple AI systems related to recommendation and search. One of the important lessons we have learned from this journey is the importance of providing good deep learning platforms that help our modeling engineers become more efficient and productive. Avro2TF is part of this effort to reduce the complexity of data processing and improve the velocity of advanced modeling. In addition to advancing deep learning techniques, LinkedIn has been sharing our innovations in machine learning (ML) for years now in a variety of areas (e.g., recommendation systems, scaling machine learning systems, etc.). We have many different ML approaches that consume a large amount of data every day for which efficiency and accuracy are of utmost priority.
To effectively support deep learning and further our vision of democratizing machine learning (as seen through projects like Pro-ML), we first had to address the data processing step. Most of the datasets used by our ML algorithms (e.g., LinkedIn’s large-scale personalization engine Photon-ML) are in Avro format. Each record in an Avro dataset is essentially a sparse vector and can be easily consumed by most modern classifiers. However, the format cannot be directly used by TensorFlow, the leading deep learning library: the main blocker is that the sparse vector is not in the format TensorFlow expects. We believe this is not a problem unique to LinkedIn. Many companies have vast amounts of ML data in a similar sparse vector format, and the tensor format is still relatively new to many of them.
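To make the mismatch concrete, consider a hypothetical Avro-style record that stores a sparse vector as a list of index/value entries, while TensorFlow's `tf.SparseTensor` constructor wants parallel `indices` and `values` arrays plus a `dense_shape`. The record layout and helper below are illustrative, not Avro2TF's actual schema; it is a minimal sketch of the kind of reshaping a converter must do for every record:

```python
# Hypothetical Avro-style sparse record: a list of {index, value} entries.
avro_record = {"features": [{"index": 3, "value": 1.0},
                            {"index": 17, "value": 0.5}]}

def to_sparse_tensor_args(record, dim):
    """Rearrange a sparse-vector record into the (indices, values, dense_shape)
    triple that tf.SparseTensor expects."""
    entries = sorted(record["features"], key=lambda e: e["index"])
    indices = [[e["index"]] for e in entries]   # one [i] row per nonzero
    values = [e["value"] for e in entries]
    return {"indices": indices, "values": values, "dense_shape": [dim]}

args = to_sparse_tensor_args(avro_record, dim=100)
# args could then be passed as tf.SparseTensor(**args).
```

Doing this per record in model code is exactly the data-processing/modeling entanglement described above; Avro2TF moves it into a Spark job instead.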
The data at LinkedIn is typically large and formatted differently from what traditional deep learning libraries expect. This presents significant challenges because many pipelines mix data processing logic with modeling logic, which limits the flexibility of constructing new deep learning models. Based on feedback from users on LinkedIn's ML vertical teams, we needed a solution focused on scalable data conversion. More specifically, we needed a solution that converted our LinkedIn data types (e.g., sparse vector, dense vector, etc.) into a deep learning format (i.e., tensors).
Avro2TF bridges this gap and presents an elegant solution for ML engineers, freeing them up to focus on deep learning algorithms. It provides a simple configuration that lets modelers obtain tensors from existing training data. Tensor data by itself is not self-contained: to be loaded into TensorFlow, it must be accompanied by metadata. Avro2TF fills this gap as well by providing a distributed metadata collection job. Inside LinkedIn, Avro2TF is an integral part of a system called TensorFlowIn that helps users easily feed data into the TensorFlow modeling process.
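As a rough illustration of what metadata collection computes, the local sketch below scans records of a sparse feature to derive its cardinality (vocabulary size), which TensorFlow needs to size one-hot or embedding layers. This is a hypothetical, single-machine stand-in; the actual Avro2TF job runs distributed on Spark, and the record and metadata field names here are assumptions:

```python
# Hypothetical sparse-feature records, as index/value entries per record.
records = [
    {"skills": [{"index": 3, "value": 1.0}]},
    {"skills": [{"index": 17, "value": 0.5}, {"index": 4, "value": 2.0}]},
]

def collect_metadata(records, feature_name, dtype="sparseVector"):
    """Scan the data to find the largest feature index, then report the
    tensor's name, dtype, and shape (cardinality = max index + 1)."""
    max_index = max(e["index"] for r in records for e in r[feature_name])
    return {"name": feature_name, "dtype": dtype, "shape": [max_index + 1]}

meta = collect_metadata(records, "skills")
```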
TensorFlowIn is a deep learning training library that is compatible with TonY, TensorFlow, and Spark. It contains end-to-end training-related utilities and frameworks. The above figure gives a high-level overview of TensorFlowIn. Since large-scale data processing is a step that is critical to many LinkedIn applications and also useful to the larger AI community, we decided to open source this engine after receiving positive internal feedback.
Avro2TF project details
Below is a quick summary of some of the implementation features for Avro2TF.
Input data requirements: We support all data formats that Spark can read, including the two most popular formats at LinkedIn: Avro and ORC. Categorical or sparse features must be represented in NTV (name-term-value) format.
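For readers unfamiliar with NTV, each entry names a feature family (`name`), a categorical value (`term`), and a numeric weight (`value`). The records and helper below are a hypothetical sketch of that shape, grouping entries by feature name the way a converter would before indexing:

```python
# Hypothetical NTV (name-term-value) entries; "value" is often 1.0 for
# one-hot categorical features.
ntv_features = [
    {"name": "industry", "term": "software", "value": 1.0},
    {"name": "industry", "term": "finance", "value": 0.0},
    {"name": "seniority", "term": "senior", "value": 1.0},
]

def group_by_name(features):
    """Group NTV entries by feature name."""
    grouped = {}
    for f in features:
        grouped.setdefault(f["name"], []).append((f["term"], f["value"]))
    return grouped

grouped = group_by_name(ntv_features)
```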
Supported data types of output tensor: In Avro2TF, the supported data types (dtype) of output tensors are: int, long, float, double, string, boolean, and bytes. We also provide a special data type, sparseVector, to represent categorical/sparse features. A sparseVector tensor type has two fields: indices and values.
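A minimal sketch of what producing a sparseVector tensor involves, under the assumption that the pipeline has already generated a term-to-index vocabulary: each NTV term is looked up to get an index, and the parallel `indices`/`values` arrays form the two fields of the tensor. The vocabulary and function names are illustrative, not Avro2TF's API:

```python
# Hypothetical term-to-index vocabulary generated from training data.
vocab = {"software": 0, "finance": 1, "healthcare": 2}

def ntv_to_sparse_vector(entries, vocab):
    """Convert (term, value) pairs into a sparseVector's two fields:
    parallel 'indices' and 'values' arrays."""
    indices, values = [], []
    for term, value in entries:
        indices.append(vocab[term])
        values.append(value)
    return {"indices": indices, "values": values}

sv = ntv_to_sparse_vector([("software", 1.0), ("healthcare", 0.5)], vocab)
```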
Avro2TF configuration: At the top level, the configuration file contains information about the tensors that will be fed to the deep learning training framework. The configuration for each specified tensor consists of two kinds of information:
Input feature information, to tell which existing feature(s) should be used to construct the tensor.
Output tensor information, including the name, dtype, and shape of the expected output tensor.
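To illustrate the two-part structure, here is a sketch of one tensor's configuration written as a Python dict. The actual Avro2TF config is a JSON file, and the field names below (`inputFeatureInfo`, `columnExpr`, `outputTensorInfo`) are assumptions chosen to mirror the two kinds of information listed above, not the project's exact schema:

```python
# Illustrative Avro2TF-style configuration; field names are assumptions.
avro2tf_config = {
    "features": [
        {
            # Input feature information: which existing column(s) feed the tensor.
            "inputFeatureInfo": {"columnExpr": "member_skills"},
            # Output tensor information: name, dtype, and shape.
            "outputTensorInfo": {"name": "memberSkills",
                                 "dtype": "sparseVector",
                                 "shape": [-1]},
        }
    ]
}
```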
Avro2TF data pipeline: This handles feature extraction; feature transformation (at LinkedIn, this is used only in limited cases not covered by Pro-ML); tensor metadata and feature mapping generation; converting strings to numerical indices; and tensor serialization.
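Of the steps above, string-to-index conversion is the easiest to sketch: build a term-to-id mapping from the training data, then replace each string with its id. The sketch below is a hypothetical local version (the real pipeline does this distributed on Spark); reserving id 0 for unseen terms is a common convention assumed here, not necessarily Avro2TF's:

```python
def build_vocab(terms):
    """Assign each distinct term an id, starting at 1; 0 is reserved
    for terms not seen during training."""
    vocab = {}
    for t in terms:
        if t not in vocab:
            vocab[t] = len(vocab) + 1
    return vocab

def to_indices(terms, vocab, unknown_id=0):
    """Replace each string with its numerical index."""
    return [vocab.get(t, unknown_id) for t in terms]

vocab = build_vocab(["software", "finance", "software", "healthcare"])
ids = to_indices(["finance", "retail"], vocab)   # "retail" is unseen
```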
Avro2TF is now open source
Following the success of using Avro2TF at LinkedIn, we have released the technology as open source software. You can find the official GitHub page for Avro2TF here.
We also released an official tutorial for Avro2TF that can be found on the project wiki page.