Open Source

Spark-TFRecord: Toward full support of TFRecord in Spark

Co-authors: Jun Shi, Mingzhou Zhou

Introduction

In the machine learning community, Apache Spark is widely used for data processing due to its efficiency in SQL-style operations, while TensorFlow is one of the most popular frameworks for model training. Although there are some data formats supported by both tools, TFRecord—the data format native to TensorFlow—is not fully supported by Spark. While there have been prior attempts to bridge the gap between these two systems (Spark-Tensorflow-Connector, for example), existing implementations leave out some important features provided by Spark.

In this post, we introduce and open source a new data source for Spark, Spark-TFRecord. The goal of Spark-TFRecord is to provide full support for TensorFlow's native data format in Spark, upleveling TFRecord to a first-class citizen in the Spark data source community, on par with built-in formats such as Avro, JSON, and Parquet. Spark-TFRecord provides not only simple functions, such as DataFrame read and write, but also advanced ones, such as PartitionBy. As a result, a smooth data processing and training pipeline in TFRecord becomes possible.

Both TensorFlow and Spark are widely used at LinkedIn. Spark is used in many data processing and preparation pipelines. It is also the leading tool for data analytics. As more business units employ deep learning models, TensorFlow has become the mainstream modeling and serving tool. Open source TensorFlow models mainly use the TFRecord data format, while most of our internal datasets are in Avro format. In order to use open source models, we have to either change the model source code to take Avro files, or convert our datasets to TFRecord. This project facilitates the latter.

Existing projects and prior efforts

Prior to Spark-TFRecord, the most popular tool for reading and writing TFRecord in Spark was Spark-Tensorflow-Connector. It is part of the TensorFlow ecosystem and has been promoted by Databricks, the company founded by the creators of Spark. Although it supports basic functions such as read and write, we noticed two disadvantages of its implementation for our use cases at LinkedIn. First, it is based on the RelationProvider interface. This interface is mainly for connecting Spark to a database (hence the name “connector”), in which case the disk read and write operations are provided by the database. However, the main use case of Spark-Tensorflow-Connector is disk I/O, not connecting to a database. In the absence of a database, the I/O operations have to be provided by the developers who implement the RelationProvider interface. This is why a considerable amount of code in Spark-Tensorflow-Connector is dedicated to various disk read and write scenarios.

In addition, Spark-Tensorflow-Connector lacks important functions such as PartitionBy, which splits the dataset according to a certain column. We find this function useful at LinkedIn when we need to train models for each entity, because it allows us to partition the training data by the entity IDs. Demand for this function runs high in the TensorFlow community, as well.

Spark-TFRecord fills these gaps by implementing the more versatile FileFormat interface, which is also used by other native formats such as Avro and Parquet. With this interface, all of the DataFrame and Dataset I/O APIs become automatically available to TFRecord, including the sought-after PartitionBy function. In addition, future Spark I/O enhancements are automatically available through the interface.

Design

We initially considered patching Spark-Tensorflow-Connector to obtain the PartitionBy function we needed. But after examining its source code, we realized that RelationProvider, the interface Spark-Tensorflow-Connector is built on, is designed to connect Spark to SQL databases, making it unsuitable for our purpose. There is no simple fix, because RelationProvider was never meant to provide disk I/O operations. Instead, we took a different route and implemented FileFormat, which is designed for file-based I/O. This fits our use cases at LinkedIn, where datasets are typically read from and written to disk directly, making FileFormat the more appropriate interface for those tasks.
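To make this concrete, the skeleton below shows roughly what a FileFormat-based data source looks like in Spark. It is a simplified sketch, not the actual Spark-TFRecord source: only the schema-inference and write-preparation methods of Spark's FileFormat trait are shown, with the bodies omitted.

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory}
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.types.StructType

// Sketch only: the real implementation also overrides the read path
// (buildReaderWithPartitionValues) and handles options, compression, etc.
class TFRecordFileFormat extends FileFormat with DataSourceRegister {

  // Short name registered with Spark, so users can write .format("tfrecord")
  override def shortName(): String = "tfrecord"

  // Derives a Spark StructType from the features found in the TFRecord files
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = ???

  // Returns a factory whose writers serialize InternalRow into
  // tf.train.Example records and write them out as TFRecord files
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = ???
}
```

Because partitioning and the DataFrame reader/writer APIs are driven by Spark's generic file source code paths, features such as PartitionBy come along for free once these methods are in place.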

The following diagram shows the building blocks.

Figure: Building blocks of Spark-TFRecord

Description of each block

  • Schema Inferencer: Infers the Spark schema from TFRecord data types. We reused most of these functions from Spark-Tensorflow-Connector.
  • TFRecord Reader: Reads examples from TFRecord files on disk and converts them to Spark InternalRow by calling the Deserializer.
  • TFRecord Writer: Converts Spark InternalRow to TFRecord examples by calling the Serializer, then writes them to disk. We used the writer from the TensorFlow Hadoop library for the last step.
  • TFRecord Deserializer: Converts examples to Spark InternalRow (a simplified sketch follows this list).
  • TFRecord Serializer: Converts Spark InternalRow to examples.
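For illustration, the sketch below shows the kind of conversion the Deserializer performs, turning a tf.train.Example into a Spark InternalRow. It is a minimal sketch under simplifying assumptions (only scalar long, float, and string features with a single value each) and is not the actual Spark-TFRecord code, which handles a wider range of types and error cases.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String
import org.tensorflow.example.Example

object DeserializerSketch {
  // Map each field of the Spark schema to the feature of the same name in the
  // tf.train.Example and build an InternalRow. Only scalar long/float/string
  // features are handled in this sketch.
  def exampleToInternalRow(example: Example, schema: StructType): InternalRow = {
    val features = example.getFeatures.getFeatureMap
    val values: Array[Any] = schema.fields.map { field =>
      val feature = features.get(field.name)
      field.dataType match {
        case LongType   => feature.getInt64List.getValue(0)
        case FloatType  => feature.getFloatList.getValue(0)
        case StringType => UTF8String.fromBytes(feature.getBytesList.getValue(0).toByteArray)
        case other      => throw new UnsupportedOperationException(s"$other is not covered in this sketch")
      }
    }
    new GenericInternalRow(values)
  }
}
```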

How to use Spark-TFRecord

Spark-TFRecord is fully backward-compatible with Spark-Tensorflow-Connector. Migration is easy: just include the spark-tfrecord jar file and specify the data format as “tfrecord”. The example below shows how to use Spark-TFRecord to read, write, and partition TFRecord files. More examples can be found in our GitHub repository.
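The following is a minimal usage sketch in Scala. The paths, column names, and data are illustrative; the recordType option (choosing between Example and SequenceExample records) follows the convention used by Spark-Tensorflow-Connector.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-tfrecord-example").getOrCreate()
import spark.implicits._

// Illustrative data: column names and paths are placeholders
val df = Seq((8L, 1.0f, "Alice"), (9L, 0.0f, "Bob")).toDF("id", "label", "name")

// Write TFRecord files, partitioned by the "id" column
df.write
  .mode("overwrite")
  .format("tfrecord")
  .option("recordType", "Example") // tf.train.Example records
  .partitionBy("id")
  .save("/tmp/tfrecord-output")

// Read them back into a DataFrame
val loaded = spark.read
  .format("tfrecord")
  .option("recordType", "Example")
  .load("/tmp/tfrecord-output")
loaded.show()
```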

Conclusion

Spark-TFRecord elevates TFRecord to be a first-class citizen within Spark, on par with other built-in data formats. The full set of DataFrame APIs, such as read, write, and partition, is supported by this library. Currently, we limit the schemas to those supported by Spark-Tensorflow-Connector; future work will expand support to more complex schemas.

Acknowledgements

The authors would like to thank Min Shen, Liang Tang, Fangshi Li, Jun Jia, and Leon Gao for technical discussions, and Huiji Gao for help with resources.