MiGz for Compression and Decompression
February 20, 2019
Compressing and decompressing files with GZip normally uses a single thread. For large files, this can bottleneck dependent tasks like data processing, data analysis, and machine learning. Although there are several alternatives supporting multithreaded compression, such as pigz (a command-line tool) and ParallelGZip (a Java library), to our knowledge no Java library (for any compression format) supports multithreaded decompression.
At LinkedIn, we routinely work with data ranging in size from several gigabytes to many terabytes; compression alleviates the problem of relatively slow network or disk I/O by reducing the number of bytes that must be transferred. However, the bottleneck often then becomes the CPU as it performs the single-threaded compression and decompression.
Consequently, we’ve developed MiGz, a multithreaded, GZip-compatible compression and decompression utility available as both a Java library and cross-platform command-line tools, which we are now pleased to release as an open source project. MiGz satisfies our three key design goals:
Platform-independent Java library and command-line tools: MiGz has no native dependencies; rather, it leverages the DEFLATE algorithm implemented natively by the JVM to achieve native code compression/decompression performance.
Ubiquitous compatibility across platforms and languages: Data compressed by MiGz with multiple threads can be read by any other GZip decompressor (on a single thread), including those available in Python, Linux, macOS, .NET, etc.
Fast and effective multithreaded compression and decompression.
MiGz uses the GZip format, which has widespread support and offers both fast speed and a high compression ratio. MiGz'ed files are also entirely valid GZip files: they can be decompressed on a single thread by any GZip utility or library, or decompressed in parallel by the MiGz decompressor (note, however, that GZip files not created by MiGz cannot be decompressed in parallel).
To use MiGz in your Java application, please obtain the code on our GitHub page or get the precompiled JAR via these Maven coordinates: “com.linkedin.migz:migz:1.0.0”.
Then, to compress and decompress data, simply use com.linkedin.migz.MiGzOutputStream and com.linkedin.migz.MiGzInputStream just as you would java.util.zip.GZIPOutputStream and GZIPInputStream, respectively.
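Assuming the migz artifact is on your classpath, a minimal in-memory round trip might look like the following sketch. The single-argument constructors shown here are an assumption based on the GZIPOutputStream/GZIPInputStream analogy; consult the GitHub README for the exact signatures and for thread-count and block-size options.

```java
import com.linkedin.migz.MiGzInputStream;
import com.linkedin.migz.MiGzOutputStream;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class MiGzRoundTrip {
    public static void main(String[] args) throws IOException {
        byte[] original = "some data worth compressing".getBytes(StandardCharsets.UTF_8);

        // Compress: the stream fans blocks out to worker threads internally.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (MiGzOutputStream mzos = new MiGzOutputStream(compressed)) {
            mzos.write(original);
        }

        // Decompress: blocks are inflated in parallel as well.
        ByteArrayOutputStream decompressed = new ByteArrayOutputStream();
        try (MiGzInputStream mzis =
                 new MiGzInputStream(new ByteArrayInputStream(compressed.toByteArray()))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = mzis.read(buf)) != -1) {
                decompressed.write(buf, 0, n);
            }
        }
        // decompressed now holds the original bytes
    }
}
```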
If you’d prefer to use the MiGz command-line tools, the utilities mzip and munzip are simple to use, compressing and decompressing data provided on stdin to stdout, respectively (they are not drop-in replacements for the Linux gzip/gunzip utilities).
How MiGz works
The GZip file format allows for data to be compressed in multiple blocks (“members”), each with its own header and footer metadata. MiGz takes advantage of this by partitioning the original data into multiple fixed-size sections (except for the final section) and using multiple threads to compress each of these independently. The compressed blocks are then written in the proper order to the underlying stream. When decompressing, MiGz likewise decompresses each (compressed) block of data in multiple threads, and stitches the decompressed data back together to provide the caller with a coherent, properly-ordered stream using the SequentialQueue class in the concurrentli library.
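The compression side of this scheme can be sketched with nothing but the JDK: because a GZip file may consist of multiple concatenated members, compressing fixed-size chunks independently on a thread pool and writing the results in their original order yields output that any standard GZip decompressor can read. This is a simplified illustration of the idea, not MiGz's actual implementation — it buffers everything in memory and omits the EXTRA field that enables parallel decompression.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ParallelGzipSketch {
    // Compress one chunk as a complete, standalone GZip member.
    static byte[] gzipMember(byte[] data, int off, int len) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data, off, len);
        }
        return bos.toByteArray();
    }

    static byte[] compress(byte[] input, int blockSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<byte[]>> blocks = new ArrayList<>();
        for (int off = 0; off < input.length; off += blockSize) {
            final int start = off;
            final int len = Math.min(blockSize, input.length - off);
            blocks.add(pool.submit(() -> gzipMember(input, start, len)));
        }
        pool.shutdown();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Future<byte[]> block : blocks) {
            out.write(block.get()); // concatenate members in their original order
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[1 << 20];
        new Random(42).nextBytes(data);
        byte[] gz = compress(data, 128 * 1024, 4);

        // A standard single-threaded decompressor reads the concatenated members:
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                restored.write(buf, 0, n);
            }
        }
        if (!Arrays.equals(data, restored.toByteArray())) {
            throw new AssertionError("round trip failed");
        }
    }
}
```

MiGz itself streams rather than buffering the whole input, and additionally records each member's compressed size in its header so that the reader can split work across threads.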
GZip does not normally write the compressed size of each block in its header, so finding the position of the next block requires decompressing the current one, precluding multithreaded decompression. Fortunately, GZip supports additional, custom fields known as EXTRA fields. When writing a compressed file, MiGz adds an EXTRA field with the compressed size of the block; this field will be ignored by other GZip decompressors, but MiGz uses it to determine the location of the next block without having to decompress the current block. By reading a block, handing it to another thread for decompression, reading the next block and repeating, MiGz is able to decompress the file in parallel.
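To make the header layout concrete, here is a sketch of writing and reading such an EXTRA field with plain byte manipulation, following the GZip (RFC 1952) header format. The subfield ID bytes ('M', 'Z') and the assumption that the size subfield comes first are illustrative guesses, not MiGz's actual wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class GzipExtraField {
    static final int FEXTRA = 0x04;

    // Write a minimal GZip member header carrying a 4-byte compressed-size subfield.
    // The subfield ID bytes ('M', 'Z') are hypothetical; MiGz's actual IDs may differ.
    static byte[] headerWithSize(int compressedSize) throws IOException {
        ByteArrayOutputStream h = new ByteArrayOutputStream();
        h.write(0x1f); h.write(0x8b);      // magic number
        h.write(8);                        // CM = DEFLATE
        h.write(FEXTRA);                   // FLG: EXTRA field present
        h.write(new byte[4]);              // MTIME = 0
        h.write(0);                        // XFL
        h.write(0xff);                     // OS = unknown
        h.write(8); h.write(0);            // XLEN = 8, little-endian (SI1+SI2+LEN+4 data bytes)
        h.write('M'); h.write('Z');        // subfield ID (hypothetical)
        h.write(4); h.write(0);            // subfield LEN = 4, little-endian
        for (int i = 0; i < 4; i++) {      // compressed size, little-endian
            h.write((compressedSize >>> (8 * i)) & 0xff);
        }
        return h.toByteArray();
    }

    // Recover the compressed size, letting a reader skip straight to the next
    // member without inflating this one.
    static int readSize(byte[] header) {
        if ((header[3] & FEXTRA) == 0) {
            throw new IllegalArgumentException("no EXTRA field");
        }
        int p = 12;                        // skip fixed header (10 bytes) + XLEN (2 bytes)
        p += 4;                            // skip SI1, SI2, LEN (assumes size subfield is first)
        int size = 0;
        for (int i = 0; i < 4; i++) {
            size |= (header[p + i] & 0xff) << (8 * i);
        }
        return size;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readSize(headerWithSize(123456))); // prints 123456
    }
}
```

Note that the EXTRA field adds XLEN (2 bytes) plus the 8-byte subfield, matching the 10 extra bytes shown in the figure above.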
GZip header as written by MiGz. To write the size of the compressed data, MiGz adds an “EXTRA” field to the header, consuming an additional 10 bytes (colored in red).
Relation to other GZip variants
GZinga and BGZip also compress data into multiple GZip blocks, similar to MiGz, but do so for the purposes of enabling random access and splitting (e.g., Hive jobs) rather than faster compression and decompression in a single process. Both approaches use an index to translate a desired position in the uncompressed data into the location of the corresponding compressed block in the GZip'ed file and the appropriate offset in that (uncompressed) block. GZinga stores this index in the COMMENT field of a final, “empty” (no payload data) block, whereas BGZip uses a separate index file.
Interestingly, BGZip uses an EXTRA field to record the size of the compressed data just as MiGz does (to allow an index to be built efficiently without having to decode every block), and this allows for multithreaded decompression. However, because BGZip limits each block to 64KB of data (uncompressed), BGZip-compressed files will tend to be larger than those compressed with MiGz (which uses a 512KB block size by default). Smaller blocks are less efficient because each block is compressed independently (GZip is a dictionary-based compression scheme and the accumulated dictionary is forgotten when moving to the next block) and each has its own header and footer for metadata; in informal testing, we found that our compressed files started to become noticeably larger once the block size dropped below ~100KB.
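The cost of small blocks is easy to observe with the JDK alone: compress the same repetitive input as independent GZip members at two block sizes and compare the totals. This is an informal illustration of the dictionary-reset and per-block-overhead effects, not the benchmark methodology described later in this post.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class BlockSizeTradeoff {
    // Compress data as a sequence of independent GZip members of at most blockSize bytes.
    static int gzippedSize(byte[] data, int blockSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < data.length; off += blockSize) {
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data, off, Math.min(blockSize, data.length - off));
            }
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        // Moderately repetitive text, loosely similar to logs or XML
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50_000; i++) {
            sb.append("<entry id=\"").append(i % 1000).append("\"/>\n");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        int small = gzippedSize(data, 64 * 1024);   // BGZip-style block size
        int large = gzippedSize(data, 512 * 1024);  // MiGz's default block size
        System.out.println("64KB blocks:  " + small + " bytes");
        System.out.println("512KB blocks: " + large + " bytes");
        // Each small block restarts with an empty dictionary and adds its own
        // header/footer, so `small` typically exceeds `large` for the same input.
    }
}
```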
As with BGZip, a MiGz’ed file could also be indexed for random access, although the larger default block size would make small reads more expensive (requiring decompressing more data per read); this is not implemented, however.
At LinkedIn, we use a variety of platforms for testing. In this instance, we tested MiGz on a MacBook Pro with an SSD (measured disk read speed of 1.46GB/s), 16GB of RAM, and four (hyperthreaded) physical cores, using an 18.1GB XML dump of the German Wikipedia, chosen over the English dump for its smaller size.
Experimental results. Each experiment was run 5 (compression) or 20 (decompression) times and the results averaged. Output (compressed or decompressed bytes) was written to /dev/null. Input files were read with cat prior to each experiment to avoid bias due to disk caching.
The default compression level used by GZip is 6; 9 is “maximum compression.” The speedup from MiGz is approximately proportional to the number of cores for both compression and decompression, with virtually identical file sizes for MiGz and GZip at both compression levels. We also tested pigz, which, as expected, had approximately the same compression times as MiGz and the same decompression times as GZip (pigz, like GZip, decompresses on a single thread). Compressed file sizes for MiGz, GZip, and pigz were all within 1% of each other.
We’ve been using MiGz to speed up our data processing and machine learning workflows at LinkedIn for a while, with excellent results. Now, with its public release, we hope it will be useful in many more projects outside the company.
Thanks to Juan Bottaro and Carl Steinbach for their review and many helpful suggestions, and to Dan Bikel and Romer Rosales for their encouragement and support of this work.