Open-sourcing PalDB, a lightweight companion for storing side data
October 26, 2015
LinkedIn’s data products give our members recommendations, analytics and insights. As we continue to invent more products that leverage data for our members, we need to push the envelope in our data processing capabilities and do more with less.
One issue that often comes up is how to improve the usability and memory efficiency of side data. Side data is the extra read-only data needed by a process to do its job. For instance, a list of stopwords used by a natural language processing algorithm is side data. Machine learning models used in machine translation, content classification or spam detection are also side data. When this side data becomes large, it can be a bottleneck for the applications that depend on it.
To solve this problem we created PalDB, which we are presenting today as a new open-source project developed at LinkedIn. PalDB does more with less by providing a new read-only embeddable database that makes it much easier to scale side data.
As the demand for side data grew at LinkedIn, we realized that existing solutions were either too cumbersome or overkill. Solutions relying on raw data files (CSV, JSON, Avro, Thrift, etc.) don't really solve the problem, as the data has to be loaded in memory to be easily accessible. That requires custom reading and parsing code and the use of in-memory data structures, which are very costly in memory. On the other hand, solutions based on embeddable key-value stores (LevelDB, RocksDB) offer the right functionality but are slower than in-memory data structures. PalDB aims to fill this gap by offering the advantages of embeddable key-value stores without sacrificing speed.
Below, we compare various common techniques employed to store and read side data and how PalDB fits in:
PalDB can replace in-memory data structures for storing side data, with comparable query performance while using an order of magnitude less memory. It also greatly simplifies the code needed to operate this side data, as PalDB stores are single binary files, manipulated with a very simple API (see below for examples).
At LinkedIn, PalDB is used in analytics workflows and machine learning applications. Its usage is especially popular in Hadoop workflows because memory is scarce yet critical to speed things up. In this context, PalDB often enables map-side operations (e.g. join) which wouldn't be possible with classic in-memory data structures (e.g. Java collections). For instance, a set of 35M member IDs would only use ~290MB of memory with PalDB versus ~1.8GB with a traditional Java HashSet. Moreover, as PalDB's store files are single binary files, they are easy to package and use with Hadoop's distributed cache mechanism.
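To make the 35M member ID figures concrete, the short calculation below converts them into a per-ID cost. It is only arithmetic on the numbers quoted above; the class name `MemoryPerId` is hypothetical, and decimal units (1MB = 10^6 bytes) are an assumption, since the post does not say which convention it uses.

```java
public class MemoryPerId {
    /** Average number of bytes used per stored entry. */
    static double bytesPerEntry(double totalBytes, long entries) {
        return totalBytes / entries;
    }

    public static void main(String[] args) {
        long memberIds = 35_000_000L;   // 35M member IDs, as quoted in the post
        double palDbBytes = 290e6;      // ~290MB (decimal units assumed)
        double hashSetBytes = 1.8e9;    // ~1.8GB (decimal units assumed)

        double palDbPerId = bytesPerEntry(palDbBytes, memberIds);
        double hashSetPerId = bytesPerEntry(hashSetBytes, memberIds);

        System.out.printf("PalDB:   %.1f bytes per ID%n", palDbPerId);     // 8.3
        System.out.printf("HashSet: %.1f bytes per ID%n", hashSetPerId);   // 51.4
        System.out.printf("Ratio:   %.1fx%n", hashSetBytes / palDbBytes);  // 6.2x
    }
}
```

Under these assumptions the set costs roughly 8 bytes per ID in PalDB against roughly 51 bytes per ID in a HashSet, a ratio of about six.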
PalDB's API has two simple primary interfaces, one for writing and one for reading. Typically, one would use the writer's put method to create a PalDB store and set values. A PalDB store is a single file that can then be read using the reader's get method. PalDB stores can contain all primitive types (int, string, arrays, etc.) as keys or values and act as a large hash table. Types don't have to be uniform, so several types can be mixed within a store (see example below). In addition, PalDB supports custom Java classes when serializers are provided (see documentation for details).
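The write-once, read-many workflow described here can be illustrated without the library itself. The sketch below is a stand-in built only on the Java standard library: the names `SideStoreSketch`, `Writer` and `Reader` are hypothetical, and this is neither PalDB's API nor its file format. It serializes a map into one binary file and loads it fully onto the heap when reading, which is the very cost PalDB is designed to avoid, so it shows the shape of the workflow only.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SideStoreSketch {

    /** Hypothetical writer: buffers entries, then writes one binary file on close. */
    static final class Writer implements Closeable {
        private final Path file;
        private final Map<Object, Object> entries = new HashMap<>();

        Writer(Path file) { this.file = file; }

        /** Keys and values may be of mixed types, as long as they are Serializable. */
        void put(Object key, Object value) { entries.put(key, value); }

        @Override
        public void close() throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(new HashMap<>(entries));
            }
        }
    }

    /** Hypothetical reader: read-only view over the file written by Writer. */
    static final class Reader {
        private final Map<Object, Object> entries;

        @SuppressWarnings("unchecked")
        Reader(Path file) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                entries = Collections.unmodifiableMap((Map<Object, Object>) in.readObject());
            }
        }

        /** Returns the value for the key, or null if absent; the caller picks the type. */
        @SuppressWarnings("unchecked")
        <V> V get(Object key) { return (V) entries.get(key); }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("side-data", ".bin");

        // Write phase: mixed key and value types in a single store file.
        try (Writer writer = new Writer(file)) {
            writer.put("foo", "bar");
            writer.put(1213, new int[] {1, 2, 3});
        }

        // Read phase: the store is never modified after it has been written.
        Reader reader = new Reader(file);
        String text = reader.get("foo");
        int[] numbers = reader.get(1213);
        System.out.println(text + " " + numbers.length);  // prints "bar 3"

        Files.delete(file);
    }
}
```

For the real interfaces, including custom serializers, refer to the project's documentation.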
Because PalDB is read-only and exclusively focuses on data that can be held in memory, it is significantly less complex than other embeddable key-value stores and therefore allows a compact storage format and very high throughput. PalDB is specifically optimized for fast read performance and compact store sizes. Its performance can be benchmarked against in-memory data structures such as Java collections (e.g. HashMap, HashSet) and against other key-value stores (e.g. LevelDB, RocksDB).
Our current benchmark, run on a 3.1GHz MacBook Pro with an index of 10M integer keys, shows an average performance of ~2M reads/s while using six times less memory than a traditional HashSet. That is eight times the throughput of LevelDB (1.8) or RocksDB (3.9.0).
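A reads-per-second figure of this kind comes from timing a large number of random lookups. The harness below is a minimal sketch of that measurement against a plain Java HashSet; it is not the benchmark behind the numbers above, and the class name, method name and parameter values are all hypothetical. Results from such a simple timing loop vary widely with the JVM, heap size and hardware, so treat any number it prints as indicative only.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class ReadThroughputSketch {

    /** Times `reads` random lookups against a set of `keys` integers; returns lookups per second. */
    static double measureReadsPerSecond(int keys, int reads, long seed) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < keys; i++) {
            set.add(i);
        }

        // Pre-generate the probe keys so random number generation is not timed.
        Random random = new Random(seed);
        int[] probes = new int[reads];
        for (int i = 0; i < reads; i++) {
            probes[i] = random.nextInt(keys);
        }

        int hits = 0;
        long start = System.nanoTime();
        for (int probe : probes) {
            if (set.contains(probe)) {
                hits++;
            }
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);

        // Every probe is in [0, keys), so all lookups must hit; this also keeps
        // the loop from being optimized away.
        if (hits != reads) {
            throw new IllegalStateException("unexpected miss count: " + (reads - hits));
        }
        return reads * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        // One warm-up pass so the JIT has compiled the hot loop before measuring.
        measureReadsPerSecond(1_000_000, 1_000_000, 1L);
        double readsPerSecond = measureReadsPerSecond(1_000_000, 5_000_000, 2L);
        System.out.printf("HashSet: %.0f reads/s%n", readsPerSecond);
    }
}
```

A rigorous comparison would use a dedicated benchmarking framework and run the same key set against each store under test.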
PalDB has been released under the Apache 2 license. It is available for download on GitHub, along with its documentation. Contributions and suggestions are welcome.
PalDB was created by Mathieu Bastian, Engineering Manager in the Data Products team. Acknowledgments to Matthieu Monsch, Jay Kreps, Frank Astier and Chris Riccomini for their crucial suggestions and Igor Perisic for supporting the project.