Voldemort Collections: Iterating over a key-value store

Jason Ko

April 17, 2012

One of the many projects LinkedIn has contributed to open source is Project Voldemort, a distributed key-value storage system. I had the privilege of contributing to the project by implementing a small Collections library on top of Voldemort. The source is available on the Voldemort github.

Part of the challenge in building applications on top of a key-value store is the lack of higher level APIs to interact with the data in ways other than as a Map. This was a particular challenge when we wanted to convert our network updates (e.g., the status updates we show on the home page) from a relational database to Voldemort. In the examples that follow, I use these network updates as an simplified illustration of how we do this.

Traditional DB

In a traditional relational database management system (RDBMS), you have a table for users and a table for updates. A foreign key represents the one-to-many relationship between users and updates. Getting a user's updates is as simple as making a SQL query like "SELECT * FROM updates WHERE updates.user_id = users.user_id". The problem with a traditional RDBMS is that scaling becomes more difficult as you exceed the capabilities of the single machine.

A typical approach

One way to store sequential data in a key-value store is by creating a simple secondary index to store the list of element ids. In the example below, you have a store called index that keeps a map from a single user to a list of his or her updates. A second store, updates, keeps a map from the update_id to the actual data contained in the update (for example, "Alfred shared this interesting blog post").

Simple secondary index that stores a list of element ids
index: user_id => [update_id]
updates: update_id => update_data

The problem with this approach is that if a user is very active, the list of update ids can grow infinitely to an unmanageable size. Working with a list of this size would generate excessive disk access and increase network traffic.

Using a paged index

An alternate way to do this is by using a paged secondary index with some sort of chronologically based key. This way, we limit the number of entries contained in each list on a per month basis. In the following example, the index now stores a combination of user_id as well as month. In this way, you can make a query such as "what updates did Bruce make in the month of May?"

Secondary index paged by month
index: user_id, month => [update_id]
updates: update_id => update_data

This chronologically based key is great if the rate of incoming elements is going to stay constant over time--you can configure your keys so that the index stays a manageable size. Otherwise, it's hard to configure this number. If you store too few elements in a page, you risk requiring too many get() operations in order to operate on the elements. If there are too many elements in a page, you run the same risk of increased disk access and network latency.

VStack Collection

In order to address these issues, the VStack and VLinkedPagedList data structures in the Voldemort contrib collections library make this type of data access more natural and efficient.

The VStack<K, E> implements the Queue<E> interface and stores elements of type E, identified with a key of type K. The actual key in the Voldemort store is a record containing both the value of K, as well as an integer identifying the node ID for the stack node. For example, a stack representing user "Alfred" will have nodes with keys such as {"Alfred", 0}, {"Alfred", 1}, and {"Alfred", 2}. The value in the Voldemort store is a record containing the actual value inside the node, as well as integers identifying a forward pointer and a back pointer. The forward and back pointers refer to the node ID as previously mentioned in the key.

This way, we can use the nodes in the stack store as the elements of a doubly linked list. The first page starts at ID "0". Previous and next pages are linked via page ID in each list node. An Iterator is provided to allow sequential access over all the nodes. To bring it back to the network updates example, K represents user ID, and E represents the update ID. Using it is as simple as iterating over any standard java Collection.

VStack underlying Voldemort stores
stack: user_id, page_id => update_id, next_page, previous_page
updates: update_id => update_data

// iterating over a VStack
VStack<Integer, String> myUpdates = 
   new VStack<Integer, String>(1337 /* example userid */, 
                               storeClient /* Voldemort StoreClient */);
for (String update : myUpdates) {
   System.out.println(update);
}

This achieves our goal of sequential access to data, but it still has two drawbacks. First, since an element is stored in a separate list node, we need to execute a Voldemort get() call each time we want to see the next or previous element. This is impractical for lists more than a few elements in size, depending on the application's latency requirements. Second, it's very inefficient to have to start from the head or tail in order to reach an element that's in the middle of the list.

VLinkedPagedList to the rescue

To address these issues, we did several things. First, instead of storing a single element in each node, we store a list of elements within a single VStack node. Next, we built a secondary index, called page_index, on top of it. This secondary index associates a node ID with the highest value stored in the node. Using this information, we can jump into the middle of the stack via binary search, instead of starting from the head or tail. Also, by storing a list of ordered elements in a node instead of a single update, we rarely need to issue another get() operation to retrieve the next or previous node.

The finished structure is a VLinkedPagedList<I, LK extends Comparable<LK>>. In the context of network updates, I represents the user ID, and LK represents the update ID. The reason for LK (stands for "linear key") is that it needs to be Comparable in order to provide the searching capability. This is also helpful for paginating through a user's updates.

VLinkedPagedList underlying Voldemort stores
page_index: user_id => [{page_id, last_update_id}]
stack: user_id, page_id => [update_id], next_page, previous_page
updates: update_id => update_data

By using these new data structures, we are able to store any chronological set of elements, whether they come from a single user's update stream or posts within a high volume LinkedIn group. Currently these structures requires strict quorum where required-writes + required-reads > replication-factor. Future work on this can be done to implement a consistency resolver to resolve discrepancies at read time. Also, more data structures can be added to this library to further extend the usefulness of Voldemort in general.