Announcing the Voldemort 1.3 Open Source Release
March 19, 2013
I am excited to announce the availability of the Voldemort 1.3.0 Open Source release. In 2012, Voldemort experienced unprecedented levels of traffic growth, adoption rates, and data growth at LinkedIn. Along the way, we made a number of improvements to Voldemort that we've bundled into the new release:
- Performance: new storage layer, non-blocking socket checkouts, better read-only performance
- New features: build-and-push, Avro support, Kerberos support
- Operability: metadata management, auto bootstrapping, monitoring & admin API enhancements
In this post, I'll walk through these improvements as well as what we're planning for the future.
This release contains a number of performance improvements around the BDB-JE storage layer, client connection management and read-only index management.
New BDB-JE Storage Layer
With this release, we have upgraded our storage engine to Berkeley DB Java Edition (BDB-JE) 4.1.17, with a new on-disk format that eliminates the use of BDB-JE sorted duplicates. In addition to a general speed-up due to duplicate elimination, we've seen the following improvements:
- More control over the BDB cache
- The new layer is much more Java GC friendly since it moves data off the JVM heap (only the index sits in BDB cache) and also shields online traffic from large scan jobs (DataCleanup, Restore, Rebalance, Repair).
- We have seen up to 24x speed-ups for restore and rebalance operations due to partition scans.
- Streaming writes are handled much more efficiently by dynamically turning off the BDB checkpointer for the duration of the writes.
These improvements, along with cache partitioning, provide all the tools necessary to operate Voldemort clusters with predictable performance. All the bells and whistles are clearly documented in the VoldemortConfig class.
Unfortunately, the migration path was not straightforward. Relying on BDB sorted duplicates for persisting conflicting versions of the same key was fraught with inefficiencies and did not provide a migration path towards BDB-JE 5. After working closely with the BDB-JE team at Oracle, we decided to eliminate duplicates and convert data to a format where conflicting versions of a key's value are flattened into BDB as a single byte array. In addition, by default, the keys are stored by prefixing the partition-id to support efficient partition scans.
NOTE: data conversion is necessary before upgrade. Please read the pre-upgrade readme for instructions.
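The flattening idea can be sketched in a few lines. Note that the framing below and the width of the partition-id prefix are illustrative assumptions, not Voldemort's actual on-disk format:

```java
import java.nio.ByteBuffer;

// Sketch of the "flattened duplicates" idea: conflicting versions of a
// key's value are packed into one byte array instead of relying on BDB-JE
// sorted duplicates. The exact framing used by Voldemort is an assumption.
public class FlattenedValues {

    // Pack [count][len0][bytes0][len1][bytes1]... into a single array.
    static byte[] flatten(byte[][] versions) {
        int size = 4;
        for (byte[] v : versions)
            size += 4 + v.length;
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(versions.length);
        for (byte[] v : versions) {
            buf.putInt(v.length);
            buf.put(v);
        }
        return buf.array();
    }

    static byte[][] unflatten(byte[] flat) {
        ByteBuffer buf = ByteBuffer.wrap(flat);
        byte[][] versions = new byte[buf.getInt()][];
        for (int i = 0; i < versions.length; i++) {
            versions[i] = new byte[buf.getInt()];
            buf.get(versions[i]);
        }
        return versions;
    }

    // Prefixing the partition id (width assumed here) makes a scan over one
    // partition a contiguous key-range scan in the B-tree.
    static byte[] prefixPartition(short partitionId, byte[] key) {
        ByteBuffer buf = ByteBuffer.allocate(2 + key.length);
        buf.putShort(partitionId);
        buf.put(key);
        return buf.array();
    }
}
```

Because all versions of a key live under a single B-tree entry, a read or write touches exactly one record, and partition scans never have to skip interleaved duplicates.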
Non-blocking Client Socket Checkouts
The client now masks server performance degradations much more gracefully, greatly reducing the impact on online traffic. In earlier versions, the connection pool (KeyedResourcePool) had a subtle blocking issue when attempting to issue requests in parallel. Even though the network communication was done in parallel, the act of obtaining a socket from the connection pool (a.k.a. a socket checkout) was done serially. Thus, when the connection pool was very small or when the server exhibited degraded performance, the design was inherently prone to blocking for an available connection before it could fire off the parallel requests.
The new release implements a QueuedKeyedResourcePool that enables truly parallel communication with the server. Instead of blocking to acquire a socket to a server, all the pertinent state is wrapped up in a ClientRequestExecutorPool.AsyncSocketDestinationRequest, which can be serviced asynchronously. This design change effectively yields two distinct queues for acquiring sockets: one for synchronous requests and one for asynchronous requests. The impact of these changes is improved average, 95th, and 99th percentile latencies for put operations.
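The queued-checkout idea can be modeled in a few lines. This is a single-threaded toy sketch; the real QueuedKeyedResourcePool is keyed by destination and thread-safe:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Toy model of queued (non-blocking) checkout: instead of blocking when no
// socket is free, the request is parked in a queue and completed later,
// when a resource is checked back in.
public class QueuedPool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Deque<Consumer<T>> waiters = new ArrayDeque<>();

    public QueuedPool(Iterable<T> resources) {
        for (T r : resources)
            idle.add(r);
    }

    // Non-blocking checkout: run the request now if a resource is free,
    // otherwise queue it for later. The caller never waits.
    public void checkout(Consumer<T> request) {
        T r = idle.poll();
        if (r != null)
            request.accept(r);
        else
            waiters.add(request);
    }

    // Checkin hands the resource straight to the oldest waiter, if any.
    public void checkin(T r) {
        Consumer<T> next = waiters.poll();
        if (next != null)
            next.accept(r);
        else
            idle.add(r);
    }
}
```

The key property is that a slow destination backs up only its own queue; the caller issuing parallel requests to several destinations is never stalled on one of them.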
Read-Only Storage Engine Improvements
The read-only storage engine now mlock()'s index files, effectively pinning them in memory. This change improves the predictability of read-only server performance by preventing the index from being paged out in favor of data blocks. This is especially useful when new data from Hadoop is being swapped in by a build-and-push job. At LinkedIn, we have seen up to 20% performance improvement with this change for certain workloads.
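Pure Java has no portable mlock(), so the server's pinning relies on a native call; as a rough stdlib-only illustration of the same intent, MappedByteBuffer.load() asks the OS to fault the index pages into physical memory (a weaker hint that does not actually lock them):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Stdlib-only sketch of keeping an index file memory-resident. The real
// server pins pages with the native mlock() system call, which plain Java
// cannot reach; load() merely pre-faults the pages without locking them.
public class IndexWarmer {
    public static MappedByteBuffer mapAndLoad(String path) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer index = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            index.load(); // best-effort: bring all pages into physical memory
            return index;
        }
    }
}
```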
This release also arrives with some interesting new features.
Build and Push
At LinkedIn, we make extensive use of a workflow that moves data from our Hadoop clusters into our Voldemort clusters for online serving. We leverage the fault tolerance and parallelism of Hadoop to build individual Voldemort node/partition-level data files, and a build-and-push Azkaban job to load them into Voldemort read-only stores.
Check out Build and Push Jobs for Voldemort Read Only Stores for more information and a starter package.
Avro Schema Evolution
We are introducing support for Avro schema evolution in Voldemort: as application logic changes, it's now possible to add new fields to data in existing stores. This is supported via a new serializer type called avro-generic-versioned (see the AvroVersionedGenericSerializer class). You can define the value serializer in the store definition as avro-generic-versioned and increment the version attribute in the <schema-info> tag to denote a new schema version.
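For example, a value-serializer section carrying two schema versions might look like the following (the Member record and its fields are illustrative, not from the release):

```xml
<value-serializer>
  <type>avro-generic-versioned</type>
  <!-- version 0: original schema -->
  <schema-info version="0">
    {"type": "record", "name": "Member", "fields": [
      {"name": "id", "type": "int"}]}
  </schema-info>
  <!-- version 1: evolved schema adds a field with a default -->
  <schema-info version="1">
    {"type": "record", "name": "Member", "fields": [
      {"name": "id", "type": "int"},
      {"name": "country", "type": "string", "default": "unknown"}]}
  </schema-info>
</value-serializer>
```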
This feature supports a rolling upgrade: a client that has knowledge of the latest schema can read a record serialized with an older schema, with defaults in place for the new fields, and can also write records conforming to any of the valid schemas. However, a client that does not have knowledge of the latest schema will experience exceptions when it tries to read a record serialized with the latest schema.
To mitigate this, clients can be bounced to pick up the new schema immediately; otherwise, they will eventually auto-bootstrap the new metadata (explained below). We recommend against evolving the key schema, which is dangerous and can break things. Also, schema updates should always be done via the admin tool so that backwards-compatibility checks can be performed, protecting against data corruption due to a bad schema.
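Under the hood, a versioned serializer must tag each record with the schema version it was written with, so that a reader can pick the matching schema and fail fast on versions it doesn't know. The framing below is an assumption for illustration, not Voldemort's actual wire format:

```java
import java.util.Arrays;
import java.util.Map;

// Sketch of version-tagged framing (format assumed, not Voldemort's actual
// one): a one-byte schema-version tag precedes the serialized payload. A
// reader that does not know the tagged version must raise an error, which
// is the exception an out-of-date client would see.
public class VersionedCodec {
    static byte[] tag(byte version, byte[] payload) {
        byte[] out = new byte[1 + payload.length];
        out[0] = version;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    // knownSchemas maps version -> schema (a name placeholder here).
    static byte[] untag(byte[] framed, Map<Byte, String> knownSchemas) {
        byte version = framed[0];
        if (!knownSchemas.containsKey(version))
            throw new IllegalStateException("Unknown schema version " + version
                    + ": client must refresh its metadata");
        return Arrays.copyOfRange(framed, 1, framed.length);
    }
}
```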
Kerberos support for HdfsFetcher
Voldemort now supports Kerberos authentication so that it can fetch data from Kerberized Hadoop grids. The Voldemort server, based on the keytab file path and Kerberos principal user configurations, authenticates against the respective Kerberos server using the UserGroupInformation.loginUserFromKeytab call for HDFS/WebHDFS fetches. In this case, the FileSystem object used for fetching data is obtained in a secure doAs block. Once a valid (secure) FileSystem object is obtained, we can continue to use the same Hadoop FileSystem commands as before without a secure block. In the case of the hftp protocol, the fetch is done using the old code path for backwards compatibility with old Hadoop grids.
Lastly, we have improved the overall operability of Voldemort by implementing a whole new metadata management mechanism and adding more monitoring.
System Stores
This release introduces the notion of 'system stores' within Voldemort. System stores are special stores (like slop) with the name prefix voldsys$, which are used to store useful information about Voldemort clients, servers, and stores. Clients also have access to this information.
For example, we have added a system store called voldsys$_client_registry, which tracks useful information about every client that connects to the cluster, including last bootstrap time, a context name to identify the client, deployment path, client host name, time of last activity, client library version, and all of the client configurations. We believe system stores offer a generic mechanism for building more useful metadata management features and lead to much greater visibility into Voldemort usage.
Auto Bootstrapping
There is currently no well-defined channel to propagate metadata changes from the servers down to the clients. Although an InvalidMetadataException will be thrown to the client when the cluster topology changes, this does not work well when a node is swapped in place of an existing host or when the store schema changes.
As the first project leveraging the power of system stores, this release includes a mechanism to automatically sync client metadata with the server, implemented as a wrapper around DefaultStoreClient called the ZenStoreClient. Specifically, each metadata version is now persisted in a voldsys$_metadata_version_persistence system store with a monotonically increasing version number. A separate AsyncMetadataVersionManager thread on the client periodically scans the server for newer versions of the metadata and updates the local state as needed.
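The polling logic can be modeled roughly as follows. This is a simplified, single-iteration sketch; the class and method names below, other than the store name, are hypothetical:

```java
import java.util.function.LongSupplier;

// Simplified model of the client-side metadata version check: compare the
// server's persisted metadata version against the locally cached one and
// trigger a re-bootstrap when the server is ahead. (The real
// AsyncMetadataVersionManager runs this periodically on a background thread.)
public class MetadataVersionChecker {
    private final LongSupplier remoteVersion; // reads voldsys$_metadata_version_persistence
    private final Runnable rebootstrap;       // refreshes cluster/store metadata
    private long localVersion;

    public MetadataVersionChecker(LongSupplier remoteVersion, Runnable rebootstrap, long initial) {
        this.remoteVersion = remoteVersion;
        this.rebootstrap = rebootstrap;
        this.localVersion = initial;
    }

    // One polling iteration; returns true if new metadata was picked up.
    public boolean checkOnce() {
        long remote = remoteVersion.getAsLong();
        if (remote > localVersion) {
            rebootstrap.run();
            localVersion = remote;
            return true;
        }
        return false;
    }
}
```

Because the check compares a single monotonically increasing counter, it catches topology changes, host swaps, and schema changes uniformly, without relying on a request failing first.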
Improved Server Monitoring
This release has much improved server performance and scalability monitoring, including:
- Statistics about streaming operations (FetchFile) are available for each individual store.
- New monitoring points expose insights into the scalability of the server NIO layer.
- New monitoring points expose BDB exceptions like EnvironmentFailureException, which can be useful for catching bad disks quickly.
- New BDB hooks, namely getBtreeStats(), provide additional information for deep troubleshooting. Note that these hooks are not cheap and should be used with caution.
Admin API Enhancements
This release adds support for fetching data orphaned on Voldemort servers; for example, this can happen when RepairJob is not run after rebalancing. We've also implemented mirroring support for streaming stores from a Voldemort server in one cluster to a Voldemort server in another cluster, in the absence of online writes.
What's on the roadmap?
We have some very interesting projects lined up this year:
- We will be REST-ifying Voldemort, including a REST-ful thin client, a REST-ful coordinator/router and a REST-ful storage server.
- We will also be actively working on making Voldemort robust for multi-datacenter (>3) deployments, including tools to perform zero-downtime expansions into a new data center and store migrations.
- We will continue to improve the out-of-the-box BDB-JE performance for Voldemort and will also look into adding BDB-JE 5 support, since the current release removes all the roadblocks around it.
- Also in the works is a source-agnostic, REST-ful streaming platform for Voldemort, to stream data from warehouse/Oracle/Hadoop sources into a read-write Voldemort store.
Call for contributions
We are looking for people from the open source community to contribute to phase II of the Metadata Management project, which is currently on hold due to the lack of resources. This involves managing client configurations via system stores and being able to dynamically override the stored client configs.
We also want to make client metadata operations around stores.xml more efficient by operating on individual stores as opposed to the entire file, and to add a notification mechanism that updates the relevant clients when a specific store's definition changes on the server.
We encourage you to use the Project Voldemort Forum to share ideas and ask questions.