Venice Performance Optimization
April 11, 2018
Over the last two years at LinkedIn, we have been working on a distributed key-value database called “Venice.” Venice is designed to be a significant improvement to Voldemort Read-Only for serving derived data. We’ve built out the rest of the functionality to complete the dream that motivated the construction of Venice: the ability to consume data from both batch and streaming sources, which we’re calling “Venice Hybrid.” You can get more details about how Venice works from these previously published Venice-related posts.
Right now, Venice has become the primary solution for a derived data platform inside LinkedIn. Currently, Venice ingests more than 25TB (un-replicated) daily and serves over 100K QPS (per-DC). Those numbers are growing very quickly as we migrate from the existing Voldemort data platform and as new customers onboard.
Regarding read performance, Venice aims to provide single-digit milliseconds of latency for single-key lookup. For ingestion performance, the expectation is to be 50% faster than the existing Voldemort data platform due to the reduced cross-DC data transmission in Venice. These goals have been a big challenge, since Venice has only been in existence for about two years. So far, we are very close to the original expectations, after adopting various performance tuning strategies.
In this post, I am going to share some of the strategies we have adopted to optimize data ingestion rate and read performance. At the end, we will discuss the new challenges we continue to face.
This section will describe the main functionalities of the key components, which will be referenced when discussing various optimization strategies.
Hadoop
Inside Venice, Hadoop is used to host input files and serves as the computing grid.
Venice Push Job
Venice Push Job is a map-reduce job that reads data files off Hadoop and produces key/value messages to the Venice data platform.
Kafka
Kafka works as the source-of-truth data storage inside Venice, and every data message flows through Kafka before being persisted in the Venice data platform.
Kafka Mirror Maker (Kafka MM)
For cross-DC replication, Venice employs Kafka Mirror Maker to transmit data from a source cluster to multiple destination clusters.
Venice Storage Node
Venice Storage Node is the persistence layer, which is in charge of reading data off the Kafka broker and persisting it to a local Berkeley DB Java Edition database.
Berkeley DB Java Edition (BDB-JE)
BDB-JE uses a log-structured disk representation for its database objects. The combination of object-level caching and log-structured storage permits fine-grain latching and record-level locking for highly concurrent applications. The log-structured design provides great write performance and fairly good read performance for applications whose data is stored on SSDs.
BDB-JE database structure
Credit: Berkeley DB Java Edition Architecture, an Oracle white paper
Internally, BDB-JE maintains a B+ tree index per database, and the index will be serialized into the same log files together with the actual data.
IN (internal node)
These nodes, which reference other nodes, are called internal nodes (IN) and are implemented as arrays of key/leaf pairs.
BIN (bottom internal node)
BIN is a subclass of IN providing additional support for cursors, and BINs store references to leaf nodes.
BINDelta contains the information needed to create a partial (delta) BIN log entry. This structure exists to reduce disk usage when BDB-JE needs to swap BIN entries in and out because of memory pressure.
LN (leaf node)
The actual data node, which contains an individual key/data pair.
Venice Router
Venice Router is the gateway layer in front of the Venice Storage Node. Its main functionality is to find available storage nodes for the incoming requests, scatter out requests, gather responses, and return the consolidated responses back to the Venice Client.
Dynamic Discovery (D2)
Dynamic Discovery (D2) is a layer of indirection similar to DNS. Functionally speaking, D2 translates a URI like d2://<my d2 service> to another address like http://myD2service.something.com:9520/someContextPath.
Venice Client
Venice Client is a thin library, which is used in the application to communicate with Venice. Internally, it uses D2 as the load balancer.
Venice Ingestion Pipeline
This section will discuss the data flow for the Venice Ingestion Pipeline, and the tuning strategies will be broken down by components.
The above diagram demonstrates the data flow for a typical Venice data push:
Venice Push Job will run a map-reduce job to read data off a Hadoop cluster and produce a Venice data message to the source Kafka cluster;
Kafka Mirror Maker, deployed in the destination cluster, will replicate all the messages from the Kafka cluster in the source data center to the Kafka cluster in the destination data center;
Venice Storage Node will consume the Venice data message from the local Kafka cluster and persist data in the BDB-JE database.
Optimization strategies for Venice Ingestion Pipeline
Venice Push Job: Introduce a reducer phase to produce sorted messages
When persisting key/value pairs locally on a Venice Storage Node, BDB-JE first tries to find the proper BIN for the key. If that BIN is not in memory, BDB-JE needs to swap it into memory, update it, and flush it to disk later. To avoid dumping too much duplicate BIN data to disk (caused by multiple rounds of swapping in/out), BDB-JE introduces BINDelta, which only contains the delta relative to the last write. The idea is good for reducing disk usage; however, it doesn't behave well when ingesting a very large amount of unordered key/value pairs with limited RAM. Based on our testing, BDB-JE can sometimes produce database files that are 10x the size of the original input data files.
To solve this excessive disk usage problem, we introduced a reducer phase to the map-reduce job in Venice Push Job, which originally ran as a map-only job. With this change, all the key/value pairs arrive at the BDB-JE database in sorted order, so heavy swapping of BIN entries no longer happens: consecutive inserts only touch a small number of BINs, thanks to the sorted nature of the B+ tree.
With this strategy, the disk usage was reduced by more than six times, and the overall push time was reduced by more than five times for some use cases.
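To see why sorted input helps, here is a small, self-contained simulation (not BDB-JE's real code: the BIN capacity, cache size, and LRU policy below are illustrative) that counts how many times a "BIN" must be swapped into a limited cache for sorted versus shuffled insert orders:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public class BinSwapSim {
    // Toy model of BIN swapping: keys map to fixed-capacity "BINs", and an LRU
    // cache holds a limited number of BINs in memory. Every cache miss is a
    // swap-in (and eventually a re-write of that BIN to disk).
    static long countSwapIns(int[] keys, int binCapacity, int cachedBins) {
        LinkedHashMap<Integer, Boolean> cache =
                new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                return size() > cachedBins;  // evict the least-recently-used BIN
            }
        };
        long swapIns = 0;
        for (int key : keys) {
            int bin = key / binCapacity;            // the BIN this key belongs to
            if (!cache.containsKey(bin)) swapIns++; // BIN not in memory: swap it in
            cache.put(bin, Boolean.TRUE);
        }
        return swapIns;
    }

    public static void main(String[] args) {
        int n = 100_000;
        int[] sorted = new int[n];
        for (int i = 0; i < n; i++) sorted[i] = i;
        int[] shuffled = sorted.clone();
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) {           // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int tmp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = tmp;
        }
        System.out.println("sorted:   " + countSwapIns(sorted, 128, 64));
        System.out.println("shuffled: " + countSwapIns(shuffled, 128, 64));
    }
}
```

With these toy parameters, sorted input swaps each BIN in exactly once, while shuffled input misses the small cache on almost every insert, producing far more swap-ins — the same pattern that inflated the database files above.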
Kafka and Kafka Mirror Maker (MM): Bump up Kafka topic partition number and Kafka MM batch size
Essentially, Kafka MM contains a Kafka producer and a Kafka consumer. Internally, it maintains one batch per partition to guarantee correct ordering. The replication performance is bounded by the round-trip latency between source cluster and destination cluster. When the round-trip latency is high, it is expensive to acknowledge each batch, so it is better to send as much data as possible for each batch to amortize the overhead.
Based on this observation, we increased the partition number per topic to improve parallelism and increased the batch size in Kafka MM from 100KB to 1MB, and the replication rate improved greatly. In our test setup, we pushed 100GB to several production clusters, one of which had a long round-trip time (RTT); the original push time was about 13 hours, but after this change, the push time was reduced to 40 minutes, a roughly 20x improvement!
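As a sketch, the producer-side settings look like the following (expressed as java.util.Properties for illustration; Mirror Maker takes them via its producer config file). The batch.size bump mirrors the 100KB-to-1MB change described above; the broker address and the other values are assumed examples, not Venice's actual configuration:

```java
import java.util.Properties;

public class MirrorMakerProducerTuning {
    // Producer-side Kafka settings for Mirror Maker replication over a
    // high-RTT link: bigger batches amortize the per-batch acknowledgement cost.
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "dest-kafka:9092");    // hypothetical broker list
        props.put("batch.size", String.valueOf(1024 * 1024)); // 1MB batches (was 100KB)
        props.put("linger.ms", "100");   // wait briefly so batches can fill (assumed value)
        props.put("acks", "all");        // don't lose mirrored data
        return props;
    }
}
```

Raising the topic partition count is the complementary half of the change: each partition carries its own in-flight batch, so more partitions means more batches in flight concurrently.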
Venice Storage Node: Enable BDB-JE deferred write and optimize the thread-model for data ingestion
Besides transactional mode, BDB-JE provides another kind of database: a deferred write database. The key idea of a deferred write database is to delay the write operations as long as possible, and those changes are only guaranteed to be durable after each synchronization to disk.
Deferring writes in this manner has two performance advantages when performing database modifications:
When multiple threads are performing writes, concurrency is increased because the bottleneck of writing to the log is avoided.
Less total writing takes place. If a single record is modified more than once, or modified and deleted, then only the final result must be written. If a record is inserted and deleted before a database sync or close occurs, nothing at all is written to disk. The same advantage holds for writing internal index information.
Deferred write databases are useful for applications that perform a great deal of database modifications, record additions, deletions, and so forth. By delaying the data write, you delay the disk I/O.
Deferred write databases are a perfect fit for Venice data pushes, since Venice performs a large amount of record additions in a short period. Based on our testing, the ingestion rate doubles compared to a transactional database.
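Enabling deferred write is a small configuration change at database-open time. The sketch below uses real BDB-JE API calls, but the environment path and store name are illustrative:

```java
import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class DeferredWriteExample {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("/tmp/venice-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setDeferredWrite(true);  // buffer changes in cache; no per-write log I/O
        Database db = env.openDatabase(null, "store_v1", dbConfig);

        // ... bulk db.put(...) calls during the push ...

        db.sync();   // flush deferred changes; only now are they durable on disk
        db.close();
        env.close();
    }
}
```

The trade-off is that a crash before sync() loses the buffered changes, which is tolerable for bulk pushes that can be replayed from Kafka, the source of truth.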
Originally, Venice Storage Node used a typical thread model to consume Kafka messages: spinning up one thread that both polls messages from Kafka and persists them to the local BDB-JE database. There are two shortcomings to this thread model:
When there are only a few store pushes ongoing, this thread model is not efficient, since a lot of time is spent on waiting for messages from Kafka and CPU resources are underutilized;
When there are a lot of store pushes ongoing, this thread model introduces a lot of context switches. The BDB-JE persistence logic is CPU-intensive, so the overall performance will degrade.
To improve the ingestion efficiency, we modified the threading strategy.
As you can see from the diagram, the new threading model uses a fixed-size thread pool to persist data messages to the database, so far fewer threads are needed when many pushes are executing at the same time.
With the new thread model, the memory/CPU usage is more predictable during ingestion and the full storage node capability is utilized even when there are only a few ingestion jobs running.
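The new model can be sketched with stdlib primitives: consumer threads enqueue records, and a fixed-size writer pool persists them. The persistence step below is a simple counter standing in for a BDB-JE put, and all names are illustrative:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class IngestionPool {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ExecutorService writers;
    private final AtomicInteger persisted = new AtomicInteger();
    private volatile boolean done = false;

    IngestionPool(int writerThreads) {
        // Fixed writer count: thread usage no longer scales with the number of
        // concurrent store pushes, so CPU/memory stay predictable.
        writers = Executors.newFixedThreadPool(writerThreads);
        for (int i = 0; i < writerThreads; i++) {
            writers.submit(() -> {
                try {
                    while (true) {
                        String record = queue.poll(50, TimeUnit.MILLISECONDS);
                        if (record != null) persisted.incrementAndGet(); // stand-in for a BDB-JE put
                        else if (done) break;   // push finished and queue drained
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    void enqueue(String record) { queue.add(record); }  // called by Kafka consumer threads

    int finishAndAwait() throws InterruptedException {
        done = true;                 // no more records will arrive
        writers.shutdown();
        writers.awaitTermination(10, TimeUnit.SECONDS);
        return persisted.get();
    }
}
```

Because the pool is shared, a single slow store cannot monopolize threads, and a single busy store can still use the whole pool when nothing else is ingesting.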
Venice read path
Venice uses a three-layer architecture to serve read requests, as illustrated by the following diagram.
Optimization strategies for Venice read path
Venice Client: D2 sticky routing with router cache
From our experience operating Venice, we noticed that several customers encountered hot-key problems caused by skewed access patterns. This hot-key issue also makes quota allocation much harder, since quota is enforced per storage node. To alleviate the problem, we introduced a router caching layer.
Venice Client internally uses D2 as the load balancer, and the built-in sticky routing feature inside D2 is useful for caching: D2 will try its best to send the same request to the same instance. With this routing strategy, each router only serves a unique set of keys based on key hashing, so the caches across routers have minimal duplication, and the whole router cluster works as a distributed cache layer.
Based on our experiment, the hotkeys hit the router cache all the time, and the latency of cache-hit requests is reduced by about 40% compared to cache-miss requests.
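One way to get this sticky property is rendezvous (highest-random-weight) hashing — D2's actual algorithm may differ; this sketch only demonstrates why per-router caches barely overlap: a key always scores the routers the same way, so it always lands on the same one, while different keys spread across the fleet.

```java
import java.util.List;

public class StickyRouting {
    // Pick a router for a key by scoring every (key, router) pair and taking
    // the highest score. The hash mixing here is a cheap illustration, not a
    // production-quality hash.
    static String pickRouter(String key, List<String> routers) {
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (String router : routers) {
            long score = (key + "#" + router).hashCode() * 2654435761L;
            if (score > bestScore) {
                bestScore = score;
                best = router;  // router with the highest score wins this key
            }
        }
        return best;
    }
}
```

A nice side effect of rendezvous hashing is stability: when a router leaves the fleet, only the keys it owned move, so most cache entries elsewhere remain valid.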
Venice Router: Pool of connection pools, router-to-storage sticky routing, and long-tail retry
To relay client requests to a storage node, Venice Router used to maintain a global connection pool to alleviate connection setup overhead. This connection pool was shared by all the requests coming to the same router, and connection checking in/out was synchronized by the same lock, which caused a lot of contention during high QPS. We didn’t figure out a good way to avoid this lock contention by using one single connection pool, so we tried another strategy: using a pool of connection pools. With this strategy, the lock contention behavior was mitigated, and both the throughput and latency were improved.
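A minimal sketch of the pool-of-pools idea, assuming each connection is checked back in on the thread that borrowed it (class and method names are illustrative, not Venice's actual code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

public class PoolOfPools<T> {
    // Several small pools instead of one global pool: each thread hashes to a
    // sub-pool, so checkout/checkin contention is spread over N queues instead
    // of a single shared lock.
    private final BlockingQueue<T>[] pools;

    @SuppressWarnings("unchecked")
    PoolOfPools(int numPools, int connsPerPool, Supplier<T> connectionFactory) {
        pools = new BlockingQueue[numPools];
        for (int i = 0; i < numPools; i++) {
            pools[i] = new ArrayBlockingQueue<>(connsPerPool);
            for (int j = 0; j < connsPerPool; j++) pools[i].add(connectionFactory.get());
        }
    }

    private BlockingQueue<T> subPool() {
        // Different threads land on different queues, diluting lock contention.
        return pools[(int) (Thread.currentThread().getId() % pools.length)];
    }

    T checkout() {
        try {
            return subPool().take();  // blocks if this sub-pool is exhausted
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    void checkin(T conn) { subPool().add(conn); }
}
```

The trade-off is mild imbalance — one sub-pool can run dry while another has idle connections — which is usually acceptable in exchange for removing the single hot lock.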
Venice is a distributed storage system, and it maintains multiple replicas for each partition. Without enforcing any scattering pattern, each replica serves the full key space, which makes the Venice Storage Node memory-inefficient. By enabling the sticky routing feature in the Router (the same idea as D2 sticky routing), each replica only needs to serve a small portion of the key space, which has demonstrably improved the memory efficiency of the Venice Storage Node.
For any application implemented in Java, GC is a perennial tuning concern. The Venice Storage Node also has GC issues during high-volume data ingestion, and GC tuning will be a long-term effort; GC pauses hurt end-to-end latency. To mitigate the GC impact, we adopted a long-tail retry strategy for when GC happens in a Storage Node. The basic idea is to retry the request against another Storage Node hosting the same partition when the waiting time of the original request exceeds a predefined threshold. With this feature, P99 latency has been reduced greatly and has become more predictable. Long GC pauses don't happen all the time, mostly just during large volumes of data ingestion, and the long-tail retry feature is triggered during those pauses, which is why P99 latency looks much better now.
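The long-tail retry idea can be sketched with CompletableFuture: fire the request at one replica, arm a timer at the threshold, and if the response hasn't arrived, fire the same request at a second replica and take whichever answer lands first. The replica calls are modeled as suppliers of async futures; names and thresholds are illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class LongTailRetry {
    private static final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "retry-timer");
                t.setDaemon(true);
                return t;
            });

    static <T> CompletableFuture<T> fetch(Supplier<CompletableFuture<T>> primary,
                                          Supplier<CompletableFuture<T>> backup,
                                          long thresholdMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        primary.get().whenComplete((v, e) -> {
            if (e == null) result.complete(v);       // first successful answer wins
            else result.completeExceptionally(e);
        });
        timer.schedule(() -> {
            if (!result.isDone()) {                  // primary is slow (e.g. GC pause)
                backup.get().whenComplete((v, e) -> {
                    if (e == null) result.complete(v); // ignore backup failures here
                });
            }
        }, thresholdMs, TimeUnit.MILLISECONDS);
        return result;
    }
}
```

Because CompletableFuture.complete is first-writer-wins, a late primary response is simply discarded once the backup has answered, and fast requests never pay the cost of a retry.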
Venice Storage Node: BDB-JE shared cache
BDB-JE allows each database to configure the size of its cache, which is used to store the index, and possibly data entries as well, for fast lookup. Initially, Venice assigned a fixed-size cache to every store, which was tedious and fragile because the required cache size varies over time, and it is very hard to pick the optimal cache size, since it depends on both the actual database size and the access pattern. To avoid this complexity, Venice changed the behavior to allocate one big chunk of cache that is shared across all the stores hosted on the same node. This way, cache allocation is much simpler, and the cache is utilized efficiently even when only a small number of stores receive high volumes of read requests.
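In BDB-JE terms, this is shared-cache mode: every per-store Environment opts into one process-wide cache instead of carrying its own budget. The API calls below are real BDB-JE ones; the cache size and the helper method are illustrative:

```java
import java.io.File;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class SharedCacheExample {
    // Opens one store's Environment with the JVM-wide shared cache enabled.
    // All Environments opened this way pool their cache memory, so hot stores
    // can use the space cold stores don't need.
    static Environment openStoreEnvironment(File envHome, long totalCacheBytes) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setSharedCache(true);           // share one cache across Environments
        envConfig.setCacheSize(totalCacheBytes);  // budget for the whole node, not per store
        return new Environment(envHome, envConfig);
    }
}
```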
GC pauses in Venice Storage Node
We haven’t gotten to the bottom of the GC issue in the Venice Storage Node; it is worth exploring in the future, since it still causes slightly elevated P99 latency.
Latency improvement for batch-get request
So far, most of the above optimizations only apply to single-get requests, and Venice cannot make any latency guarantees about batch-get requests. We would like to optimize batch-get latency in the near future.
Performance tuning is hard, and it is a continuous process. As data volume and traffic increase, we may need to adjust these optimization strategies accordingly. For now, Venice is very close to its original goals, and we will continue this effort to make it more stable and efficient.
Brainstorming and verifying these performance tuning strategies has been a team effort. Here I would like to thank my colleagues Charles Gao, Felix GV, Mathew Wise, Min Huang, Sidian Wu, and Yan Yan for their invaluable contributions and feedback. Special thanks to Jiangjie Qin, Dong Lin, and Sumant Tambe from the Kafka team for their input and suggestions. Thanks to the leadership team for supporting our efforts: Siddharth Singh, Ashish Singhai, Ivo Dimitrov, and Swee Lim.
Finally, thanks to our incredible SREs who keep the system running: Abhishek Bhargava, Ali Poursamadi, Amit Balode, Arjun Shenoy, Karthik Appigatla, Kian Chung, Nirupam Mohari, Vinoth Govindaraj, Zonia Harris.