
Building Venice: A Production Software Case Study

We build a lot of our own infrastructure systems here at LinkedIn. Many people have heard of Kafka, our distributed messaging system. We also run various databases, blob stores, and stream and image processing systems, all of which we develop, deploy, and maintain in-house. One of the systems we've been working on recently is Venice, a distributed derived-data serving platform. In this post, I want to share some of the considerations that mark the difference between a proof-of-concept or prototype tool and something that is production-ready for serving our over 450 million members. I'm going to explore three main areas: high availability, operability, and security.

High availability

Sharding and replication
This section covers sharding and replication, concepts many distributed systems engineers are already familiar with. Feel free to skip it if these are not new to you. For further reading, I highly recommend Amazon's Dynamo paper.

Imagine you're running a small web app. You have an application server running your app, and you're running a MySQL database on the same server. Your whole application is contained on a single machine. This works great! It's simple, easy to reason about, and the mean time between failures for a typical PC is around 3.4 years. If you operate like this, on average you might have a hardware-related site outage every three years or so. But then imagine you get some more users, maybe 450 million of them. All of your user information just doesn't fit on that single machine's MySQL instance anymore. If you want to keep using MySQL, you can do what a lot of companies over the last 15 years have done—you can shard your servers.

For illustrative purposes (don't actually do it this way), you could say all of your users have a first name, and shard the data based on the first letter of the users' first names. With 26 independent servers each running MySQL, you could put all the user information for users with first names that start with the letter "A" on the first server, the letter "B" on the second server, etc. The application knows which MySQL server to use for which users, and you can scale up the amount of data you have very effectively. But now we have to re-evaluate that MTBF (mean time between failures). If each machine has an MTBF of 3.4 years, then 26 machines together will have an MTBF of only 1.5 months. That is a much higher failure rate. With 450 million members, you can't really afford to have a data loss event every seven weeks. In practice, you would probably be sharding across several hundred machines; the MTBF for 200 such machines is less than a week. In building such a distributed system, we must treat hardware failure as a normal operating condition. Machine failure is an expected event, and our designs must take that into consideration. The standard way to deal with this need is to build in redundancy through replication.
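
To make the example concrete, here is a minimal sketch (in Java, purely for illustration, and again: don't shard this way in real life) of how an application might route a member to one of the 26 shards. The class and method names are made up for this post.

```java
import java.util.Locale;

/** Toy example: route a member to one of 26 MySQL shards by the first letter of their first name. */
public class FirstLetterSharding {

    /** Returns the shard index (0-25) for a first name; anything outside A-Z falls back to shard 0. */
    static int shardFor(String firstName) {
        char first = firstName.toUpperCase(Locale.ROOT).charAt(0);
        return (first >= 'A' && first <= 'Z') ? first - 'A' : 0;
    }

    public static void main(String[] args) {
        System.out.println("Alice -> shard " + shardFor("Alice"));  // 0
        System.out.println("Bob   -> shard " + shardFor("Bob"));    // 1
        System.out.println("Zoe   -> shard " + shardFor("Zoe"));    // 25
    }
}
```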

Let's return to the illustration of sharding by the first letter of our members’ first names. This strategy breaks up our database into 26 partitions. Instead of just putting partition "A" on machine 1, and partition "B" on machine 2, we can put partition "A" on machines 1 and 2. Then we can put partition "B" on machines 2 and 3. Likewise, partition "Z" can go on machines 26 and 1. In this manner, if any machine fails, we still have another copy of that partition.
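
A toy version of that placement rule might look like the following sketch, which uses zero-indexed nodes and a replication factor of two. Real placement logic in a production system is far more involved; this only shows the "each partition lives on two adjacent nodes" idea from the paragraph above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy replica placement for the 26-partition example: partition i lives on node i and node (i + 1) mod 26. */
public class RingReplication {

    private static final int NUM_NODES = 26;
    private static final int REPLICATION_FACTOR = 2;

    /** Returns the node indices that hold replicas of the given partition. */
    static List<Integer> replicasFor(int partition) {
        List<Integer> nodes = new ArrayList<>();
        for (int r = 0; r < REPLICATION_FACTOR; r++) {
            nodes.add((partition + r) % NUM_NODES);
        }
        return nodes;
    }

    public static void main(String[] args) {
        System.out.println("Partition A (0)  -> nodes " + replicasFor(0));   // [0, 1]
        System.out.println("Partition B (1)  -> nodes " + replicasFor(1));   // [1, 2]
        System.out.println("Partition Z (25) -> nodes " + replicasFor(25));  // [25, 0]
    }
}
```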

If you are building a distributed database, sharding and replication are the first things you implement. For the rest of the article, we take it as a given that we're working with a sharded and replicated system, then explore what else we need to make that system operable in a production setting.

Cluster management
Sharding and replication are great! Now we have a distributed system that is able to scale out to as much data as we need to put into it, and we won't lose any data when a node fails. But what do we do when a node fails? Now we have one fewer replica for every partition that was on the failed node. Our system is under-replicated, and that means we're at higher risk of data loss. This is why we need some sort of cluster management. While there are a lot of approaches, the important thing is that you plan for this and have a process to bring the replication factor back up to where it belongs.

One solution would be to have a fairly fixed listing of which partitions live on each node. We described one such mapping earlier, where we had 26 partitions and 26 nodes. Partition A went on nodes 1 and 2, partition B went on nodes 2 and 3, and so forth. Now if node 2 fails, your plan could be simply to replace it. You page your operations on-call, who fires up a spare node, identifies it as node 2, and lets the system restore the missing replicas.

With machines failing as often as every week, it would be nice if we didn't need to wake anybody up in the middle of the night to manually replace failed nodes. Automated cluster management takes the fixed list of partitions and nodes and turns it into a dynamic list that can be modified automatically by the cluster. Now when a node fails, we can let the system automatically redistribute the replicas so they can be restored on other nodes that are still in the system. This is a rebalance operation.
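
To illustrate the idea (this is a toy sketch, not how Helix works internally), a naive rebalancer could walk the partition-to-node assignment and re-create any replica that lived on the failed node on the least-loaded surviving node. All names here are hypothetical.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy rebalancer: when a node fails, restore its replicas on the least-loaded surviving nodes. */
public class SimpleRebalancer {

    /** assignment maps partition id -> set of node ids currently holding a replica of that partition. */
    static void rebalance(Map<Integer, Set<Integer>> assignment, int failedNode, Set<Integer> liveNodes) {
        for (Set<Integer> replicas : assignment.values()) {
            if (replicas.remove(failedNode)) {
                // Pick the live node hosting the fewest replicas that doesn't already hold this partition.
                liveNodes.stream()
                        .filter(node -> !replicas.contains(node))
                        .min(Comparator.comparingLong(node ->
                                assignment.values().stream().filter(r -> r.contains(node)).count()))
                        .ifPresent(replicas::add);
            }
        }
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> assignment = new HashMap<>();
        assignment.put(0, new HashSet<>(List.of(0, 1)));  // partition A on nodes 0 and 1
        assignment.put(1, new HashSet<>(List.of(1, 2)));  // partition B on nodes 1 and 2
        assignment.put(2, new HashSet<>(List.of(2, 0)));  // partition C on nodes 2 and 0

        Set<Integer> liveNodes = new HashSet<>(List.of(0, 2, 3));  // node 1 has failed; node 3 is a spare
        rebalance(assignment, 1, liveNodes);
        System.out.println(assignment);  // partitions A and B get their missing replica restored on node 3
    }
}
```

A real cluster manager also has to throttle this data movement and handle nodes that rejoin after a transient failure, which is part of why we rely on purpose-built tooling rather than something hand-rolled.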

As you can imagine, cluster management is a common need in various distributed systems. LinkedIn wrote and open-sourced Apache Helix as one solution for automated cluster management, and we use it in several of our production systems.

Zero-downtime upgrades
If you are building a web app, it probably links up to a database on the backend and you run the app on multiple machines behind a load-balancer. You don't care which machine serves a request because all the data, the state, lives in the database. This web app is an example of a stateless system. When you want to upgrade a stateless system that serves live traffic, you can start up some new servers running the new version of your app, then switch your load-balancer to point to the new servers, then shut down the old servers. If you want to be fancy, you might even do a controlled ramp with an A/B testing system from the old code to the new code to make sure there aren't any problems.

But what if you want to upgrade the database? You could do something similar where you bring up a second database, replicate your data over, change your app to start writing to both databases, then change your app to start reading from the new database and then decommission the old one. But in practice that's a very expensive proposition in terms of both dollars and time, especially when you're talking about a large distributed database in a multi-tenant environment where many apps are using the same database.

What we do with Venice is support incremental upgrades of the cluster. Venice runs on a cluster of storage nodes, and given that our database is both sharded and replicated, we can bring down one storage node, upgrade it to the new version of the code, and bring it back online without affecting the usability of the system. Doing this one node at a time with each storage node in the cluster lets us do a zero-downtime upgrade.

But it isn't that simple. In order to support this style of upgrade, different versions of the storage nodes must be able to talk to each other. They must communicate over a protocol that is forward and backward compatible. That means specifying a versioned protocol so that when the protocol needs to change, the nodes can be version-aware and talk across different versions. We also need to make sure that when a storage node comes back online, we have a way to populate it with any writes that it missed.
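
One simple way to picture version awareness (this is a hypothetical sketch, not Venice's actual wire protocol) is a negotiation step where two nodes agree on the highest protocol version they both understand, so a newly upgraded node can keep talking to peers still running the old build.

```java
/** Toy version negotiation: two nodes agree on the highest protocol version both of them support. */
public class ProtocolVersioning {

    static final int MIN_SUPPORTED = 1;
    static final int MAX_SUPPORTED = 3;  // this build of the storage node speaks versions 1 through 3

    /** Picks the wire version to use with a peer that advertises its own supported range. */
    static int negotiate(int peerMin, int peerMax) {
        int version = Math.min(MAX_SUPPORTED, peerMax);
        if (version < Math.max(MIN_SUPPORTED, peerMin)) {
            throw new IllegalStateException("No protocol version in common with peer");
        }
        return version;
    }

    public static void main(String[] args) {
        // An upgraded node (up to v3) talking to a node still on the old build (up to v2) settles on v2.
        System.out.println("Negotiated version: " + negotiate(1, 2));  // 2
    }
}
```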

We can take down a live node for maintenance because we have an extra replica of each partition on that storage node. If we only have those two replicas, then while one is down for maintenance we risk data loss should the node hosting the second replica fail. This is why so many distributed systems default to a replication factor of three; we can take one node down for maintenance, suffer a concurrent failure, and still have an extra replica to serve from. We refer to this as having an upgrade domain and a failure domain.

Operability

Metrics
Let's go back to our previous example of a web application running with a database backend. How do you know if your database is operating as expected? You can log into your web application and see if it behaves correctly—maybe see if your profile loads? This might be good enough for your application, but now imagine you're responsible for the database being used by hundreds of applications with hundreds of millions of users. You're going to need some metrics.

Metrics are one of those features that you must have in order to operate a production system. You must be able to look at a chart and know if the system is behaving as expected, or if reads still look good but writes have suddenly stopped. Maybe everything looks pretty good, but one machine in your cluster has stopped reporting. The teams that rely on our databases also rely on us to help them understand when something isn't working well.

Once you've decided to start instrumenting your system, hopefully you have a good set of tools. At LinkedIn, we use the Autometrics system that we have built. This handles the collection and processing of our instrumented metric events. But we still need to decide what to measure so that we can monitor the system and track down issues without being buried in metrics that we never look at. Metrics are a balance, and usually one that we don't get right at first. Over time, however, while operating a system, we are able to identify the holes in our metrics collection and get to a really good place.
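
As a heavily simplified illustration of the kind of instrumentation involved, here is a toy metrics registry that tracks request counts and average latency per operation. It stands in for the role Autometrics plays for us, but it is not the Autometrics API; every name below is invented for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Toy metrics registry: counts and total latency per named operation (e.g. "read", "write"). */
public class SimpleMetrics {

    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> latencyMicros = new ConcurrentHashMap<>();

    /** Records one completed operation and how long it took. */
    void record(String operation, long elapsedMicros) {
        counts.computeIfAbsent(operation, k -> new LongAdder()).increment();
        latencyMicros.computeIfAbsent(operation, k -> new LongAdder()).add(elapsedMicros);
    }

    /** Average latency in microseconds, or 0 if the operation was never recorded. */
    long averageLatencyMicros(String operation) {
        long n = counts.getOrDefault(operation, new LongAdder()).sum();
        return n == 0 ? 0 : latencyMicros.get(operation).sum() / n;
    }

    public static void main(String[] args) {
        SimpleMetrics metrics = new SimpleMetrics();
        metrics.record("read", 250);
        metrics.record("read", 350);
        metrics.record("write", 900);
        System.out.println("avg read latency: " + metrics.averageLatencyMicros("read") + "us");  // 300us
    }
}
```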

Quotas
Venice, like Voldemort before it, is a multi-tenant system. We have many different teams at LinkedIn that store data in these distributed databases. The responsibility is on us as database developers to make sure that no one team using the system is able to adversely affect the usability for other teams. We've talked about metrics, which we can use to monitor how different teams are using the system, but quotas let us make sure we have the capacity to support everyone and ensure that a team has the resources we promise.

Let's say from testing we know that our cluster can support one million queries per second (QPS) and we have three stores that each need to handle 200,000 QPS for the applications they serve. By allocating this capacity to each store, we know that we can support more stores—up to another 400,000 QPS. But what happens if one of the applications has a bug, where it makes 10 requests each time that it should have made just one request? Now our database is being asked to serve 2.4 million QPS. It will fail under that load, and the bug-free applications reading from the other stores will be very negatively impacted.

The quotas let us ensure we have the right capacity, but we can also use them to enforce limits on each store. By refusing to serve more than 200,000 QPS from the affected store, we will drop some queries but also protect the other teams relying on our database for their applications.
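
Here is a minimal sketch of what per-store enforcement could look like: a counter per one-second window that rejects requests beyond the store's allocation. Venice's actual quota enforcement is more involved than this; the class, its fields, and the window scheme are all simplifications for the sake of the example.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy per-store quota: refuse requests beyond a fixed QPS budget, counted per one-second window. */
public class StoreQuota {

    private final long maxQps;
    private final AtomicLong windowStartMillis = new AtomicLong(System.currentTimeMillis());
    private final AtomicLong servedInWindow = new AtomicLong();

    StoreQuota(long maxQps) {
        this.maxQps = maxQps;
    }

    /** Returns true if the request is within quota, false if it should be rejected (e.g. with HTTP 429). */
    boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long windowStart = windowStartMillis.get();
        if (now - windowStart >= 1000 && windowStartMillis.compareAndSet(windowStart, now)) {
            servedInWindow.set(0);  // start a new one-second window
        }
        return servedInWindow.incrementAndGet() <= maxQps;
    }

    public static void main(String[] args) {
        StoreQuota quota = new StoreQuota(200_000);  // the 200,000 QPS allocation from the example above
        System.out.println("first request allowed: " + quota.tryAcquire());
    }
}
```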

Administrative tooling
The last part of operability is the tool or tools for changing the system's state. An administrator or SRE needs to be able to query the state of the system (what stores are in the cluster?), and modify that state (create or delete a store). For Venice, that state is complex and includes schemas and quotas for all of our stores. It's all too easy for the administrative tooling to be thrown together as an afterthought, but it is a fully-featured tool in its own right that needs to be used by people, and thus needs a well-designed user interface, a good deployment strategy, and comprehensive documentation. All of the discussion below about security applies to the administrative tooling as well.

Security

Security is a very complicated topic, so much like the other sections in this article, I will fail to give it the depth it deserves. Any conversation around security needs to start with an understanding of what you want to guard against. Most of the security work we focus on at LinkedIn is around protecting the privacy of our members' data.

Encryption on the wire
Venice and Voldemort operate in multiple data centers, and data needs to flow between them. At LinkedIn (similar to other large internet companies), we have a requirement that cross-data-center communication should be encrypted when sensitive information is in transit. We leverage TLS for encryption in transit.

Authentication and authorization
Encryption by itself isn't really enough for secure communication. To guarantee secure communication between two computers, you need to be sure that the computer on the other end of the connection is really who it claims to be. In the case of an application reading from a Venice server, the application needs to be sure it is talking to a real Venice server. This is called server authentication, and TLS gives us that protection. TLS is complicated and I won't get into all the details here. In short, the server presents a cryptographically signed certificate to prove that it is a real Venice server and this certificate is used to establish a secure connection. TLS also allows the client to present a certificate so the server can authenticate the client and verify that the client is authorized to access the requested resource.
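
For readers who haven't set up mutual TLS before, here is a rough sketch of the server side in plain Java. The key store and trust store paths, the password, and the port are placeholders, and this is not Venice's actual configuration; it only shows the general shape of presenting a server certificate and requiring one from the client.

```java
import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLServerSocket;
import javax.net.ssl.TrustManagerFactory;

/** Minimal mutual-TLS server setup: present our certificate and require one from the client. */
public class MutualTlsServer {

    public static void main(String[] args) throws Exception {
        char[] password = "changeit".toCharArray();  // placeholder password

        // Key store holds this server's certificate and private key (path is hypothetical).
        KeyStore keyStore = KeyStore.getInstance("PKCS12");
        keyStore.load(new FileInputStream("server-identity.p12"), password);
        KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(keyStore, password);

        // Trust store holds the certificate authorities we accept for client authentication.
        KeyStore trustStore = KeyStore.getInstance("PKCS12");
        trustStore.load(new FileInputStream("trusted-cas.p12"), password);
        TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);

        SSLContext context = SSLContext.getInstance("TLS");
        context.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);

        SSLServerSocket serverSocket =
                (SSLServerSocket) context.getServerSocketFactory().createServerSocket(8443);
        serverSocket.setNeedClientAuth(true);  // require a client certificate (mutual TLS)
        // serverSocket.accept() would now hand back connections from authenticated clients only.
    }
}
```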

In closing

Building databases is fun, particularly distributed databases, with all of the concurrency challenges and the power that comes with scalability. Beyond just building something that works, there are a lot of other features that go into building a system that is ready to handle real-world traffic. These aren't always the most glamorous parts of the system, but they are just as critical to success as a solid understanding of distributed consensus algorithms like Paxos or Raft.

As of February, we have started rolling Venice out into production! There's still a lot of work ahead of us, but none of what we've built so far would have been possible without all the excellent work from Felix GV, Gaojie Liu, Arun Thirupathi, Sidian Wu, and Yan Yan on the development team; all the support from Akhil Ahuja, Greg Banks, Steven Dolan, Tofig Suleymanov, Brendan Harris, and Warren Turkal on our site-reliability team; and the leadership from Siddharth Singh and Ivo Dimitrov.