Rest.li 2.x and a Protocol Upgrade Story

Karan Parikh

March 26, 2015

A few months ago we announced that LinkedIn had hit its Rest.li moment. Today we are excited to share another major milestone: the release of the next major version of Rest.li! Pegasus 2.2.5 has been released to Maven Central and is available for use. The source is available on GitHub. Give it a try and let us know what you think.

This post outlines the major changes in this new version of Rest.li, and how we upgraded around 100 services to the new version of the Rest.li protocol.

Why 2.x?

The Rest.li team had numerous changes we wanted to make to our framework. However, a majority of these changes couldn't be made without breaking backwards compatibility. These changes include:

Changing our URL format to make the URLs more compact and easy to read and parse.
Changing our developer API to make it more consistent.
Removing code that has been marked as deprecated to simplify our codebase.

Shiny new URLs

The most significant change in Rest.li 2.x is the introduction of a new URL format. A complete description of the new format can be found on the Rest.li protocol documentation page. Let us look at a small example to see how the URLs have changed. Suppose we have a resource Foo with the base URL /foo and the following key:

https://gist.github.com/karanparikh/2c12c637d71e3f5265f5.js

Here is what the URL for an HTTP GET request for the above key looks like in each protocol:

https://gist.github.com/karanparikh/134d90ce6d4045eeb3c9.js

Our main motivation for this change was to make the URLs more compact and easier to read and understand. We also wanted to remove the inconsistencies between how compound keys and complex keys are represented in the URL. Lastly, we wanted to ensure that empty lists, maps, and strings can be correctly represented in the URL.

The Migration

Once the team had decided to release a new major version and settled on the new URL and wire format, we turned to the problem of rolling out such an upgrade across all Rest.li services in LinkedIn's data centers. This is a non-trivial problem because the Rest.li 2.0 protocol is backwards incompatible with the 1.0 protocol. The problem is further compounded by:

The large number of services in our microservice based architecture
Complex call graphs involving dozens of services

In particular, the problem we were trying to solve was that of a client running the new version of the Rest.li protocol communicating with a server that only understands the old version. If the client sent a request in the new format, it would fail because the server would not be able to understand the new protocol. (A server running the new protocol can understand both versions.)

To solve this problem we needed to do two things:

Figure out what version of the Rest.li protocol is supported on the server prior to sending the request.
Generate the URL for the request only after the protocol version is known.

Hi $service. Meet $client

In order to figure out what version of the Rest.li protocol is running on the server prior to sending the request out over the network we built a protocol handshake mechanism. The highest protocol version that each service is capable of understanding is stored in D2. The Rest.li client fetches this data and decides on the request protocol version by using a combination of the remote protocol version and a set of rules.
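The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the actual Rest.li client code: the names `Policy`, `ProtocolHandshake`, and `chooseMajorVersion` are hypothetical, versions are reduced to a major number, and the single clamping rule stands in for the real "set of rules", which this post does not spell out.

```java
// Hypothetical sketch of client-side protocol selection; names are illustrative only.
enum Policy {
    NEGOTIATE,   // use the version announced by the service (the handshake)
    FORCE_NEXT   // ignore the announcement and always use the next version
}

final class ProtocolHandshake {
    static final int DEFAULT_MAJOR = 1; // protocol 1.0
    static final int NEXT_MAJOR = 2;    // protocol 2.0

    // announcedMajor is the highest major version the service advertises,
    // or null when the service has not announced anything.
    static int chooseMajorVersion(Integer announcedMajor, Policy policy) {
        if (policy == Policy.FORCE_NEXT) {
            return NEXT_MAJOR;
        }
        if (announcedMajor == null) {
            return DEFAULT_MAJOR; // assume an old server
        }
        // Never go above what this client understands or below the default.
        return Math.max(DEFAULT_MAJOR, Math.min(announcedMajor, NEXT_MAJOR));
    }
}
```

Under these assumptions, a service with no announced version receives 1.0 requests, and a service announcing 2.0 receives 2.0 requests without any change to the calling code.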

In order to generate URLs only after the protocol version is known we had to significantly change the internals of our Rest.li code. Request objects now simply hold the data for the URL but not the URL itself. Once the protocol version is known in the Rest.li client we then generate a URL using this stored data.
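A small sketch may make the idea of deferred URL generation more concrete. The class `GetRequestSketch` and its `toUrl` method are hypothetical and are not part of the Rest.li API. The two key renderings are simplified approximations of the 1.0 and 2.0 compound key styles and omit escaping entirely; the protocol documentation remains the authority on the exact encoding.

```java
// Hypothetical sketch: the request stores key data and renders a URL on demand.
final class GetRequestSketch {
    private final String baseUrl;      // for example "/foo"
    private final String[][] keyParts; // ordered {name, value} pairs of a compound key

    GetRequestSketch(String baseUrl, String[][] keyParts) {
        this.baseUrl = baseUrl;
        this.keyParts = keyParts;
    }

    // Called only once the protocol version for the target service is known.
    String toUrl(int majorVersion) {
        boolean v2 = majorVersion >= 2;
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < keyParts.length; i++) {
            if (i > 0) {
                key.append(v2 ? "," : "&"); // separator between key parts
            }
            key.append(keyParts[i][0])
               .append(v2 ? ":" : "=")      // separator between name and value
               .append(keyParts[i][1]);
        }
        // Simplified: the 2.x style wraps the key in parentheses.
        return v2 ? baseUrl + "/(" + key + ")" : baseUrl + "/" + key;
    }
}
```

Because the request object carries only data, the same instance can be rendered for either protocol, which is what lets the client postpone the decision until the handshake has completed.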

The rollout

With the above handshake mechanism and code changes in place, we decided to do a two-stage rollout: a closed beta that programmatically forces the next version of the protocol to be used, followed by a controlled rollout (that uses the handshake mechanism) to approximately 100 services.

Part 1: The closed beta

For the closed beta we programmatically forced Rest.li 2.0 protocol communication between clients and servers. This bypasses the handshake mechanism on the client side and always uses the next protocol version (which in this case was 2.0) when making requests. This can be very dangerous in a production environment, as there is no guarantee that the server is running the appropriate version of Rest.li! However, we made sure that all servers involved had been upgraded to a version of Rest.li that understood the 2.0 protocol, which made this a safe operation.

Our goal for the closed beta was to run 2.0 protocol traffic for services that are high traffic and exercise almost all the features of Rest.li. Even though we had plenty of unit and integration tests in our codebase for the new protocol we wanted to ensure that everything worked perfectly by testing it with real site traffic.

For the beta we chose to send Rest.li 2.0 protocol traffic between the following services:

All communication between the homepage feed frontend and backend services.
All communication between the profile page and the graph services.

We used our internal A/B testing framework to control the protocol rollout to steadily increasing percentages of traffic. After running it at 100% site traffic and seeing no issues for several weeks we decided to move on to part 2 of the rollout.

Part 2: The controlled rollout

Part 2 of the rollout leveraged the handshake mechanism to turn on Rest.li 2.0 traffic between services. Each week a small batch of services was upgraded. To upgrade the communication protocol for all clients of a service, all we needed to do was modify its configuration in D2 to include the highest version of the Rest.li protocol supported by that service. Once the configuration change was made, it was automatically pushed out to all clients of the service by Apache ZooKeeper. Clients of the service running a new build of Rest.li would use the handshake mechanism and start sending Rest.li 2.0 traffic. Older clients not supporting Rest.li 2.0 would simply continue to send 1.0 traffic to the service. We upgraded around 100 services over the course of Q4 2014 using this mechanism.

Conclusion and future plans

We currently have around 105 services serving Rest.li 2.0 traffic in our data centers. This migration was done without active involvement of the service owners and in a short amount of time. Several pieces of LinkedIn's infrastructure like D2, InGraphs, service call events, our A/B testing service, etc. made this a smooth transition.

We plan on releasing a build of Rest.li that makes 2.0 the default communication protocol chosen by clients (1.0 is still the default) once all LinkedIn services are running a build of Rest.li that fully supports the 2.0 protocol. This should happen within the next few months.

Acknowledgments

I would like to thank Goksel Genc, David Hoa, and Steven Ihde for reviewing this post.