Evolution of Couchbase at LinkedIn
May 1, 2018
Author's note: My colleague, Michael Kehoe, wrote a blog post on the Couchbase Ecosystem at LinkedIn. I encourage you to read it if you haven’t already! The following aims to provide a deeper perspective on Couchbase as it evolves into the standard caching platform at LinkedIn, written by a Couchbase SRE who has been working on Couchbase at LinkedIn since 2013.
Couchbase is a highly scalable, distributed data store that plays a critical role in LinkedIn’s caching systems. Couchbase was first adopted at LinkedIn in 2012, and it now handles over 10 million queries per second with over 300 clusters in our production, staging, and corporate environments. Couchbase’s replication mechanisms and high performance have enabled us to use Couchbase for a number of mission-critical use cases at LinkedIn.
Over the years, as our Couchbase usage continued to grow, we had to ensure that it scaled operationally, so we developed an entire ecosystem around Couchbase. We started our journey using Couchbase Server 1.8.1 Community Edition, and today we are using Couchbase Server 5.1.0 Enterprise Edition. This blog post aims to:
- Tell the story of how our use of Couchbase and our support model evolved over the years.
- Describe the challenges of scaling and operating Couchbase at our scale and what we’re doing to solve them.
- Offer a glimpse of where we want to take Couchbase in the future.
The Memcached days
Memcached was first introduced to LinkedIn in the early 2010s as a fast, in-memory, distributed caching solution when we needed to scale our source-of-truth data stores to handle increased traffic. It worked well for what it provided:
- Single-digit millisecond GETs and SETs for applications that needed caching in front of their source-of-truth data stores.
- Provisioning a cluster was simple, and the process of getting started was very fast.
However, as the number of Memcached clusters grew with the number of applications that started using it, we quickly ran into operability issues, some of the main ones being:
- Lack of persistence. Memcached was an in-memory store. Although fast response times were Memcached’s greatest strength, the in-memory design was brutal when we needed to restart the Memcached process for maintenance, as we would lose the entire contents of the cache.
- Breaking the hash ring. Resizing clusters (i.e. expanding the cluster with more nodes) was impossible without breaking the hash ring and invalidating portions of the cache. Consistent hashing algorithms helped, but didn’t solve the issue completely.
- Replacing hosts. Our hashing algorithm was based on the hostname of the nodes in the cluster, so any time we would replace a host with a different hostname, it would not only invalidate the cache on the replaced node, but also would disrupt the hash ring.
- Lack of cache-copying functionality. Say we wanted to build out a new datacenter and copy the contents of the cache. This was not simple. We ended up building tooling that brute-forced keys to populate the new cache, but this was neither ideal nor elegant.
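The hash-ring problem above can be illustrated with a toy sketch (not LinkedIn’s actual client code; names and numbers are illustrative). With naive modulo hashing over the host list, adding a single node remaps most keys at once:

```python
import hashlib

def node_for_key(key, hostnames):
    # Naive modulo hashing over the host list; any change to the list
    # remaps most keys, invalidating large portions of the cache.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return hostnames[digest % len(hostnames)]

def moved_fraction(keys, before, after):
    # Fraction of keys that land on a different node after a topology change.
    moved = sum(node_for_key(k, before) != node_for_key(k, after) for k in keys)
    return moved / len(keys)

keys = [f"member:{i}" for i in range(10_000)]
four_nodes = ["cache-a", "cache-b", "cache-c", "cache-d"]
five_nodes = four_nodes + ["cache-e"]  # expand the cluster by one node

frac = moved_fraction(keys, four_nodes, five_nodes)
# With modulo hashing, roughly 4 out of 5 keys move when growing from
# 4 to 5 nodes. Consistent hashing limits the churn to roughly 1 in 5,
# which helps, but a host replacement still invalidates that host's keys.
```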
You can see why Memcached was so easily adopted at LinkedIn (it provided quick and easy wins), but also why it was so annoying to work with at scale. We started to look for alternative solutions in 2012.
In 2012, Couchbase grabbed our attention because it advertised itself as a drop-in Memcached replacement that also provided nice features like persistence, replicas, and cluster resizability. I was part of one of the first teams that started to use Couchbase more widely at LinkedIn, and that’s how I began my adventure with Couchbase.
Initially we connected to the Couchbase Server with our existing Memcached client through Moxi (which is a Memcached-protocol-to-Couchbase-protocol translator). However, we quickly realized that there were numerous stability issues with Moxi, and even Couchbase recommended that we connect to the Couchbase Server using the native SDK.
This is where our LinkedIn wrapper around the Couchbase Java SDK was born. Its initial goal was to implement the same interface as our existing Memcached client, so that all clients needed to do was swap out the underlying implementation and they would be good to go.
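As a sketch of the idea (hypothetical names; the real wrapper is an internal Java library), the swap works because both clients implement one shared interface, so application call sites never change:

```python
class CacheClient:
    """Hypothetical shared cache interface, mirroring the idea of a
    common client API that multiple backends can implement."""
    def get(self, key): raise NotImplementedError
    def set(self, key, value): raise NotImplementedError

class MemcachedClient(CacheClient):
    def __init__(self):
        self._conn = {}  # stand-in for a real Memcached connection
    def get(self, key):
        return self._conn.get(key)
    def set(self, key, value):
        self._conn[key] = value

class CouchbaseClient(CacheClient):
    # Same interface, Couchbase-backed implementation; applications swap
    # the underlying client without touching their own code.
    def __init__(self):
        self._bucket = {}  # stand-in for a Couchbase SDK connection
    def get(self, key):
        return self._bucket.get(key)
    def set(self, key, value):
        self._bucket[key] = value

def warm_profile_cache(cache: CacheClient):
    # Application code depends only on the interface, not the backend.
    cache.set("member:42", {"name": "Jane"})
    return cache.get("member:42")
```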
Considering how easy it was to switch from Memcached to Couchbase, adoption exploded at LinkedIn. At one point, we had over 2,000 hosts running Couchbase in production with over 300 unique clusters. Teams already using the existing Memcached client would swap out the underlying cache implementation, work with their SREs to get a cluster provisioned, and cut over to Couchbase.
The Couchbase Virtual Team
Because so many teams were independently using Couchbase, a council was spun up within LinkedIn to meet regularly to set best practices, standardize processes, and share learnings. The Couchbase Virtual Team (or CBVT for short) was born. Many great things came out of this team, such as automation for cluster build-outs, designating a standard to store cluster metadata internally, and tools to further integrate with our LinkedIn ecosystem.
As Michael Kehoe mentioned, we designed our cluster management and orchestration around Salt for remote execution and state management, and Range for storing cluster metadata. We had several other tools, like amf-cbstats, an agent that sat on Couchbase cluster nodes, queried metrics from Couchbase Server, and emitted them to our in-house monitoring solution, inGraphs. We provided tools so that clients could easily load items from offline workflows like Hadoop into Couchbase, and we even had a cluster overview webapp called Macys that gave us a bird’s-eye view of all clusters at LinkedIn. We also made several enhancements to our LinkedIn Java SDK wrapper, adding features like client-side encryption and compression using flags.
Limitations of the virtual team model
The Couchbase Virtual Team was actually the very first “virtual team” at LinkedIn, and our success led to many other virtual teams or working groups being internally spun up to tackle similar challenges. To name a few, we had a Salt virtual team and an ELK virtual team, each of which enjoyed varying degrees of success. The Couchbase Virtual Team enjoyed its model of operation for several years until the cracks finally started to show, namely:
- The number of clusters exploded. Since each individual team was in charge of running and operating their own clusters using the tooling provided by the Couchbase Virtual Team, we had a lot of under-utilized hardware. At one point, we had over 300 clusters in production spanning over 2,000 hosts.
- Not all teams were interested in operating Couchbase. Some teams had individuals who were very involved in operating Couchbase (these people usually ended up becoming core members of the Couchbase Virtual Team), but most teams just wanted a caching solution that “just worked” out of the box. Because of this lack of interest, the number of site outages involving Couchbase started to increase, not because Couchbase itself was unstable, but because many of these clusters were not maintained according to best practices. For example, we saw numerous instances where a node failed and was auto-failovered, which smoothly promoted its replicas, but then no one ever looked at the cluster again until a second node went bad and there was user impact.
- No standardized/official caching solution. It was up to each team to opt into using Couchbase. It wasn’t the “official” caching solution at LinkedIn and there were a few competing services for caching internally. It wasn’t always clear when people should use what technology, and it led to people using the wrong solution for their application.
- Difficult to make changes to all Couchbase clusters. Since each team managed their own clusters, it was very difficult to get teams to upgrade to a newer version of Couchbase Server or to adopt a new way of doing things.
- Difficult to get larger projects completed. Because the virtual team model was volunteer-based, where individuals would chip in when they could find time, it was difficult to complete larger projects that required dedicated time from contributors.
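The failed-over-but-forgotten scenario above boils down to a simple invariant that nobody was checking fleet-wide. A minimal sketch of such a check (hypothetical data shapes, not our actual monitoring code):

```python
def degraded_clusters(clusters):
    # clusters maps cluster name -> (configured node count, healthy node count).
    # A cluster running below its configured size has had a failover that
    # nobody followed up on; one more bad node risks user impact.
    return sorted(name for name, (configured, healthy) in clusters.items()
                  if healthy < configured)

fleet = {
    "profile-cache": (6, 6),
    "feed-cache": (8, 7),   # one node auto-failovered, never re-added
    "jobs-cache": (4, 4),
}
```

With hundreds of independently-owned clusters, even a check this simple only helps if someone owns running it and acting on it, which is exactly what the virtual team model lacked.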
Transition to a dedicated team model
In 2017, a dedicated team was finally funded, and I quickly jumped at the chance to work on Couchbase full-time. We were officially called the Caching as a Service team, or the CaaS team, but most people just referred to us as the Couchbase SRE team.
We had three main charters:
- Centralize management. We would own and operate centralized Couchbase clusters and offer Couchbase as a service for any team to use. We also purchased Enterprise Edition, so that we could run a properly supported version of Couchbase in production and also get access to the Couchbase support team. We would migrate all existing team-owned Community Edition clusters onto our platform.
- Enhance security. We would integrate with LinkedIn’s existing security libraries and use certificate-based authentication for access to Couchbase buckets. We actually worked with Couchbase for this feature, and certificate-based (X.509) authentication was added into Couchbase Server 5.0.
- Improve cost to serve. We would decrease our hardware footprint by better packing buckets into multitenant clusters and use SSDs (instead of additional nodes) to further reduce cost.
The team hit the ground running in April 2017, and since then, Ben Weir, Usha Kuppuswamy, Todd Hendricks, Subhas Sinha, and myself, led by Hardik Kheskani, have advanced Couchbase at LinkedIn significantly. Notably:
- We’ve migrated more than 50 different use cases from legacy Community Edition Couchbase clusters to our CaaS platform. We’ve migrated more than 10 different use cases from Memcached (yes, we still had a few stragglers) to our platform. And during all of this, we also on-boarded more than 15 different new use cases that had not used Couchbase before.
- We’ve integrated Couchbase more deeply with our deployment system, properly utilizing our in-house topology, deployment system, and configuration management. In the past, Couchbase was handled as a special snowflake using tooling like Salt and Range.
- We’ve completely automated upgrades. For the longest time, the majority of our Couchbase clusters were stuck on Couchbase Server 2.2.0 Community Edition because we lacked the automation to upgrade clusters safely. We’ve invested in our tooling so that we can note the desired Couchbase version in our configs, and the tooling will gracefully fail each node out of the cluster, upgrade Couchbase Server, and rebalance the node back into the cluster, completely without human intervention.
- We have widespread use of certificate-based authentication. This not only required coordination with the folks over at Couchbase to properly support certificate-based authentication on both the server and the respective SDKs, but it also required work on our side to integrate with our in-house certificate management system.
- We also launched a LinkedIn wrapper around the Python SDK. As more and more teams are building Python apps that use Couchbase, we thought it was time for a wrapper library to exist in Python as well. It provides niceties like client-side metrics, integration with our configuration management system, client-side encryption, and compatibility with the flags used by our Java wrapper.
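The upgrade automation described above can be sketched as a simple orchestration loop. This is a minimal stand-in with hypothetical classes and method names; the real tooling drives Couchbase’s management APIs and handles data movement, health checks, and failure recovery:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    version: str

class Cluster:
    # Minimal stand-in for a Couchbase cluster's management operations.
    def __init__(self, nodes):
        self.nodes = nodes
        self.active = {n.name for n in nodes}

    def graceful_failover(self, node):
        self.active.discard(node.name)  # replicas are promoted, traffic drains

    def add_back_and_rebalance(self, node):
        self.active.add(node.name)      # node rejoins; data rebalances onto it

def rolling_upgrade(cluster, target_version):
    # One node at a time: fail out, upgrade the server binaries, rebalance
    # back in. The cluster keeps serving traffic throughout.
    for node in cluster.nodes:
        if node.version == target_version:
            continue
        cluster.graceful_failover(node)
        node.version = target_version   # stand-in for the package upgrade
        cluster.add_back_and_rebalance(node)
```

The key property is that the loop is idempotent and resumable: nodes already at the target version are skipped, so a run interrupted midway can simply be restarted.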
One of our biggest challenges with spinning up the dedicated team was that we suddenly had a flood of requests from people wanting to get onto our platform ASAP for a large variety of use cases. This was challenging for a couple of reasons:
- We were already heads down trying to migrate existing Community Edition buckets to our platform.
- People basically read “Couchbase is being offered as a service!” and thought, “I want to use N1QL or XDCR or Views!” The problem with this was that Couchbase was brought into LinkedIn for caching, key-value use cases, as there were already other technologies internally for other types of workloads. We’ve been running Couchbase as a cache for years, and we knew caching was where it excelled. These other features of Couchbase are great and we want to support them one day, but for now, we need to focus on what we have been good at while we finish becoming the centralized owners of all Couchbase clusters at LinkedIn.
We know that a large part of this is a communication issue, and we’ve been working hard to bring a consistent and firm message across the company as to exactly what services our team provides.
The evolution of Couchbase at LinkedIn makes a lot of sense when you look at where it has been over the years. Each step of the process was necessary, and LinkedIn as a company learned a lot about using Couchbase at scale and in high-performance scenarios. We still have many things that we want to do with Couchbase here, so there’s no shortage of work. Some aspects we want to focus on are:
- Improving our multi-tenant solution and reaching better resource utilization.
- Working towards making our platform completely self-service. We already have a self-service tool internally called Nuage for provisioning data stores, and we are planning to integrate Couchbase so that clients can focus on getting their Couchbase buckets provisioned and we can focus on maintaining and automating infrastructure.
- Ultimately, making the cache invisible to our clients by partnering with source-of-truth platforms to build tighter integrations (i.e. out-of-the-box invisible caching).
At the end of the day, Couchbase is a caching solution that ended up working well for us, and we’re excited to work with Couchbase on a mutually beneficial product roadmap.