Leveraging SaltStack to Scale Couchbase

Our caching layer at LinkedIn is responsible for serving massive amounts of data at low latencies, making it an important part of ensuring a quality experience for our members. One of the key components of that caching layer is Couchbase, which accomplishes this feat in a robust and resilient way. Many engineers at LinkedIn leverage Couchbase to meet the needs of a demanding technology stack. As an example, here is a Couchbase cluster that serves one of the most critical, high-performance backend services at LinkedIn.

Figure 1: Aggregate QPS of Couchbase Cluster

Figure 2: Average Response Time of Client GET Calls

This cluster sustains a combined workload of roughly 8,000 QPS and returns responses in about 1 millisecond on average. Clients of this Couchbase cluster receive data with impressive speed, an advantage that pays dividends for the downstream services that depend on this data being accessed quickly. Many of our services require 95th percentile latencies on the order of milliseconds.

With more than 300 clusters running Couchbase, and with some clusters handling over 1 million QPS, the need for automation and tooling to integrate Couchbase into our ecosystem became pressing. We needed the ability to build and scale both new and existing clusters quickly and reliably.

We chose to accomplish the buildout and scaling of Couchbase with SaltStack, an open-source automation and orchestration platform written in Python. At LinkedIn, we run SaltStack on all of our production hosts, which amounts to tens of thousands of nodes. SaltStack provides the following features:

  • Configuration Management
  • State Definition and Enforcement
  • Orchestration and Remote-command Execution
  • Efficiency and Scalability, via a high-speed communication bus

To better explain how we use SaltStack at LinkedIn, let’s start by defining some Salt-specific terminology:

  • A Salt Master is a central host that manages hosts that run agents, called Salt Minions.
  • Pillars and Grains are Salt’s two sources of configuration data: Pillars hold data defined on the master and delivered to targeted minions, while Grains hold metadata describing each minion host.
  • A Salt Runner is a convenience application executed with the salt-run command on the Salt Master.
  • An Execution Module is similar to a Salt Runner, except it is executed on the Minion host.
  • Range is a distributed metadata store that contains information about clusters of hosts.

Our automation for Couchbase is essentially a combination of a runner and an execution module that together contain the necessary logic to build, deploy, automate, and scale our Couchbase infrastructure. Our runner allows the user to perform the following operations (a minimal sketch of such a runner module follows the list):

  • Setup Cluster: Build a Couchbase cluster (including buckets) with provided metadata
  • Expand Cluster: Increase Couchbase cluster size and membership
  • Reduce Cluster: Decrease Couchbase cluster size and membership
  • Uninstall: Completely remove Couchbase and all data
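
The runner itself isn't shown in this post. Purely as an illustration, a minimal skeleton of a runner module exposing these four operations might look like the following (the module name, function signatures, and return values are hypothetical):

    # couchbase.py -- hypothetical Salt Runner skeleton, dropped into the master's runners directory.
    # The function names mirror the operations listed above; the bodies are stubs.
    import logging

    log = logging.getLogger(__name__)


    def setup_cluster(range_cluster):
        '''Build a Couchbase cluster (including buckets) from the metadata for range_cluster.'''
        log.info('Setting up Couchbase cluster for %s', range_cluster)
        # 1. Resolve hosts and bucket definitions from Pillar/Range.
        # 2. Enforce the installation state on each host.
        # 3. Create buckets, then rebalance.
        return {'cluster': range_cluster, 'action': 'setup', 'result': True}


    def expand_cluster(range_cluster):
        '''Add hosts to an existing cluster and rebalance.'''
        return {'cluster': range_cluster, 'action': 'expand', 'result': True}


    def reduce_cluster(range_cluster):
        '''Remove hosts from an existing cluster and rebalance.'''
        return {'cluster': range_cluster, 'action': 'reduce', 'result': True}


    def uninstall(range_cluster):
        '''Completely remove Couchbase and all data from the cluster hosts.'''
        return {'cluster': range_cluster, 'action': 'uninstall', 'result': True}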

Executing our runner looks like executing any other Salt Runner; it requires only one argument, the range cluster:

Figure 3: Salt Runner Output for Setup Cluster
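
The runner's console output shown in the figure isn't reproduced here. For reference, kicking off setup_cluster from the Salt Master looks roughly like the following; the CLI form appears in the comment, and the programmatic form uses Salt's Python API (the runner name couchbase.setup_cluster and the range cluster name are illustrative):

    # CLI form, run on the Salt Master:
    #   salt-run couchbase.setup_cluster ifattah_couchbase_9999
    #
    # Rough programmatic equivalent:
    import salt.config
    import salt.runner

    opts = salt.config.master_config('/etc/salt/master')
    runner = salt.runner.RunnerClient(opts)

    # The range cluster is the runner's only required argument.
    result = runner.cmd('couchbase.setup_cluster', ['ifattah_couchbase_9999'])
    print(result)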

Pillars provide an interface for storing global data that can also be passed to minion hosts. The information that is transferred via the Pillar interface is guaranteed to be delivered only to the minions being targeted. We also make use of Salt’s GPG rendering system to encrypt/decrypt Pillar data that is sensitive or otherwise should not be stored in plaintext, such as the administrator Couchbase password. To help explain how our Pillar data is organized and how setup_cluster operates, I’ve included this example couchbase_pillar, which uses YAML formatting:

Figure 4: Sample Couchbase Pillar (ifattah_couchbase_9999.sls File)
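
The SLS file from the figure isn't reproduced here. By the time our automation uses it, the YAML has been rendered into a plain Python dictionary; a hypothetical couchbase_pillar might carry fields along these lines (every name and number below is made up for illustration):

    # Hypothetical rendered form of a couchbase_pillar (the real SLS file is YAML on the master).
    couchbase_pillar = {
        'couchbase': {
            'cluster_name': 'ifattah_couchbase_9999',
            'version': '3.0.1',                        # Couchbase version to install
            'admin_user': 'Administrator',
            # The admin password lives in a GPG-encrypted Pillar and is decrypted at render time.
            'host_range': '%ifattah_couchbase_9999',   # Range expression, or simply a list of hosts
            'buckets': [
                {'name': 'profiles', 'type': 'couchbase', 'ramsize_mb': 4096, 'replicas': 1},
                {'name': 'sessions', 'type': 'memcached', 'ramsize_mb': 1024},
            ],
        },
    }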

We also allow our Pillar interface to query Range for metadata that historically doesn’t exist in Salt, such as the list of hosts on which we will operate. You’ll notice that some metadata is duplicated in both the Pillar above and the Range files below. In our case, Range takes precedence over values specified in the Pillar. However, our automation also functions perfectly well without any Range metadata at all. For instance, you could replace host_range above with an explicit list of hosts and still build and install the same Couchbase cluster, buckets included.

Figure 5: Sample Range Metadata
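
The Range file from the figure isn't shown here either, but the precedence rule described above is simple to express: wherever Range supplies a value it wins, and we fall back to the Pillar otherwise. A sketch of that merge, using hypothetical keys that match the example Pillar above:

    def merge_metadata(pillar_data, range_data):
        '''Combine Pillar and Range metadata; Range values win when both define a key.'''
        merged = dict(pillar_data)
        merged.update({key: value for key, value in range_data.items() if value is not None})
        return merged

    # Example: Range overrides host_range with an explicit host list.
    pillar_meta = {'host_range': '%ifattah_couchbase_9999', 'buckets': ['profiles', 'sessions']}
    range_meta = {'host_range': ['cb-host1.example.com', 'cb-host2.example.com']}
    print(merge_metadata(pillar_meta, range_meta))
    # {'host_range': ['cb-host1.example.com', 'cb-host2.example.com'], 'buckets': ['profiles', 'sessions']}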

No explanation of how we use SaltStack would be complete without mentioning grains. Grains are metadata associated with the hosts running Salt Minions. Examples of the kind of data you might find stored in grains for a given host include OS version, kernel version, and total memory in the system. In our case, we use the ‘mem_total’ grain to calculate how much RAM to allocate to the Couchbase process. From that amount, Couchbase buckets are sized based on the client’s requirements and the use case. This also guarantees a consistent amount of RAM overhead for critical functionality like maintaining read/write disk queues in a timely manner.
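
As a rough illustration of that sizing logic (the 80% ceiling and the fixed overhead below are hypothetical, not our production numbers):

    def couchbase_ram_quota_mb(mem_total_mb, overhead_mb=4096, max_fraction=0.8):
        '''Derive a Couchbase RAM quota from the minion's mem_total grain (in MB),
        reserving headroom for the OS and for Couchbase's disk read/write queues.'''
        usable = min(int(mem_total_mb * max_fraction), mem_total_mb - overhead_mb)
        return max(usable, 0)

    # Inside an execution module, mem_total comes straight from the grain:
    #   quota = couchbase_ram_quota_mb(__grains__['mem_total'])
    print(couchbase_ram_quota_mb(65536))   # 52428 MB on a 64 GB host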

In another example, we set the grain ‘couchbase_cluster’ to ensure that the correct cluster’s admin password is being accessed. In this grain, we store the name of the SLS file whose contents are shown in our example couchbase_pillar above (see Fig. 4). When the Pillar is rendered, our sensitive Pillar data is only targeted at minions that have this grain set. In other words, only the Couchbase host being worked on will have access to this particular cluster’s admin password. When our runner is executed, we obtain the list of hosts from Range, set the grain value, and use it to retrieve the resultant Pillar data (which comes from a combination of rendering the Pillar and querying Range metadata).
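
A compressed sketch of that flow from the runner's side (the host and cluster names are illustrative, and error handling is omitted):

    import salt.client

    local = salt.client.LocalClient()
    host = 'cb-host1.example.com'
    cluster_sls = 'ifattah_couchbase_9999'

    # Tag the minion with the cluster it belongs to...
    local.cmd(host, 'grains.setval', ['couchbase_cluster', cluster_sls])

    # ...then render its Pillar; only minions carrying this grain can see the
    # GPG-protected admin password for this cluster.
    pillar = local.cmd(host, 'pillar.items')[host]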

In short, any metadata we require to build a Couchbase cluster can come from our Salt Pillar, our Range metadata, or a combination of both.

Salt’s state system is used to automate and enforce the deployment of systems. We use states to make sure all of our Couchbase clusters are built in a consistent way. While our Salt states are templatized for various reasons, I’ve included a simple, generic Salt state; enforcing it does the following on the host:

  • Creates local ‘Couchbase’ user and group
  • Creates location for Couchbase data to be stored
  • Installs RPM based on supplied version (a hard-coded version is shown below)
  • Checks that the service is running after installation
  • Not shown: Method for calculating how much RAM to allocate to Couchbase

Figure 6: Simplified Salt State for Couchbase Installation
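
The state file from the figure isn't reproduced here. As a rough approximation, here is what such a state could look like when written with Salt's pure-Python (#!py) renderer; our actual states are templatized YAML, and every ID, path, and version below is a placeholder:

    #!py
    # Hypothetical pure-Python rendering of the simplified Couchbase installation state.

    def run():
        return {
            'couchbase_group': {
                'group.present': [{'name': 'couchbase'}],
            },
            'couchbase_user': {
                'user.present': [
                    {'name': 'couchbase'},
                    {'require': [{'group': 'couchbase_group'}]},
                ],
            },
            'couchbase_data_dir': {
                'file.directory': [
                    {'name': '/export/content/couchbase/data'},   # placeholder data location
                    {'user': 'couchbase'},
                    {'group': 'couchbase'},
                    {'makedirs': True},
                ],
            },
            'couchbase_package': {
                'pkg.installed': [
                    {'name': 'couchbase-server'},
                    {'version': '3.0.1'},   # hard-coded version, as noted above
                ],
            },
            'couchbase_service': {
                'service.running': [
                    {'name': 'couchbase-server'},
                    {'enable': True},
                    {'require': [{'pkg': 'couchbase_package'}]},
                ],
            },
        }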

As our runner executes this state for the list of hosts provided via Range, we must also construct and execute a series of Couchbase commands to ensure the cluster is configured completely as described by our metadata. These operations include the following (a sketch of the bucket-creation step follows the list):

  • Create Couchbase (or memcache) buckets with given name and sizes
  • Provide SASL support and password (if specified)
  • Rebalance cluster after bucket creation (and poll for successful completion)
  • Create a read-only account for stats gathering
  • Enable auto-failover timeout (if a host doesn’t respond to a heartbeat within 120 seconds, it is marked as failed and a rebalance of the cluster occurs)
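
For the bucket-creation step above, here is a hedged sketch of how the command might be assembled and run on a node (names, sizes, and credentials are placeholders, and the exact couchbase-cli flags differ slightly across the Couchbase versions we support):

    import subprocess

    def create_bucket(admin_user, admin_password, name, ramsize_mb, replicas=1,
                      bucket_type='couchbase', sasl_password=None,
                      cli='/opt/couchbase/bin/couchbase-cli', host='localhost:8091'):
        '''Assemble and run a couchbase-cli bucket-create command against the local node.'''
        cmd = [
            cli, 'bucket-create',
            '-c', host,
            '-u', admin_user,
            '-p', admin_password,
            '--bucket={0}'.format(name),
            '--bucket-type={0}'.format(bucket_type),
            '--bucket-ramsize={0}'.format(ramsize_mb),
            '--bucket-replica={0}'.format(replicas),
        ]
        if sasl_password:
            cmd.append('--bucket-password={0}'.format(sasl_password))
        return subprocess.call(cmd)

    # In our execution module, the equivalent command is shelled out through Salt's cmd module.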

Most of the logic necessary to perform these post-installation commands exists in our runner. However, our module that is executed on the minion must construct all of the necessary couchbase-cli commands and REST API parameters that are needed to fully build a Couchbase cluster. We must also consider the nuances between different versions of Couchbase, since we support versions ranging from 2.2.0 to 3.0.1. In the case of expand_cluster and reduce_cluster, we automated extra checks to ensure the cluster (and the critical parts of the site) remain available.

  • Expanding Couchbase clusters is generally safe, but requires a calculation of how much memory to allocate to the Couchbase process.
  • Reducing clusters requires that we have enough overhead to sustain incoming traffic while the cluster is being reduced and cluster membership is changed.
  • In both cases, we validate that each host being added, as well as the remaining hosts, is healthy, all while the cluster continues to serve incoming requests. A sketch of such a health check appears below.
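
A simplified sketch of that validation against Couchbase's REST API (the /pools/default endpoint and the per-node "healthy" status are standard parts of the admin API; the host name and credentials are placeholders):

    import requests

    def cluster_is_healthy(host, admin_user, admin_password, port=8091):
        '''Return True only if every node the cluster reports is in the "healthy" state.'''
        resp = requests.get('http://{0}:{1}/pools/default'.format(host, port),
                            auth=(admin_user, admin_password), timeout=10)
        resp.raise_for_status()
        nodes = resp.json().get('nodes', [])
        return bool(nodes) and all(node.get('status') == 'healthy' for node in nodes)

    # We poll a check like this before, during, and after the rebalance and abort if it ever fails.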

Last but not least, I want to touch briefly on the execution path and event bus employed by Salt:

Figure 7: Execution Path Used by Salt

The event bus employed by Salt ensures the reliable creation and execution of Salt jobs on a set of hosts. As an example, Salt’s event bus plays an integral role in LinkedIn’s continuous delivery deployment model, as well as in the automatic remediation of alerts. As shown in Figure 7 (a rough sketch of tailing this event bus follows the numbered steps):

  1. The user executes a Salt command via the CLI. The command is inspected, and the primary generates a targeted list of minions expected to return results.

  2. If the command is valid, a worker thread is created to process the request. The command is then published on the publisher port, 4505.

  3. All minions receive the published command because each is listening on port 4505 on the primary. However, only minion hosts that match the targeting criteria from Step 1 will process the request.

    1. In a self-electing process, the minion host will process the incoming request only if it matches the targeting criteria. If there is a positive match, the minion will fork a process, enabling the incoming request to be processed asynchronously.

    2. The minion retrieves resources from the primary that are not already locally cached, such as SLS files, Pillar data and execution modules.

    3. The minion calls the Salt function and executes the request.

    4. The minion then sends the result to the local event bus and back to the Salt Master on port 4506.

  4. Once the returned data reaches the primary, a router/dealer pattern is used to allocate worker threads and prevent blocking, since a primary can have thousands of minions responding at the same time.

    1. The workers process each request from the minion and do any post-processing necessary.

    2. The publisher listens to the event bus to determine what to publish.

    3. Minions listening to the primary's event bus will also see the result of the event, since it is published on the publish port.

  5. The CLI from Step 1 will display the return data as it is received.
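
Purely as an illustration of the job traffic described in the steps above, something like the following can tail job events on the primary's bus (the socket directory and tag prefix are typical defaults, but treat the exact API as an approximation, since Salt's event interface has changed across releases):

    import salt.config
    import salt.utils.event

    opts = salt.config.master_config('/etc/salt/master')
    event_bus = salt.utils.event.get_event('master',
                                           sock_dir=opts['sock_dir'],
                                           opts=opts,
                                           listen=True)

    # Print every job-related event (publications and minion returns) as it arrives.
    while True:
        job_event = event_bus.get_event(wait=10, tag='salt/job/', full=True)
        if job_event:
            print(job_event['tag'], job_event['data'])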

Salt provides important value to us at LinkedIn by enabling us to quickly and dynamically provision caching layers for many of the services that make up our site. It provides us with a versatile platform for automation and orchestration whose potential is virtually limitless.

The Couchbase Virtual Team at LinkedIn is currently supported by individual contributors across different SRE and Engineering teams who bring their unique use cases and solutions to the team. We are building the Salt integration so that it can be released as open source if there is sufficient public interest. If you’re interested in learning more, please reach out to me.