Couchbase Ecosystem at LinkedIn
December 6, 2017
Couchbase is a highly scalable, distributed data store that plays a critical role in LinkedIn’s caching systems. Couchbase was first adopted at LinkedIn in 2012, and it now handles over 10 million queries per second with over 200 clusters in our production, staging, and corporate environments. Couchbase’s replication mechanisms and high performance have enabled us to use Couchbase for a number of mission-critical use cases at LinkedIn. These include backend caching for our recruiter and jobs products, counters for security defense mechanisms, and a Source-Of-Truth (SoT) store for internal applications.
Over the years, as our Couchbase usage continued to grow, we had to ensure that it scaled operationally, so we developed an entire ecosystem around Couchbase. In this post, I will discuss some of the tooling that has helped LinkedIn scale our Couchbase deployment successfully.
The figure below does not fully capture the various data pipelines and topologies at LinkedIn, but it illustrates the key entities in our Couchbase deployment and how they interact.
Core Couchbase services
LinkedIn runs a mix of Couchbase Community and Enterprise Editions across our infrastructure. Our clusters range from 3 nodes up to 72 in our largest cluster. Currently, we deploy Couchbase as a standard RPM, but in the future, we’ll move toward coupling Couchbase deployments with our in-house deployment system, LID.
Salt + range
As detailed later in this post, LinkedIn installs and manages Couchbase clusters using an array of SaltStack tooling. We use range as a cluster configuration store to assist with installation and monitoring of the clusters. This data is also used in our fleet management system, Macy’s (see below).
Li-couchbase-client wraps the open source Java couchbase-client with our own modifications. Our version adds the ability to monitor client statistics and capture metrics like Queries Per Second (QPS), latency, and errors, giving us deep insight into how Couchbase is used and how it performs.
At LinkedIn, a number of our infrastructure tools are written in Python and use Couchbase as a backend. Currently, we use the open source Python couchbase-client. In the future, we plan to write our own wrapper library that will allow us to include the emission of metrics and the automatic discovery of Couchbase servers. This will help improve the operability of Couchbase for Python users at LinkedIn.
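The kind of client-side instrumentation described above can be sketched as a thin wrapper that intercepts client calls and records per-operation counts, latencies, and errors. Below is a minimal, hypothetical Python sketch; the `MetricsWrapper` class and its counters are illustrative assumptions, not LinkedIn's actual wrapper library, and a real implementation would emit these metrics to a monitoring pipeline rather than hold them in memory.

```python
import time
from collections import defaultdict


class MetricsWrapper:
    """Hypothetical sketch: wrap any client object (e.g. a Couchbase
    bucket) and record per-operation call counts, cumulative latency,
    and error counts. Illustrative only, not a real client library."""

    def __init__(self, client):
        self._client = client
        self.calls = defaultdict(int)         # operation -> invocation count
        self.errors = defaultdict(int)        # operation -> error count
        self.latency_ms = defaultdict(float)  # operation -> total latency (ms)

    def __getattr__(self, name):
        # Called only for attributes not defined on the wrapper itself,
        # so client operations like get/upsert are transparently timed.
        attr = getattr(self._client, name)
        if not callable(attr):
            return attr

        def timed(*args, **kwargs):
            start = time.monotonic()
            try:
                return attr(*args, **kwargs)
            except Exception:
                self.errors[name] += 1
                raise
            finally:
                self.calls[name] += 1
                self.latency_ms[name] += (time.monotonic() - start) * 1000.0

        return timed
```

With a wrapper of this shape, QPS can be derived by sampling the call counters at a fixed interval, and average latency per operation is the cumulative latency divided by the call count.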
While installing Couchbase on servers, we also install a daemon called “amf-cbstats.” Active Monitoring Framework (AMF) is a framework at LinkedIn for sending metrics from applications to our monitoring system. Amf-cbstats polls the standard performance metrics from a Couchbase server every minute and sends them to our metrics collection system. We also have a second daemon, “amf-couchbase-aux,” that collects metrics about the backups we perform on certain clusters. You can learn more about Couchbase monitoring in my Couchbase Connect 2016 presentation.
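To illustrate the shape of such a poller: Couchbase exposes per-bucket statistics as time series over its REST API, and a per-minute daemon mainly needs to flatten the latest samples into named metrics for emission. The sketch below, assuming a hypothetical `flatten_bucket_stats` helper and metric naming scheme, shows that flattening step only; it is not the actual amf-cbstats implementation, and the real daemon sends the results to AMF rather than returning them.

```python
def flatten_bucket_stats(bucket, samples):
    """Hypothetical sketch: turn the per-bucket stats samples from a
    Couchbase stats payload into (metric_name, latest_value) pairs.

    Each entry in `samples` is a short time series (a list of recent
    data points); we emit only the most recent point, since the daemon
    polls once per minute anyway. Metric names here are illustrative.
    """
    metrics = []
    for stat_name, series in sorted(samples.items()):
        if series:  # skip stats with no data points
            metrics.append(("couchbase.%s.%s" % (bucket, stat_name), series[-1]))
    return metrics
```

A real poller would fetch the payload from the cluster's REST endpoint each minute and hand the flattened pairs to the metrics pipeline.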
After collecting all of these metrics, we need to visualize them and set alerts against them. At LinkedIn, we use our in-house metrics visualization system, inGraphs, to display this data. After installing a Couchbase cluster, engineers use an internal utility called “couchbase-dashboard-generator” to generate a set of dashboards and alerts for the cluster. The dashboards include a set of graphs we consider key to monitoring the health of the cluster, as well as all the other statistics we collect. We automatically include alerts with these dashboards, and they are tunable by modifying the range data mentioned earlier.
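The alert-generation step can be sketched as merging default thresholds with per-cluster overrides (analogous to tuning alerts through the range data described earlier). The `build_alerts` function, its default thresholds, and the alert naming convention below are all hypothetical; they illustrate the pattern, not the actual couchbase-dashboard-generator.

```python
def build_alerts(cluster, overrides):
    """Hypothetical sketch: generate alert definitions for a cluster by
    merging illustrative default thresholds with per-cluster overrides
    (the overrides standing in for tunable range data)."""
    defaults = {"memory_used_pct": 85, "disk_used_pct": 80}
    merged = {**defaults, **overrides}
    return [
        {
            "name": "%s_%s_high" % (cluster, metric),  # illustrative naming
            "metric": metric,
            "threshold": limit,
        }
        for metric, limit in sorted(merged.items())
    ]
```

Generating dashboards and alerts from shared configuration data like this keeps every cluster's monitoring consistent while still allowing per-cluster tuning in one place.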
To get a higher-level view of our Couchbase deployment, the Couchbase-SRE team built a utility called “Macy’s,” which gives us a bird's-eye view of cluster configurations. It also collects cluster utilization metrics and reports inconsistencies in the deployment or monitoring of the infrastructure. This helps us ensure that our Couchbase deployments are set up and used optimally.
Our ability to build an ecosystem around Couchbase has largely enabled its success as a caching platform at LinkedIn. We have been able to build tooling around client access, monitoring, deployment, and management, which has allowed us to smoothly scale our Couchbase deployment. As LinkedIn continues to grow, we will continue to look for opportunities to further automate operations for Couchbase infrastructure.
A big thank you to the Couchbase-SRE team at LinkedIn: Ben Weir, Hardik Kheskani, James Won, Usha Kuppuswamy, Todd Hendricks, Subhas Sinha, and Bhaskaran Devaraj, for all the great work you have done to make Couchbase successful at LinkedIn.