
Production testing with dark canaries

The internet software industry has moved away from long development cycles and dedicated quality assurance (QA) stages, toward a fast-paced continuous-integration/continuous-delivery (CI/CD) pipeline, where new code is quickly written, committed, and pushed to user-facing applications and services. Doing so has dramatically increased iteration speeds, and at LinkedIn, it is not uncommon for new code to be committed and pushed to production within an hour, multiple times a day. While this has been incredibly successful, improving developer productivity by allowing us to make changes more quickly, it has also caused problems such as site or service outages when bad code, configuration, or AI models have been pushed to production.

Even with complete unit and integration tests, sometimes there is no substitute for production-level testing using both production data and production volume, especially with regard to system performance metrics such as memory consumption, CPU/GPU usage, concurrency, and latency. But to do production-level testing safely without negatively impacting customers, teams often need both production requests and production data. Frameworks such as record-and-replay offer a solution, but they are often not suitable for quickly changing scenarios, such as testing new AI models that require different input features to produce an output. Gathering new input at production volume requires a potentially expensive and time-consuming step of re-recording production requests and responses, which is infeasible when multiple new models are being A/B tested daily. This blog post introduces dark canary clusters as a way to detect problems before they hit production.

What is a dark canary?

The canary terminology comes from mining, where miners would lower an unfortunate canary in a cage to detect the presence of gases that were poisonous to breathe. These canaries served as an early warning signal to the miners that it wasn’t safe to go into the mines. In software engineering, a canary is an instance of a service that runs new code, configuration, or AI models at production level so that we can verify that the new code is safe before deploying it to more instances. If the new code is bad (e.g., has a performance degradation or an increased error rate), requests to that canary are potentially “sacrificed” until the bad code is rolled back. This may mean a small fraction of user requests will fail.

A “dark” canary is an instance of a service that takes duplicated traffic from a real service instance, but where the response from the dark canary is discarded by default. This means the end user is never impacted even if something goes wrong in the dark canary, such as errors, higher latency, or higher CPU/memory consumption.

Only read requests without side effects, such as tracking events that might pollute business metrics, should be duplicated. For instance, if there is a request to read a profile, a side effect would be to write to a database counting the number of profile reads. You could mitigate this side effect by detecting that the request came from a dark canary and using a separate database counter, or by not writing to the database at all, as in the sketch below. Calls to downstream services also need to be free of side effects, or else their impact needs to be mitigated. Other downstream effects also need to be carefully considered before using dark canaries.
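
To make the profile example concrete, here is a minimal sketch of a handler that skips the counter write for dark traffic. The header name, classes, and in-memory counter below are hypothetical stand-ins for illustration, not Rest.li or D2 APIs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: the header name and classes are hypothetical.
public class ProfileViewHandler {
  private static final String DARK_TRAFFIC_HEADER = "X-Dark-Traffic"; // assumed marker

  // Stand-in for the "database counter" side effect described above.
  private final Map<String, AtomicLong> viewCounts = new ConcurrentHashMap<>();

  public String getProfile(Map<String, String> headers, String memberId) {
    String profile = readProfile(memberId); // the read itself is safe to duplicate

    // Only record the view (a side effect) for real traffic; dark canary
    // requests take the same read path but leave no trace in business metrics.
    boolean isDarkTraffic = "true".equals(headers.get(DARK_TRAFFIC_HEADER));
    if (!isDarkTraffic) {
      viewCounts.computeIfAbsent(memberId, k -> new AtomicLong()).incrementAndGet();
    }
    return profile;
  }

  private String readProfile(String memberId) {
    return "profile-of-" + memberId; // placeholder for the real data store read
  }
}
```

Using a separate counter for dark requests would work the same way; the point is simply that dark traffic exercises the read path without touching business metrics.
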
If a request to a service results in hundreds of requests to downstream services, then there might be a non-negligible monetary cost for these “extra and non-essential” requests, because they might require more server capacity across all those downstream services.

History of dark canaries at LinkedIn

Back in 2013, one of our large backend services wanted support in Rest.li for dark canaries. The support at the time involved duplicating requests from one host machine and sending them to another host machine. This was implemented with a Python tool that populated a host-to-host mapping in Apache ZooKeeper, along with a filter that read this mapping and multiplied traffic. As operational complexity grew (due to additional data centers, dark canaries being used in midtier and even frontend services, and dynamic scaling of instances up and down), this became more complex to maintain. More teams were using dark canaries, but our developers and SREs were still hindered by how difficult it was to onboard and maintain them. For example, when dark canaries suddenly stopped receiving traffic or disappeared because hosts were swapped out from underneath them, engineers had to recreate the tedious host-to-host mapping in every data center. It was clear we needed a new solution.

Introducing dark canary clusters

A dark canary cluster is just like a dark canary, except that there are multiple dark canary instances. Sending traffic to a dark canary cluster requires the sender to discover and spread traffic among the members of the cluster. LinkedIn uses the Dynamic Discovery (D2) service discovery mechanism in Rest.li to send requests from service to service, so this was the logical place to add first-class support for dark clusters, because D2 already had a way to send requests to clusters. This integration enabled an entire cluster of regular service instances to send duplicated-but-ignored requests to a dark cluster of instances. The benefits of doing this were that:

- We could easily send a substantial level of traffic to dark canaries while minimizing the impact on sending instances, because the additional QPS on any single instance would be low.
- LinkedIn developers were familiar with the D2 service discovery mechanism and had a clear, supported path for reviewing code changes and promoting changes across staging areas and data centers.
- The forking mechanism (where we duplicate the request, send it to the dark service, and ignore the response) could be aware of both the source and target cluster sizes, so that we could maintain comparable levels of traffic between regular service instances and dark instances for comparison with validation tooling such as EKG.

Our teams at LinkedIn have found that dark canary clusters are an easy way to validate changes, especially for testing new online AI models, where the model’s system performance at production-level traffic is unknown. Common across all use cases is that this helps keep the LinkedIn site more stable and developers more productive by freeing our engineers from fearing new changes.

Architecture

In a typical service-oriented architecture, services call other services to fulfill the user request. In the diagram below, Service A has multiple instances, takes inbound requests, and calls other services (including Service B) to retrieve necessary information. This is a typical layered system in which Service A is a mid-tier service and Service B is a back-end service. If Service A is the service that we want to validate, then we can set up a cluster of Service A dark canaries.

Dark cluster architecture diagram

While there are many ways to dispatch traffic to the Service A dark canary cluster, LinkedIn uses a client-side library that is able to discover the dark cluster(s) that Service A should fork traffic to. 

In this case, we store a self-updating mapping from a source cluster (“ServiceACluster”) to a set of corresponding dark clusters ([“DarkServiceACluster”]) in Apache ZooKeeper. In addition, we can store other metadata, such as how much to multiply traffic to the dark cluster, and any other configuration needed for forking traffic. Now, whenever Service A receives a request, an inbound request filter will read this ZooKeeper data, detect that it needs to send dark traffic to DarkServiceACluster, and fork the request appropriately.

Sample D2 Configuration, simplified for brevity. Configs can be structured in XML/JSON/etc.
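
As a rough illustration of such a mapping, the configuration might look like the JSON below; the field names are assumptions made for this sketch rather than the actual D2 schema. The multiplier plays the role described later in this post.

```json
{
  "ServiceACluster": {
    "darkClusters": ["DarkServiceACluster"],
    "multiplier": 1.0
  }
}
```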
 

Using Apache ZooKeeper in this case is not strictly necessary, but it does help because it allows us to know the exact instance counts of both the regular and dark clusters. This enables us to scale both regular and dark clusters up and down independently, with the dispatching instances adjusting automatically. One strategy for sending dark traffic is to make the queries per second (QPS) between a dark cluster instance and a regular cluster instance identical (or proportional according to some multiplier), so they can be compared by validation tooling. If we add instances to DarkServiceACluster, then we would expect the total outbound dark requests from ServiceACluster to increase, but the QPS on each dark cluster instance to stay the same. Conversely, if we add instances to ServiceACluster, then we’d expect the per-instance QPS to ServiceACluster to drop (as the total QPS remains the same and is spread over more instances). As a result, we’d want the per-instance DarkServiceACluster QPS to drop by a corresponding amount.

To follow this strategy, let’s illustrate how a Service A instance can calculate the ratio of requests it should send.

First, a definition:
Multiplier: a number used to proportionally scale the traffic a dark cluster instance receives compared to a regular cluster instance. Setting the multiplier to 1.0 means the dark cluster per-instance QPS should be the same as the regular per-instance QPS, while 1.2 means each dark cluster instance should receive 20% more QPS than a regular instance.

All QPS fields below are per-instance unless explicitly specified otherwise.

ratio of dark requests = (number of DarkServiceACluster instances / number of ServiceACluster instances) × multiplier

And

outbound dark request QPS = ratio of dark requests × inbound QPS

This means that if the inbound QPS is 100, the multiplier is 1, and there are 2 DarkServiceACluster instances and 100 regular ServiceACluster instances, then the outbound dark request QPS per regular Service A instance is:

2 QPS = (2 instances / 100 instances) × 1 × 100 QPS

With this formula, the traffic on the dark cluster instances will stay proportional to the traffic on your regular service instances, and you don’t have to worry about overloading them just by scaling the respective clusters up or down.
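
Below is a minimal sketch, in Java, of how a dispatching instance might apply this formula on each inbound request. The class, the method names, and the probabilistic forking of the fractional remainder are assumptions of this sketch, not the actual D2 dark cluster implementation; the instance counts would come from the discovery data described above.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the per-instance forking decision; not the actual
// D2 dark cluster code. Cluster sizes would come from the discovery data.
public class DarkTrafficDispatcher {
  private final float multiplier;

  public DarkTrafficDispatcher(float multiplier) {
    this.multiplier = multiplier;
  }

  /** ratio of dark requests = (dark instances / regular instances) * multiplier */
  public float darkRequestRatio(int darkInstanceCount, int regularInstanceCount) {
    return (float) darkInstanceCount / regularInstanceCount * multiplier;
  }

  /**
   * Returns how many copies of this inbound request to send to the dark cluster.
   * Ratios above 1.0 send whole copies plus a probabilistic extra copy for the
   * fractional remainder, so the expected outbound dark QPS matches the formula.
   */
  public int copiesToSend(int darkInstanceCount, int regularInstanceCount) {
    float ratio = darkRequestRatio(darkInstanceCount, regularInstanceCount);
    int wholeCopies = (int) ratio;
    float remainder = ratio - wholeCopies;
    return wholeCopies + (ThreadLocalRandom.current().nextFloat() < remainder ? 1 : 0);
  }

  public static void main(String[] args) {
    // Example from above: 2 dark instances, 100 regular instances, multiplier 1.
    DarkTrafficDispatcher dispatcher = new DarkTrafficDispatcher(1.0f);
    float ratio = dispatcher.darkRequestRatio(2, 100);
    // With 100 inbound QPS per regular instance, each instance sends ~2 dark QPS.
    System.out.println("ratio = " + ratio + ", expected dark QPS = " + ratio * 100);
  }
}
```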

Ongoing work

Many teams at LinkedIn are working to realize the full potential of dark canary clusters, including approaches to automate the onboarding and spin-up of dark canary clusters, orchestrate the testing of AI models on dark clusters, and build better validation metrics and test suites that were not previously practical to run in production.

Conclusion

Dark canary clusters are something we believe many companies can benefit from, regardless of whether the cluster management is done using D2. Not only are onboarding and maintenance of dark clusters easy with this approach, but the strategy of comparing regular instances against dark cluster instances can also be used in automated validation, helping to speed up the development process by removing manual validation steps. Replacing error-prone manual validation with automated validation removes the expertise required to correctly validate code (especially against graphs) and builds the confidence needed to increase development velocity. We hope that this can help others establish best practices for validation and deployment within their companies.

Acknowledgements

Many thanks to Sean Sheng and Chris Zhang from the Service Infrastructure team for their development support and brainstorming, to Erik Krogen and Sumit Rangwala for reviewing this blog post, and to the brave first adopters: Srividya Krishnamurthy, Ramon Garcia, Peter Chng, and Yafei Wang. Thanks to the Pro-ML Leadership team for their continued support and investment in this cross-discipline area: Eing Ong, Josh Hartman, Joel Young, and Josh Walker.