
Learnings from the journey to continuous deployment

As an engineer, your goal is for every commit to seamlessly land in production and provide a delightful experience for your customers. While frequent releases give you the ability to iterate and apply feedback quickly, they also require significant time, effort, and cost to achieve. In this blog post, we’ll explore the components of our development ecosystem that help support continuous deployment, a release strategy in which each code commit goes through the build pipeline and is deployed to the production environment automatically.

LinkedIn is powered by thousands of microservices that are managed by self-organized, cross-functional teams that collaborate and build software incrementally. A microservice architecture breaks an application into smaller, autonomous services that are loosely coupled, highly maintainable, and independently deployed. However, teams have traditionally relied on a manual release process that slowed the delivery of value to our members and customers.

At LinkedIn, values like “Members first” and “Take intelligent risks” inspire us to challenge ourselves to achieve our goal of true continuous deployment. This results in multiple deployments over the course of a day and drives the following advantages:  

  • High-quality software: Automated tests catch regression defects on every commit, resulting in quality products that are production ready. Incremental changes also mean highly-maintainable products. 

  • Low-risk and faster releases: Releasing regularly brings value to customers and provides early feedback on future tasks. Releasing with smaller changes at regular intervals is safer compared to big, infrequent releases. Releases are faster, as the pipeline is automated.

  • Value for customers and satisfied engineers: Members and customers are happier, as products are shipped with higher quality. Moreover, the engineering team gains confidence in finding regression bugs by using automated tests. Developers are more engaged, leading to higher productivity that ultimately increases velocity in releases.

The epitome of success for an engineering team practicing an agile process is the ability to release features to production in a short amount of time. This strategy requires a shift in mindset away from existing, controlled manual releases, and not many companies have mastered it. A key point is that quality cannot be sacrificed for speed; the team should be confident about the quality of the product to claim it as a success.

Continuous deployment relies on continuous integration (CI), which helps to always keep the trunk of a software codebase healthy. Naturally, the next step is to automate the deployment activity reliably. 

Ground zero 

The first step towards our goal was to understand the existing challenges in our build and deploy pipeline and the infrastructure tools required to reliably overcome them. We identified three pillars (and tasks defined within each) that would enable us to transition from continuous delivery—where we maintain the trunk of the repository to be ready for deployment, but deployment is still manual—to continuous deployment, where validations and deployment are system-triggered based on results from the previous step. 

Quality pillars 

The pillars were defined such that each of them could be improved in parallel and then tied together after achieving the end goal for each pillar. The three pillars were:  

  1. Enhance code quality

  2. Improve integration testing in the staging environment, a production-like environment used for testing

  3. Refine pre-production validation

Figure: Stages in continuous deployment

Improve code quality
The primary cultural shift when moving to continuous deployment was to educate developers about writing quality tests and have them set up comprehensive test strategies for the entire build and deploy pipeline. 

The first step towards improving quality was to refine the review process to emphasize that reviewers should keep a watchful eye on code changes, test quality, and documentation. Code quality metrics, such as health score (inclusive of code coverage), dependency freshness, and code style, were collected on each build and used to guard against any regressions.
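
To make this concrete, here is a minimal sketch of a per-build quality gate, assuming hypothetical metric names and threshold values rather than LinkedIn's internal tooling:

```java
// A minimal sketch of a per-build quality gate; metric names and thresholds are hypothetical.
import java.util.Map;

public class QualityGate {

    // Minimum acceptable values for each metric collected on a build.
    private static final Map<String, Double> THRESHOLDS = Map.of(
            "codeCoverage", 0.80,        // fraction of lines covered by tests
            "healthScore", 0.90,         // composite score that includes coverage
            "dependencyFreshness", 0.75  // share of dependencies on current versions
    );

    /** Returns true only if every collected metric meets its threshold. */
    public static boolean passes(Map<String, Double> buildMetrics) {
        return THRESHOLDS.entrySet().stream()
                .allMatch(e -> buildMetrics.getOrDefault(e.getKey(), 0.0) >= e.getValue());
    }

    public static void main(String[] args) {
        Map<String, Double> metrics = Map.of(
                "codeCoverage", 0.85, "healthScore", 0.92, "dependencyFreshness", 0.70);
        // A failing gate blocks the commit before it reaches the deployment stages.
        System.out.println(passes(metrics) ? "gate passed" : "gate failed: block deployment");
    }
}
```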

Tools were built for developers to obtain insights into tests that failed frequently and to find flaky tests. Additionally, frameworks were developed to run contract and integration tests with external services during continuous integration. 

These test frameworks provide an early signal to developers, as the interaction between dependent services is tested during the development phase. They also improve the reliability of the staging environment because only well-tested services are deployed, while diagnostic software provides per-commit and historical insights into failing tests.
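
As a rough illustration of the flaky-test insight, the sketch below flags a test as flaky when it has both passed and failed on the same commit, i.e., its outcome changed with no code change; the class and record names are hypothetical and not the actual diagnostic tooling:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class FlakyTestDetector {

    /** One recorded execution of a test on a given commit. */
    record TestRun(String testName, String commit, boolean passed) {}

    /** A test is flaky on a commit if both passing and failing runs exist for that commit. */
    static Set<String> findFlaky(List<TestRun> runs) {
        return runs.stream()
                .collect(Collectors.groupingBy(r -> r.testName() + "@" + r.commit(),
                        Collectors.mapping(TestRun::passed, Collectors.toSet())))
                .entrySet().stream()
                .filter(e -> e.getValue().size() > 1) // both true and false observed
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<TestRun> runs = List.of(
                new TestRun("testCheckout", "abc123", true),
                new TestRun("testCheckout", "abc123", false), // same commit, different outcome
                new TestRun("testLogin", "abc123", true),
                new TestRun("testLogin", "abc123", true));
        System.out.println(findFlaky(runs)); // [testCheckout@abc123]
    }
}
```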

Two such frameworks are: 

  1. Rest.li Test Framework (RTF): RTF is based on a record/replay mechanism, where developers record calls to dependent services that are hosted in staging environments. These recordings are saved and then replayed during the build phase on the CI host. This test framework enables developers to validate interactions between multiple components early in the development cycle, enhancing code quality and confidence in the software. (A sketch of this record/replay flow follows the list.)
Figure: Modes of RTF (record and replay)

  2. Simple integration test: The simple integration test enables developers to mock calls to external services programmatically in the test code.
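
The following is a minimal sketch of the record/replay idea behind RTF, using a hypothetical RecordReplayClient rather than the framework's actual API; as described above, recordings are saved and then replayed during the build phase on the CI host without live network calls:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class RecordReplayClient {

    enum Mode { RECORD, REPLAY }

    private final Mode mode;
    private final Map<String, String> recordings = new HashMap<>();
    private final Function<String, String> liveService; // real downstream call, used only in RECORD mode

    RecordReplayClient(Mode mode, Function<String, String> liveService) {
        this.mode = mode;
        this.liveService = liveService;
    }

    /** In RECORD mode, call the staging service and save the response;
     *  in REPLAY mode, serve the previously recorded response on the CI host. */
    String call(String request) {
        if (mode == Mode.RECORD) {
            String response = liveService.apply(request);
            recordings.put(request, response);
            return response;
        }
        String recorded = recordings.get(request);
        if (recorded == null) {
            throw new IllegalStateException("No recording for request: " + request);
        }
        return recorded;
    }

    public static void main(String[] args) {
        // Record against a stand-in for a dependent service hosted in staging...
        RecordReplayClient recorder =
                new RecordReplayClient(Mode.RECORD, req -> "profile-for-" + req);
        recorder.call("member42");

        // ...then replay the captured interaction during the build, with no network access.
        RecordReplayClient replayer = new RecordReplayClient(Mode.REPLAY, req -> {
            throw new UnsupportedOperationException("network disabled on CI host");
        });
        replayer.recordings.putAll(recorder.recordings); // in practice, recordings are persisted
        System.out.println(replayer.call("member42"));
    }
}
```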

The primary goal in this phase of work was to provide early feedback to developers to identify and fix code issues. One major benefit is that the amortized cost of fixing an issue is lower in the development environment than at a later stage.

Integration testing in the staging environment
A successful build publishes an artifact that is deployed to the staging environment to detect issues related to dependencies. A suite of tests simulating user scenarios is then executed against services running in the staging environment. Staging environments are well suited for testing integrations with data stores (both online and offline) and with dependent services, and the deployment process continues only if these tests succeed. The staging environment can be unreliable, and services may be unavailable, because engineers do not constantly monitor service health there. This is why we have multi-stage integration testing, and one reason why integration test frameworks such as the Rest.li Test Framework and the simple integration test were developed to execute tests during the build step.
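
A highly simplified sketch of such a staging gate appears below; the endpoint URL and response format are hypothetical, and the real user-scenario suites are much richer than a single request:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StagingGate {

    // Hypothetical staging endpoint for one user-scenario check; not a real URL.
    private static final String STAGING_SCENARIO_URL = "https://staging.example.com/api/profile/health";

    /** Runs one user-scenario check against staging; deployment proceeds only on success. */
    static boolean scenarioPasses(HttpClient client) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(STAGING_SCENARIO_URL)).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 && response.body().contains("\"status\":\"UP\"");
        } catch (Exception e) {
            return false; // an unreachable or failing staging service blocks the rollout
        }
    }

    public static void main(String[] args) {
        boolean proceed = scenarioPasses(HttpClient.newHttpClient());
        System.out.println(proceed ? "promote artifact to canary" : "halt deployment: staging scenario failed");
    }
}
```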

Canary testing
As part of the deployment process, one last validation is to certify the latest version. At LinkedIn, we rely on automated canary testing. In canary testing, a small set of hosts is updated to the latest version of the software and a small percentage of users are routed to these hosts. The analysis runs for a preconfigured duration on a canary host, and the metrics it generates are compared against metrics generated on a control host. Upon detecting any regression or anomaly with the latest version, the change is immediately rolled back so that the impact is limited.

Additionally, we’ve developed solutions to validate the performance of a service in canary testing across metrics like response latency, throughput, and load. 
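
As an illustration, the comparison step might look roughly like the sketch below, with hypothetical metric names and tolerance values; an actual canary analysis would apply richer statistical checks across many more metrics:

```java
public class CanaryAnalysis {

    /** Aggregated metrics observed on a host over the analysis window. */
    record HostMetrics(double p99LatencyMs, double errorRate, double throughputQps) {}

    // Hypothetical tolerances; a regression beyond these versus the control host fails the canary.
    private static final double LATENCY_TOLERANCE = 1.10;    // at most 10% slower
    private static final double ERROR_TOLERANCE = 1.05;      // at most 5% more errors
    private static final double THROUGHPUT_TOLERANCE = 0.95; // at least 95% of control throughput

    static boolean canaryHealthy(HostMetrics canary, HostMetrics control) {
        return canary.p99LatencyMs() <= control.p99LatencyMs() * LATENCY_TOLERANCE
                && canary.errorRate() <= control.errorRate() * ERROR_TOLERANCE
                && canary.throughputQps() >= control.throughputQps() * THROUGHPUT_TOLERANCE;
    }

    public static void main(String[] args) {
        HostMetrics control = new HostMetrics(120.0, 0.002, 850.0);
        HostMetrics canary = new HostMetrics(190.0, 0.002, 840.0); // latency regression

        // A failing comparison triggers an automatic rollback so the impact stays limited.
        System.out.println(canaryHealthy(canary, control)
                ? "promote to full fleet"
                : "regression detected: roll back canary hosts");
    }
}
```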

Production
In a microservice architecture, a product is composed of multiple services, and a single underperforming service can degrade the user experience. Monitoring dashboards provide information about service health and behavior, tracking critical parameters such as system load, API latency, and throughput to assess the health of the software. Additionally, frameworks have been developed to run integration tests in the production environment without affecting system stability.
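
A minimal sketch of such a health check, with hypothetical thresholds for the critical parameters mentioned above; real values vary per service and are evaluated continuously rather than on a single sample:

```java
public class ServiceHealthMonitor {

    // Hypothetical thresholds for a single service.
    private static final double MAX_API_LATENCY_MS = 200.0;
    private static final double MAX_SYSTEM_LOAD = 0.75;    // fraction of capacity
    private static final double MIN_THROUGHPUT_QPS = 500.0;

    /** Evaluates one sample of the critical parameters and reports whether the service is healthy. */
    static boolean healthy(double apiLatencyMs, double systemLoad, double throughputQps) {
        return apiLatencyMs <= MAX_API_LATENCY_MS
                && systemLoad <= MAX_SYSTEM_LOAD
                && throughputQps >= MIN_THROUGHPUT_QPS;
    }

    public static void main(String[] args) {
        // One underperforming service can degrade the whole product, so each sample is checked.
        boolean ok = healthy(180.0, 0.82, 640.0); // system load above threshold
        System.out.println(ok ? "service healthy" : "alert: service degraded, investigate before next deploy");
    }
}
```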

Summary 

By making these changes, we have improved the release cadence of key services from having a few releases per week to multiple releases per day. The fundamental first step to continuous deployment was to develop quality tests and automate the execution of tests during the build step. This guarantees product quality. From there, it’s easier to establish an automated deployment strategy. 

Acknowledgments 

This has been an amazing team effort from Anisha Shresta, Ayeesha Meerasa, Yusi Zhang, Walter Scott Johnson, Sajid Topiwala, Gururajan Raghavendran, Alisa Yamanaka, Bill Lin and Graham Turbyne. It would not have been possible to execute successfully without the immense backing and support from Pritesh Shah and John Rusnak. The vision of our team is to achieve continuous deployment for all products within LinkedIn. 

We would also like to thank the rest of the management team, who have constantly been a source of encouragement and support, including Jeff Galdes and Dan Grillo.