Engineering Infrastructure at Scale: Overview

Ning Zhang

VP of Engineering @ Coupang

October 27, 2016

This blog series will describe the engineering infrastructure (technologies, processes, tools, and culture) that enables several hundred engineers across LinkedIn to innovate and release software continuously with agility, quality, and productivity. This post gives an overview of the architecture, workflow, and scale.

As shown in the diagram above, LinkedIn has native apps for iOS and Android, and the linkedin.com website for mobile and desktop browsers. All four clients call the same shared API frontend to exchange data with various middle-tier and backend services at LinkedIn. UX and features are highly consistent across the four clients.

There are four main code repos, one for each platform: iOS, Android, Web (mobile and desktop web share the same repo, and hence most of the logic), and API. Each main repo depends on a few dozen library repos written for the particular platform. The four main repos each have over 5,000 files, more than half a million lines of code, and about 200 total committers. The peak commit rate to one repo is about 15 per hour, 60 per day, and 250 per week. We use the trunk development model, i.e., each repo has only one branch, which everyone commits to and which we ship from many times a day.

Below is the overall code flow, which is iterative within each step and across the whole flow:

Big features have architecture design docs and must pass design reviews.
All features have automated tests; most features have Sphinx documentation. Features, tests, and documentation are all committed to the same code repo.
All code changes go through code reviews, and need a “ship-it” from both the owner ACL (an access control list with owners for each file) and the platform ACL (an access control list of experts who ensure code integrity, consistency, and best practices for the particular platform).
Each commit must pass static analysis, build, and automated tests in the repo (unit/layout/scenario, about 4,000 tests in total for each main repo, and growing) during the pre-commit phase before it gets committed to the repo. If any step in pre-commit fails, the commit is rejected. Pre-commit flows run in parallel for concurrent commits.
Once committed to the repo, the commit must pass the same (or a larger) set of tests again in the post-commit phase to ensure the merge of all commits still produces good, shippable builds for each commit. If post-commit fails, either the commit is auto-reverted, or an on-call engineer fixes it immediately. If post-commit succeeds, the build is published to Artifactory for release to member devices (iOS and Android) or for deployment to production sites (web and API).
For iOS and Android, each successful build is pushed to the alpha channel (the mobile team) immediately via the app’s self-updater. Every week, one iOS build and three Android builds are pushed to the beta channel (the company and public beta users) via an MDM (Mobile Device Management) system, the app’s self-updater, TestFlight (iOS), or the Play Store (Android) beta channel. Every Wednesday, if no blocker issues are found, one of the beta builds is released to the production channel (all LinkedIn members) via Apple’s App Store and Google’s Play Store.
For Web and API, each successful build is auto-deployed to EI and Staging (test environments) where automated tests are run. If the tests pass, the build is auto-deployed to canary boxes in production, where it serves live requests and is compared against other production boxes in a statistical analysis, called EKG, of various metrics (HTTP 200/400/500 responses, exceptions, fanout, latency, etc.). If EKG passes, the on-call engineer deploys the build to production, which also allows the next build in the queue to be canaried for the next deployment. We deploy around 9:00 a.m., noon, and 3:00 p.m. every workday.
All new features and WIPs (Work In Progress) are guarded by LiX’es (LinkedIn eXperiment) so that they are available only to the feature teams while under development, are gradually ramped to the company or public for feedback and A/B testing, and are ramped down immediately if critical issues are found. Once fully ramped, the LiX’es must be removed to keep the codebase clean and maintainable.
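The gradual ramping described for LiX’es can be illustrated with a small sketch. This is not LinkedIn's actual LiX implementation, which the post does not describe; it is a minimal example of one common way to gate a feature, assuming a stable hash of the member ID picks a bucket from 0 to 99 and a feature is on when that bucket falls below the current ramp percentage. The names `bucket`, `is_enabled`, `ramp_percent`, and `allowlist` are hypothetical and chosen only for illustration.

```python
import hashlib


def bucket(experiment: str, member_id: str, buckets: int = 100) -> int:
    """Map a member to a stable bucket in [0, buckets) for one experiment.

    Hashing the experiment name together with the member ID keeps the
    assignment deterministic and independent across experiments.
    """
    key = f"{experiment}:{member_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_enabled(experiment: str, member_id: str, ramp_percent: int,
               allowlist: frozenset = frozenset()) -> bool:
    """Decide whether a guarded feature is on for this member.

    allowlist    -- hypothetical set of IDs (e.g. the feature team) that
                    always see the feature, even at 0% ramp.
    ramp_percent -- 0 turns the feature off for everyone else,
                    100 turns it on for everyone.
    """
    if member_id in allowlist:
        return True
    return bucket(experiment, member_id) < ramp_percent


if __name__ == "__main__":
    team = frozenset({"dev-1"})
    for percent in (0, 10, 50, 100):
        on = sum(is_enabled("new-feed", f"member-{i}", percent, team)
                 for i in range(1000))
        print(f"ramp {percent:3d}%: enabled for {on} of 1000 sample members")
```

Because each member's bucket is fixed, raising `ramp_percent` only ever adds members, and setting it back to 0 switches the feature off for everyone outside the allowlist, which mirrors the "ramped down immediately" behavior described above.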

We call our engineering process “3x3”: Release three times per day, with no more than three hours between when the code is committed and when that code is available to members!
