Building in the cloud with remote development

Shivani Pai Kasturi

Staff Software Engineer at LinkedIn

November 4, 2021

Co-authors: Shivani Pai Kasturi and Swati Gambhir

Imagine developing on your laptop, but with the computing power of the cloud! Here at LinkedIn, we reduced the initial setup and build times from 10-30 minutes to just 10 seconds for most of our products with a new remote development experience. In this post, we’ll describe how we got there.

As members of the Developer Productivity and Happiness team at LinkedIn, we have often heard from developers how slow build speeds and environment setup problems affect their productivity. LinkedIn has a vast product ecosystem spanning different technologies that cater to different needs, such as Java, Python, C/C++, Go, JavaScript, iOS, and Android. Having an expansive ecosystem has its benefits, but each of these technologies may have different setup requirements and, often, new developers end up spending a good amount of time setting up their development environment.

During the pandemic when we were all working remotely, it became even more challenging to develop on laptops that had limited processing power, memory, and disk space. Compared to desktops or server-class computers, laptops typically have fewer CPU cores, less memory and disk space, and suffer from thermal throttling. Additionally, other software running in the background and suboptimal networks can further impact performance and contribute to slow builds. Given the scale of builds handled daily by the LinkedIn continuous integration (CI) pipeline, CI build failures and inconsistencies between local and CI builds have also been a key problem area for support engineers.

The Remote Development initiative at LinkedIn aims to solve these problems by providing all developers with remotely accessible, reliable, consistent, predictable, fast-to-build, and easy-to-set-up development environments for their projects, regardless of their local device and network connection. These remote development environments, which we call RDevs, are containers set up for a particular product, with all the necessary tools and packages required for its development. RDev instances are created on powerful hardware in our private cloud, with low latency to the services needed for network operations such as cloning and downloading dependencies (as shown in Figure 1).


Figure 1: Time in seconds to download a single dependency measured over multiple iterations.

We have integrated these RDevs with developers’ favorite IDEs that use remote SSH capabilities to provide a seamless development experience that feels just like developing locally. The average build time for one of LinkedIn’s large Play applications is shown below in Figure 2. As is evident, the build times are much shorter in an RDev.


Figure 2: Average build time for one of our applications on various operating systems/number of cores.

In this blog post, we’ll go over how we implemented this remote build and development environment based on containers, leveraging our existing infrastructure and product lifecycle. We also share details around how we reduced the initial setup time using RDevs and achieved consistency throughout our development and CI lifecycle.

Anticipating developers’ needs with pre-built RDevs

We maintain a pool of pre-built RDev environments by predicting the developers’ needs based on past RDev usage patterns and assign the RDevs to developers on demand. Pre-building an RDev involves spinning up a container, checking out the product, setting up the environment, building the product, and having the app running so that developers can start working immediately, without having to think about starting the application up. This saves a lot of time for developers, as shown in Figure 3 below.


Figure 3: Local vs pre-built RDev comparison for one of our application’s Clone and Build times.

The build process can vary depending on the type of product, as some products have a special continuous build process that watches the filesystem via inotify and keeps the build going (for example, Ember builds for JavaScript products). Even for regular products where the build process returns an exit code, the build’s output needs to be recorded. This is achieved by running the build inside a tmux session that developers can access after they get assigned an RDev.
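As a rough illustration of the tmux approach, the sketch below only assembles the commands involved; it does not reproduce LinkedIn's tooling. The function names, session name, build command, and log path are all hypothetical. The tmux subcommands and flags (new-session -d -s, attach-session -t) are standard.

```python
import shlex
from typing import List


def tmux_build_command(session: str, build_cmd: str, log_path: str) -> List[str]:
    """Return the argv that starts `build_cmd` in a detached tmux session.

    The build's stdout and stderr are piped through `tee`, so the output is
    visible in the tmux window and is also recorded in `log_path`.
    (Hypothetical helper; names and paths are illustrative only.)
    """
    shell_line = f"{build_cmd} 2>&1 | tee {shlex.quote(log_path)}"
    return ["tmux", "new-session", "-d", "-s", session, shell_line]


def tmux_attach_command(session: str) -> List[str]:
    """Return the argv a developer could run to attach to that session."""
    return ["tmux", "attach-session", "-t", session]
```

Passing the first list to subprocess.run would start the build in the background; a developer who is later assigned the RDev could attach to the same session to see the build output, whether the build has finished or is a continuous watcher.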

Extending the benefits of RDev to the Continuous Integration pipeline

Being able to develop (in RDev) and to build and deploy (in CI) using the same container brings the additional benefits of consistency and reproducibility.

To reap these benefits, we updated the build step in our CI pipeline and delegated it to run the existing CI tasks inside the container. This CI container is created from an image generated and maintained by LinkedIn’s image infrastructure (as explained in the next section), and is used both for remote development and build-in-CI workflows. This methodology is very similar to how GitHub Actions works with its “runs-on” and “container” directives.

How it works

Let’s go over the components that let us cut setup and build times from tens of minutes to about 10 seconds.

Figure 4 shows the major components of the Remote Development Ecosystem.


Figure 4: Remote development architecture.

Base image infrastructure
Base image infrastructure integrates building container images with our CI pipeline and helps developers easily create and publish custom images to an internal LinkedIn container image registry. We have a set of template images for certain technologies like Python, Java, and JavaScript that developers can directly use or extend from.

For each CI build of an “image” product, a dependency graph is created, which contains information on all the RPMs of that image and parent base image information. This dependency graph backs an image dependency updater service that keeps all the RDev images up to date. It picks up any available changes to the internal RPMs and rebuilds the images with those updates. Any image containing those RPMs directly is updated, along with any dependent images. These images are used both in RDev configuration and CI to create development containers and CI build containers, backing a consistent development and build environment.
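To make the propagation rule concrete, here is a minimal sketch of how such a dependency graph could be queried. It is not the updater service itself: the ImageNode type, the function name, and the image and RPM names are hypothetical, and it assumes each image has at most one parent and that the graph has no cycles.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ImageNode:
    name: str
    parent: Optional[str] = None                 # base image this one extends
    rpms: Set[str] = field(default_factory=set)  # RPMs installed directly


def images_to_rebuild(images: List[ImageNode], changed_rpm: str) -> List[str]:
    """Return every image affected by `changed_rpm`, parents before children.

    An image is affected if it, or any image in its parent chain, installs
    the RPM directly.
    """
    by_name: Dict[str, ImageNode] = {img.name: img for img in images}

    def ancestors_and_self(name: str) -> List[str]:
        chain = []
        current: Optional[str] = name
        while current is not None and current in by_name:
            chain.append(current)
            current = by_name[current].parent
        return chain

    affected = [
        img.name
        for img in images
        if any(changed_rpm in by_name[n].rpms for n in ancestors_and_self(img.name))
    ]
    # Sort by depth in the tree so a base image is rebuilt before its children.
    return sorted(affected, key=lambda n: (len(ancestors_and_self(n)), n))
```

For example, with a base image that installs openssl and two template images extending it, a change to openssl would select the base image first and then both templates, while a change to an RPM installed only in one template would select just that template and its descendants.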

RDev configuration
We follow VS Code’s container configuration format. The basic container configuration, such as the image name, environment variables, and ports to be forwarded from within the container, is described declaratively in a .devcontainer/devcontainer.json file at the root of a product repository.
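As an illustration, the sketch below reads the three settings mentioned above from a devcontainer.json document. The property names image, containerEnv, and forwardPorts come from the dev container format; the registry URL, environment variable, port, and the DevContainerConfig type are made up for this example. Real devcontainer.json files may contain comments, which the standard json module does not accept, so this assumes a comment-free file.

```python
import json
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical configuration; every value here is illustrative.
EXAMPLE_DEVCONTAINER = """
{
  "name": "example-product",
  "image": "registry.example.com/templates/java:latest",
  "containerEnv": {"JAVA_TOOL_OPTIONS": "-Xmx4g"},
  "forwardPorts": [9000]
}
"""


@dataclass
class DevContainerConfig:
    image: str
    env: Dict[str, str]
    forward_ports: List[int]


def parse_devcontainer(text: str) -> DevContainerConfig:
    """Extract the image, environment variables, and forwarded ports."""
    raw = json.loads(text)
    return DevContainerConfig(
        image=raw["image"],  # required in this sketch
        env=dict(raw.get("containerEnv", {})),
        forward_ports=list(raw.get("forwardPorts", [])),
    )
```

Calling parse_devcontainer(EXAMPLE_DEVCONTAINER) yields a small typed object that tooling could use to choose the container image and decide which ports to expose.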

RDev CLI
RDev CLI is a Python CLI that is distributed to all developers’ machines and has the necessary commands to create, connect (via CLI or IDE), and manage these remote development environments.

RDev server
RDev server is a Rest.li Python service that acts as a broker between the CLI and the Kubernetes operator. It is responsible for forwarding requests to the Kubernetes operator, querying it for results, and also interacting with the database where we store developer preferences and metadata (like dotfiles).

RDev operator
We extend the Kubernetes API by leveraging the Kubernetes Operator pattern and defining LinkedIn-specific Custom Resource Definitions (CRDs).

We define two CRDs: Rdev and RdevPool. The Rdev CRD represents a single-instance stateful application, with a specification that has enough information to recreate itself from scratch. The RdevPool CRD wraps the CloneSet CRD in order to maintain a pool of pre-built RDevs. The RDev operator leverages the Kubebuilder framework for building operators and acts as a controller for these CRDs, reconciling their current state to the desired state.

Figure 5: Pod architecture.

As shown in Figure 5, RDev is associated with a Service that is necessary to expose ports outside the Kubernetes cluster. A NodePort is used to expose the server.

The Persistent Volume Claim (PVC) is necessary to reserve a Persistent Volume (PV) in order to store non-volatile data; in this case, that data is the home directory of the RDev. This is essential in cases when the Pod, described below, needs to be moved to another node or is accidentally deleted.

Each RDev is backed by a Kubernetes Pod that is composed of three immutable containers: rdev-init-workspace, rdev-sshd, and rdev-sidecar. It also has two main volume mounts, Home and Rdev Info, along with other necessary volumes related to certificates and security.

Containers:

rdev-init-workspace: This is an init container that prepares the developer's workspace and preferences.

rdev-sshd: A container that runs sshd and provides the login service for the RDev. It is created from the image specified in the product’s devcontainer.json file and contains all the tools necessary for development.

rdev-sidecar: A container that is responsible for checking out and installing dotfiles, and also runs the Startup Probe (described below). This probe is used to determine if the RDev Pod is fully built and ready to be assigned to a developer.

Volume mounts:

Home volume: Home volume, as the name suggests, is the developer’s home and will have the product checked out, the developer’s dotfiles installed, environment variables set, and user profile configured for the developer.

Rdev info volume: Rdev info volume contains host and port details populated using the labels and annotations of the Pod, leveraging the Kubernetes Downward API.
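The containers and volumes listed above can be pictured as a Pod manifest. The sketch below builds one as a plain Python dictionary using standard Kubernetes Pod fields (initContainers, containers, volumes, persistentVolumeClaim, downwardAPI). The image names for the init and sidecar containers, the mount paths, the label, and the choice of which container mounts which volume are assumptions for illustration, not details from LinkedIn's operator; certificate and security volumes are omitted.

```python
from typing import Any, Dict


def rdev_pod_manifest(name: str, dev_image: str, home_claim: str) -> Dict[str, Any]:
    """Build a simplified Pod manifest with the three RDev containers."""
    # Mount paths are hypothetical.
    home_mount = {"name": "home", "mountPath": "/home/dev"}
    info_mount = {"name": "rdev-info", "mountPath": "/etc/rdev-info"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "rdev"}},
        "spec": {
            # Runs to completion before the main containers start.
            "initContainers": [
                {"name": "rdev-init-workspace",
                 "image": "example/rdev-init:latest",  # hypothetical image
                 "volumeMounts": [home_mount]},
            ],
            "containers": [
                # Image comes from the product's devcontainer.json.
                {"name": "rdev-sshd", "image": dev_image,
                 "volumeMounts": [home_mount, info_mount]},
                {"name": "rdev-sidecar",
                 "image": "example/rdev-sidecar:latest",  # hypothetical image
                 "volumeMounts": [home_mount, info_mount]},
            ],
            "volumes": [
                # Home directory survives Pod rescheduling via the PVC.
                {"name": "home",
                 "persistentVolumeClaim": {"claimName": home_claim}},
                # Downward API volume: the Pod's labels and annotations
                # are exposed to the containers as files.
                {"name": "rdev-info", "downwardAPI": {"items": [
                    {"path": "labels",
                     "fieldRef": {"fieldPath": "metadata.labels"}},
                    {"path": "annotations",
                     "fieldRef": {"fieldPath": "metadata.annotations"}},
                ]}},
            ],
        },
    }
```

Serializing the returned dictionary to YAML or JSON gives a manifest of the general shape described here, with the home directory on persistent storage and the Pod's own metadata readable from inside the containers.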

As mentioned previously, RdevPool wraps a CloneSet that maintains a pool of RDevs based on the number of replicas configured. Once the RDev Pod is created, the PostStart container hook triggers the build command in the rdev-sshd container. The Startup Probe running in the rdev-sidecar container keeps probing to check whether the build has finished successfully. It determines that the product is built either by looking for the file in which the build output is recorded, or by fetching the URL provided in the configuration file using curl. After the Startup Probe succeeds, the RDev Pod is marked as “ready” to be assigned to a developer.
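The two readiness checks can be sketched as a small function. This is an assumption-laden illustration rather than the actual probe: the function and parameter names are hypothetical, Python's urllib stands in for curl, and a file's existence is treated as the signal that the build output was recorded.

```python
import os
import urllib.error
import urllib.request
from typing import Optional


def build_finished(marker_file: Optional[str] = None,
                   health_url: Optional[str] = None,
                   timeout: float = 2.0) -> bool:
    """Return True once the product looks built.

    Checks the recorded-output file if one is configured, otherwise tries
    the configured URL. With neither configured, the build is not
    considered finished.
    """
    if marker_file is not None:
        return os.path.isfile(marker_file)
    if health_url is not None:
        try:
            with urllib.request.urlopen(health_url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or HTTP error: not ready yet.
            return False
    return False


def probe_exit_code(marker_file: Optional[str] = None,
                    health_url: Optional[str] = None) -> int:
    """Exit status for an exec-style probe: 0 means success."""
    return 0 if build_finished(marker_file, health_url) else 1
```

Kubernetes treats an exec probe as successful when the command exits with status 0, so a script wrapping probe_exit_code could be run repeatedly until the build completes.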

When a developer requests an RDev, the RDev controller will look for an unassigned Pod that is fully built, take ownership of the Pod, and remove it from the RdevPool controller. The RdevPool controller will notice one of its Pods is missing and create a new one to maintain the number of replicas provided in the RdevPool Spec.
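The hand-off between the two controllers can be simulated in a few lines. The sketch below is an in-memory model only, with hypothetical class and method names; a real operator would watch and update Kubernetes resources rather than a Python list.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class Pod:
    name: str
    ready: bool = False          # set once the Startup Probe has succeeded
    owner: Optional[str] = None  # developer the Pod is assigned to


class RdevPoolSim:
    """Toy model of a pool that always tops itself up to `replicas` Pods."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas
        self.pods: List[Pod] = []
        self._ids = count(1)
        self.reconcile()

    def reconcile(self) -> None:
        """Create Pods until the pool again holds `replicas` of them."""
        while len(self.pods) < self.replicas:
            self.pods.append(Pod(name=f"rdev-{next(self._ids)}"))

    def mark_ready(self, name: str) -> None:
        for pod in self.pods:
            if pod.name == name:
                pod.ready = True

    def assign(self, developer: str) -> Optional[Pod]:
        """Hand the first fully built Pod to `developer`, then refill."""
        for pod in self.pods:
            if pod.ready:
                self.pods.remove(pod)  # the Pod leaves the pool's control
                pod.owner = developer
                self.reconcile()       # the pool notices and replaces it
                return pod
        return None  # nothing pre-built is available yet
```

In this model, a request made before any Pod is ready returns None, while a request made after a Pod is ready removes it from the pool and immediately triggers the creation of a replacement.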

Looking forward

With remote work becoming a pervasive part of modern life, we believe remote development is going to be a fundamental enabler for LinkedIn's developers to get best-in-class development experiences wherever they are.

We are excited about the upcoming capabilities backed by Remote Development, such as:

Reproducing failed CI builds and simplifying the debugging experience by providing developers a corresponding RDev for each failed execution.

Associating an RDev with each GitHub pull request to help reviewers visualize the changes and thus improve the review experience.

Acknowledgements

Creating an infrastructure like this at scale can't be done without the help of many engineers across many teams. Special thanks to Oscar Bonilla for his vision and the entire Remote Development team—Brian Dittmer, Evgeny Barbashov, Garv Mathur, Hasanat Kazmi, Jie Li, Loren Carvalho, Qishen Li, and Tao Wang—for their valuable contributions. We would also like to thank members from our partner teams— Alasdair James King, Anusha Nagarajan, Leonid Lyamanov, Mike North, Ronak Nathani, Suchita Doshi, Todd Maeshiro, and Vanessa Borcherding—for their continuous support. Additionally, we would like to thank Brian Beck, David Herman, Pritesh Shah, Scott Holmes, and Yiming Wang for their guidance and feedback throughout this project.
