Data Management

Empowering our developers with the Nuage SDK

LinkedIn delivers value to more than 660 million members via thousands of microservices, most of which depend on data infrastructure or mid-tier infrastructure platforms. This means that in order to launch a new application, developers traditionally had to request and set up an online database as the source of truth for business data, or stream data via Kafka topics, as well as configure alerting and monitor system health and traffic patterns. These tasks used to take days or even weeks to complete and required support from platform developers or SREs. This all changed when we introduced Nuage: application developers can now complete these tasks via self-service in minutes.

We achieved this by standardizing the management APIs of the various platforms. Integrating a single platform used to take multiple Nuage developers several quarters. Today, Nuage builds and maintains a set of standard libraries and services that we call the Nuage SDK. Platform developers can now self-serve the integration of their own platform, and what once required a multi-developer effort over several quarters can now be completed by a single developer in a single quarter.

In this blog post, we will focus on these self-service integrations. We will review the productivity blockers we experienced, explain the key ideas behind our strategy, and go over the features provided by the SDK.

Problems

Let’s start by taking a look at our original architecture.


Figure 1: The original architecture, with each box representing an application. Every time Nuage integrates a new platform or adds new features to existing platforms, we (Nuage developers) add new logic to both the “back-end server” and “front-end UI” apps. Both apps grow in complexity and each becomes a potential single point of failure.

Unfortunately, we found that as the Nuage code base grew in complexity, collaboration became difficult. Improving documentation and introducing runbooks and routines helped, but we continued to encounter a number of challenges as our platform grew:

  • A growing number of partner platforms to be integrated with the Nuage portal.

  • An influx of ongoing feature requests from integrated partner platforms.

  • An increase in maintenance costs (bug fixes and user support) that was out of proportion to our team size.

Furthermore, collaboration between Nuage developers and platform developers became a major bottleneck in the process.


Figure 2: The old collaboration pattern. Domain knowledge and requirements are transferred through human interactions: platform developers explain their data model and business logic first, and Nuage developers convert that knowledge into Nuage code and ask the platform developers for feedback before release.

Better division of labor and specialization

Letting Nuage and platform developers work in their respective areas of expertise makes integration more efficient: Nuage focuses on generic logic that applies to all platforms, while platform teams focus on platform-specific logic, such as choosing which cluster to provision a database on.

Hence, we distilled the generic logic out of Nuage’s “back-end server” and rebranded it as the Nuage SDK, a collection of micro apps and libraries. The expectation was that, moving forward, platform-specific logic would be authored by platform developers.


Figure 3: All of the logic depicted above, whether generic or platform-specific, used to live in the “back-end server” component of Figure 1. Now, the major focus of Nuage developers is the Nuage SDK.

By leveraging Nuage SDK, it’s now possible to deploy platform-specific logic as separate micro apps, helping us achieve the goal of removing single points of failure.


Figure 4: The shift in architecture does not affect our users (LinkedIn developers). Each platform integrated with Nuage now has its own deployable back-end server. This means better isolation of errors, more flexible deployment schedules, a simpler code base to work with, and much shorter integration test and debug times.

Take the “nuage kafka back-end,” for example. It supports Kafka topic creation, configuration, schema management, and more. Its features are implemented using both Nuage SDK logic (written by Nuage developers) and Kafka-specific logic (written by Kafka developers).

Though the architecture and the structure of the logic have changed, development still follows a design-then-implementation process. This means we need a protocol for Nuage and platform developers to communicate requirements, as well as a framework that gives structure to both generic and platform-specific logic.

Design reviewed as code

As illustrated in Figure 2, requirements from platform developers were communicated in design meetings and then translated by Nuage developers into a back-end API, a process that traditionally took several design sessions. Within LinkedIn, we use Rest.li for API design and Pegasus for data modeling, so it was only logical to let platform developers express their designs using these standard tools. To everyone’s benefit, Rest.li and Pegasus design artifacts are both human and machine readable.


Figure 5: The new collaboration pattern, using Rest.li and Pegasus as common languages to communicate API design.  

Platform developers now design their own Rest.li API and Pegasus data model, and the artifacts are then peer reviewed by Nuage developers. What once took multiple design meetings is now a straightforward code review.

Here is a sample data model as a Pegasus file:


Figure 6: Sample PDSC data model used for both Rest.li server and front-end UI auto-generation.
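For readers who can’t view the image, a minimal PDSC record looks roughly like the snippet below. This is purely illustrative: a hypothetical schema, not the one shown in Figure 6.

    {
      "type": "record",
      "name": "PinotTable",
      "namespace": "com.linkedin.nuage.example",
      "doc": "Hypothetical metadata for a Pinot table managed through Nuage.",
      "fields": [
        { "name": "tableName", "type": "string", "doc": "Name of the Pinot table." },
        { "name": "owners", "type": { "type": "array", "items": "string" }, "doc": "Owners of the table." },
        { "name": "retentionDays", "type": "int", "optional": true, "doc": "Data retention in days." }
      ]
    }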


Figure 7: A sample Rest.li spec for creating a Pinot table. The Nuage front end calls the Rest.li server defined by this spec and also uses the spec to auto-generate the UI.
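In Rest.li, a spec like the one in Figure 7 is typically backed by an annotated resource class. A rough, hypothetical sketch (simplified class and method names, not the actual Nuage code) might look like this:

    import com.linkedin.restli.common.HttpStatus;
    import com.linkedin.restli.server.CreateResponse;
    import com.linkedin.restli.server.annotations.RestLiCollection;
    import com.linkedin.restli.server.resources.CollectionResourceTemplate;

    // Illustrative only: a Rest.li collection resource backing Pinot table creation.
    @RestLiCollection(name = "pinotTables", namespace = "com.linkedin.nuage.example")
    public class PinotTableResource extends CollectionResourceTemplate<String, PinotTable> {

      // CREATE maps to the "create a Pinot table" operation described above.
      @Override
      public CreateResponse create(PinotTable table) {
        // Platform-specific provisioning logic (written by Pinot developers) would be
        // invoked here, while generic metadata handling comes from the Nuage SDK.
        return new CreateResponse(table.getTableName(), HttpStatus.S_201_CREATED);
      }
    }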

Once the design is finalized, platform developers plug their platform-specific logic into API hooks defined in the Nuage SDK; a rough sketch of what such a hook might look like follows Figure 8.

Figure 8: Nuage developers build features into the SDK, which platform developers are then able to use.
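The exact hook interfaces are internal to Nuage, but the idea can be sketched roughly as follows (all names below are hypothetical):

    // Hypothetical extension point defined by the SDK (illustrative only).
    interface PlatformCreateHook<T> {
      // The "create in platform" phase of the execution pipeline.
      void createInPlatform(T resource) throws Exception;
    }

    // A Kafka-specific implementation plugged into the SDK by Kafka developers.
    class KafkaTopicCreateHook implements PlatformCreateHook<KafkaTopic> {
      @Override
      public void createInPlatform(KafkaTopic topic) throws Exception {
        // Kafka-specific provisioning: pick a cluster, create the topic, apply configs.
      }
    }

    // Placeholder type standing in for the generated Pegasus record.
    class KafkaTopic {
      String name;
    }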

Auto UI generation

As mentioned above, Rest.li specs and Pegasus data models are readable by both humans and machines. We leveraged both characteristics to communicate more clearly and innovate more quickly by developing a rendering engine.

The purpose of this engine is to ingest a Pegasus data model file and Rest.li spec and auto-generate UI elements. Preliminary use of the engine has shown that it dramatically reduces the UI effort required to integrate a platform.
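A very rough sketch of the idea, assuming the engine walks the Pegasus schema and maps field types to UI controls (the real rendering engine is internal and may work quite differently):

    import com.linkedin.data.schema.DataSchema;

    // Illustrative only: choose a UI control for a Pegasus field type.
    static String controlFor(DataSchema.Type type) {
      switch (type) {
        case BOOLEAN: return "checkbox";
        case INT:
        case LONG:
        case FLOAT:
        case DOUBLE:  return "number-input";
        case ENUM:    return "dropdown";
        case ARRAY:   return "repeatable-list";
        default:      return "text-input";
      }
    }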

Nuage SDK features

In this section, we will cover a few features of the Nuage SDK to showcase the interaction between SDK logic and platform-specific logic. For each scenario, we identify what the SDK provides and what platform developers need to do.

CRUD operations
For most platforms, the basic API operations are Create, Read (Get), Update, and Delete (CRUD). These are exposed as standard Rest.li methods.

For example, Kafka topic creation has two phases: provisioning the topic in a Kafka cluster (create in platform) and creating the metadata entry (create in Nuage). The SDK creates the required metadata (ownership, ACLs, alert configurations, etc.) out of the box, while LinkedIn’s Kafka developers implement an SDK API to provision the topic in the Kafka cluster. It is also easy to extend this two-step execution pipeline with extra steps, such as preCreate, create in platform, postCreate, and create in Nuage, as sketched below.
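As a rough sketch of that extensibility, the pipeline might be modeled along these lines (the types and helper objects here are hypothetical, not the actual SDK API):

    import java.util.List;

    // Hypothetical step abstraction; the pipeline runs each step in order.
    interface CreateStep<T> {
      void apply(T resource) throws Exception;
    }

    static <T> void runCreatePipeline(T resource, List<CreateStep<T>> steps) throws Exception {
      for (CreateStep<T> step : steps) {
        step.apply(resource);
      }
    }

    // Example: topic creation as an extended four-step pipeline.
    runCreatePipeline(topic, List.of(
        t -> quotaChecker.check(t),          // preCreate (hypothetical check)
        t -> kafkaHook.createInPlatform(t),  // create in platform (Kafka-specific)
        t -> auditLog.record(t),             // postCreate (hypothetical follow-up)
        t -> nuageMetadata.create(t)));      // create in Nuage (SDK-provided metadata)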

Async execution management
Short-running async jobs: ParSeq is an open source Java async library used extensively within LinkedIn. Our SDK adds a thin layer over ParSeq and provides platform developers with a verb: “submit async task.” Anyone with previous ParSeq experience can use it in the same way. Behind the scenes, the Nuage SDK persists execution status periodically until the task completes or fails. We also built an application to query and visualize execution status by keyword. The only thing platform developers need to do is submit a vanilla ParSeq task to our wrapped engine, as shown below.
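For context, submitting a vanilla ParSeq task through a wrapper of this kind might look like the snippet below; submitAsyncTask and the surrounding objects are hypothetical stand-ins for the SDK’s “submit async task” verb:

    import com.linkedin.parseq.Task;

    // A vanilla ParSeq task wrapping platform-specific work (kafkaAdmin is hypothetical).
    Task<Void> provisionTask = Task.action("provisionTopic",
        () -> kafkaAdmin.createTopic("page-view-events"));

    // Hypothetical SDK wrapper: runs the task and persists its execution status
    // periodically until it completes or fails.
    nuageAsyncEngine.submitAsyncTask(provisionTask);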

Approval workflows by humans: Certain requests (increasing a quota, for example) require approval from resource owners. Such cases are asynchronous by nature. The SDK again provides a verb: “createApprovalWorkflow(requester id, payload, target resource id).” Platform developers implement the business logic (e.g., trigger an approval workflow if the requester is not authorized). Behind the scenes, the Nuage SDK persists the original request, and we built an application that supports approving, rejecting, and querying queued tasks. Upon approval, the persisted tasks are executed on behalf of the original requester.
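Roughly, a platform’s quota-change handler might call this verb as sketched below; the client object, payload, and helper method are hypothetical, and the real signature may differ:

    // Illustrative only: gate a quota increase behind human approval when needed.
    if (!authorizer.isAuthorized(requesterId, targetResourceId)) {
      // Persist the request and queue it for resource owners to approve or reject.
      approvalWorkflowClient.createApprovalWorkflow(requesterId, quotaIncreasePayload, targetResourceId);
    } else {
      // Authorized requesters skip the approval step and the change is applied directly.
      applyQuotaIncrease(targetResourceId, quotaIncreasePayload);
    }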

Summary

There’s an economic principle that says gains in efficiency come from better division of labor and specialization. This is why we created a framework built around self-service, empowering our teams to focus on the logic they know best.

We found that the cost of integrating more infrastructure platforms does not have to scale linearly with the number of platforms we support. One of the key observations in getting there was that natural language discussion (human-to-human) can be an inefficient channel for transferring domain knowledge and requirements. Therefore, Rest.li specs and Pegasus data models are used as a standard protocol to communicate API requirements between Nuage and platform developers, reducing costs. The Rest.li spec and Pegasus files are also used to auto-generate the UI.

This effort has brought down the cost of integrating infrastructure platforms into Nuage and enabled our team to deliver on our vision at a much faster pace.

Acknowledgements

Thank you to Changran Wei, Terry Fu, Yifang Liu, Nishant Lakshmikanth, Hunter Perrin, Tyler Corley, Darby Perez, and Micah Stubbs for contributing to the success of this project over the past year and a half. It would have been impossible to turn this idea into reality without everyone’s perseverance and effort. Thank you Mohamed Battisha and Eric Kim for your strong support as managers. Thank you Ke Wu and the ISB team for being the first team to pilot our SDK early on. Your success boosted our confidence in our own work. Finally, thank you Yun Sun, Hai Lu, and Lei Xia for taking the time to review this article.