
Solving manageability challenges at scale with Nuage

Introduction

LinkedIn is committed to providing economic opportunities for every member of the global workforce, and we’re growing at a rapid pace. Our platform is built on a collection of large-scale multi-cluster services functioning in harmony to offer a unified product experience to members. 

The aim of several backend engineering teams is to allow other teams to stay focused on LinkedIn’s business goals without having to worry about the varying complexities of different services. We essentially want to make infrastructure invisible. However, these large-scale services pose many challenges, including scalability, availability, reliability, efficiency, resiliency, serviceability, and manageability. We homed in on a few of these, specifically serviceability and manageability, to ease the pain of service builders, because we had a clear view of our manageability challenges. That clarity inspired us to create a dedicated cloud management service: Nuage.

The goal of Nuage is to provide a unified experience to the product teams, abstracting out all the heterogeneity of underlying hardware and software for different services to provide a centralized and coordinated view of any distributed system. 

The product teams build different features at a rapid pace, creating many compute-, I/O-, and network-intensive applications. These applications often use more than one infrastructure offering, such as:

  • Espresso: Highly scalable distributed database

  • Galene Search: Real-time search index with live updates from a database/streaming service 

  • Kafka: Pub-sub to emit asynchronous events

  • Bulkhead: Quota throttling system to prevent rogue callers

  • Helix: Distributed cluster management framework

  • Fuse: Anti-abuse system to prevent system DDoS attacks

  • Couchbase: Caching solution to reduce latency

  • Ambry: Blob storage to store and retrieve images and other blobs

  • Pinot: Real-time distributed OLAP datastore

  • Samza: Stream processing service

  • Pro-ML: Machine learning model management service

  • Brooklin: Change-capture streaming service

  • Venice: Derived data platform with near-line and offline data ingestion

Nuage provides the ability to set up these infrastructure services and manage their lifecycles, thereby maintaining the pace at which product teams can attain their goals.

Nuage plays a vital role in many critical areas, including: 

  • Security (authentication and authorization) 

  • Resource provisioning (finding the best cluster to host the resource) 

  • Monitoring and alerting

  • Cluster management

  • Auditing and compliance

  • Making databases/stores auto-discoverable via dynamic discovery

  • Notifying interested audiences about sub-system changes

  • Maintaining the SLAs of different resources by enforcing quotas 

  • Tracking the capacity usage of resources and clusters, among many others

All of the above challenges are further compounded when we consider resource management for disparate, large-scale data systems running on geographically distributed clusters. Over the last couple of years, the Nuage team has adopted many strategies to seamlessly extend manageability across many of LinkedIn’s infrastructure services. We first introduced Nuage on the Engineering Blog last year in a post discussing the different systems we manage. In this post, you’ll learn more about the manageability and serviceability challenges we experience and how Nuage helps us combat them.

Tackling heterogeneous services

At present, Nuage manages more than 15 unique major services, and it’s not realistic or efficient to build a different solution for each. Our solution: platformization. We built a platform for Nuage onto which any service can be integrated by following our integration guidelines. In this process, we created an application template (a generic deployable) that depends on an SDK containing many utility modules. Each module can be selected, ignored, or even customized: a plug-and-play architecture.
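As a minimal sketch of what such a plug-and-play contract could look like (the interface, names, and ServiceContext type here are hypothetical illustrations, not Nuage’s actual SDK):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical context a module receives when installed; in a real
// integration this would carry configs, clients, and endpoints.
interface ServiceContext {}

// Hypothetical plug-and-play module contract: each SDK utility module
// can be selected, ignored, or replaced with a customized implementation.
interface NuageModule {
  String name();                      // e.g. "provisioning", "quotas"
  void install(ServiceContext ctx);   // wire the module into the template
}

// The generic application template assembles only the modules a
// service chooses, keeping integration uniform across services.
final class ApplicationTemplate {
  private final List<NuageModule> modules = new ArrayList<>();

  ApplicationTemplate use(NuageModule module) {
    modules.add(module);
    return this;
  }

  void start(ServiceContext ctx) {
    for (NuageModule m : modules) {
      m.install(ctx);  // each module plugs itself in independently
    }
  }
}
```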

Standard templates versus customization
With regard to customization, too much can prove counterproductive and time-consuming, so we encourage customization only when it is absolutely necessary. For example, if the integrating platform has a complex workflow (represented as a Directed Acyclic Graph) that needs to be triggered on certain conditions, then Nuage provides the solution out of the box. However, if a few steps require human intervention with a special approval process, then this needs a bit of customization. Similarly, for provisioning logic, Nuage provides a basic provisioning algorithm: best fit. But many platforms customize this part of the process to suit their needs.

Centralized versus decentralized architecture

For Nuage, we started with a centralized architecture years ago, when we had just one service to support. Soon after, however, we realized that its growth was unmanageable and that it suffered from many sub-component failures that affected the whole system.

In recent years, there has been an industry-wide trend towards decentralized architectures for large-scale systems. The notion of a decentralized architecture is to pursue local autonomy and communicate with other components only when needed. In the architecture below, the Nuage management microservices interact with feature services and underlying services only when necessary. This architectural change has many benefits, including easy integration with heterogeneous systems and increased availability, scalability, performance, and fault tolerance. However, we do have to make important decisions around the level of autonomy of individual components, decision making in case of partial system failures, and coordination and consistency for certain centralized decisions.

[Figure: Nuage’s decentralized architecture]

The above diagram shows a simplified, high-level view of Nuage’s decentralized architecture. A request originates from the Nuage UI via user-driven actions. The request then hits the routing layer, which forwards it to the corresponding microservice, either a management microservice or a feature service. The management microservices serve as a mid-tier that fans out the request to one or more underlying infrastructure/feature services.

Reactive versus proactive

Nuage has primarily been a reactive system, meaning that the majority of actions are user-triggered. We realized that, in order to stay on top of unforeseen issues, we needed to be more proactive. This change in strategy has helped Nuage quantitatively understand every component of the underlying systems, and it has required Nuage to start collecting more real-time metrics about various system components in order to make the right recommendations at the right time.

This more proactive approach has helped with things like: predicting capacity crises caused by an inorganic growth of hosted resources in a cluster, identifying databases with hot partitions to prevent request throttling, automatically adjusting provisioned capacity for a resource, predicting a cluster expansion, or predicting how an upstream quota change might affect the downstream consumers.

Increased automation
Historically, there were many manual operations in Nuage that required a lot of human intervention. When we shifted to a more proactive approach, the management layer took over the automation of many routine processes. Operations that previously took weeks, were often error-prone yet highly critical, and required heavy lifting and coordination across multiple services are now taken care of by Nuage.

For example, in the event that a database is dropped, Nuage kicks off a workflow that performs the following operations (sketched as a job DAG after this list):

  • Stop accepting any write traffic.

  • Detect and pause all the change-capture streams.

  • Notify all the downstream consumers via change-capture streams or offline snapshots.

  • Revoke all security privileges.

  • Have a cool-off period for downstream consumers to react.

  • Trigger the hard-deletion to clean up the dataset online/offline.
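As a rough illustration, the sketch below wires these steps into a workflow DAG using the Helix Task Framework (which Nuage uses for workflow management, as described later). The job names, commands, and strictly linear ordering are illustrative assumptions, not Nuage’s actual workflow definition.

```java
import org.apache.helix.HelixManager;
import org.apache.helix.task.JobConfig;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.Workflow;

// Hypothetical sketch: the drop-database steps as a chain of Helix jobs,
// each running only after its parent completes.
final class DropDatabaseWorkflow {
  static void submit(HelixManager manager, String database) throws Exception {
    Workflow.Builder wf = new Workflow.Builder("dropDatabase-" + database);

    String[] steps = {
        "StopWrites", "PauseChangeCapture", "NotifyConsumers",
        "RevokePrivileges", "CoolOffPeriod", "HardDelete"
    };
    for (int i = 0; i < steps.length; i++) {
      // one job per step; each command name maps to a registered task factory
      wf.addJob(steps[i], new JobConfig.Builder().setCommand(steps[i]));
      if (i > 0) {
        wf.addParentChildDependency(steps[i - 1], steps[i]);
      }
    }
    new TaskDriver(manager).start(wf.build());
  }
}
```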

Visualizing Nuage as a matrix

Manageability challenges like integrating with new services on a regular basis, supporting many new features for existing services, and accommodating changes to existing features are what led us to adopt the above-mentioned strategies for Nuage and to build a set of generic features that are common across many services.

[Figure] Matrix: the columns, Service1..N, denote LinkedIn’s large-scale services like Kafka, Samza, Ambry, Venice, Espresso, etc. Each row in the matrix represents a generic feature that the Nuage team has built to efficiently manage such massive systems.

Let’s dive deeper into some of the generic features that we have abstracted out and built in the last few years. These features by themselves are scalable, fault-tolerant, adaptable, cross-service pluggable, and used extensively within LinkedIn by product teams.

Resource provisioning
Nuage manages a large fleet of clusters, and getting a coordinated, aggregated view of the geographically distributed clusters is key to resource provisioning. Nuage provides a few default provisioning algorithms, which suit most use cases. However, the algorithms are designed to be extensible with service-specific provisioning logic. Some customizations consider the granular details of all the hosted tenants and their expected rate of expansion while provisioning.
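Roughly, a best-fit algorithm places a resource on the cluster that can satisfy the request while leaving the least spare capacity. A minimal sketch follows, with hypothetical types and ignoring the tenant-level details mentioned above:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical cluster descriptor; a real implementation would also weigh
// tenant details and expected growth, as noted above.
record Cluster(String name, double freeCapacity) {}

final class BestFitProvisioner {
  // Choose the cluster that fits the request most tightly, preserving
  // larger contiguous headroom on the remaining clusters.
  Optional<Cluster> place(List<Cluster> clusters, double requested) {
    return clusters.stream()
        .filter(c -> c.freeCapacity() >= requested)
        .min(Comparator.comparingDouble(c -> c.freeCapacity() - requested));
  }
}
```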

Security (authentication and authorization)
When a new resource is created, Nuage mandates the creation of the necessary access control rules. Who can access the resource and who is allowed to change the rules are set at resource creation time. If an unauthorized caller wants to access the resource, Nuage forwards the request to the respective resource owners for approval. This is achieved by an interceptor framework that analyzes all incoming requests and determines whether a request needs a formal review. Formal reviews are created within Nuage and sent to the resource’s owners, along with a ticket that captures the required information. Critical resources are reviewed and approved with further due diligence by the Security organization. All these rules are frequently polled and consumed by different services’ routers to enforce authorization at scale. Resource owners can also monitor real-time authorization decisions, such as the number of callers allowed or denied over the last few weeks. Such real-time stats increase the operability and debuggability of many production services.
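Returning to the interceptor framework mentioned above, here is a minimal sketch with hypothetical request, policy, and review-queue types (this is not Nuage’s actual framework):

```java
// Hypothetical request shape: who wants to do what to which resource.
record Request(String caller, String resource, String action) {}

// Hypothetical policy lookup against the access control rules set at
// resource creation time.
interface AccessPolicy {
  boolean isPreApproved(Request request);
}

// Hypothetical sink that files a review ticket with the resource owners.
interface ReviewQueue {
  void fileTicket(Request request);
}

// The interceptor lets pre-authorized requests through and converts
// everything else into a formal, owner-approved review.
final class ReviewInterceptor {
  private final AccessPolicy policy;
  private final ReviewQueue reviews;

  ReviewInterceptor(AccessPolicy policy, ReviewQueue reviews) {
    this.policy = policy;
    this.reviews = reviews;
  }

  // Returns true if the request may proceed immediately; otherwise a
  // ticket is filed and the owners approve or deny asynchronously.
  boolean intercept(Request request) {
    if (policy.isPreApproved(request)) {
      return true;
    }
    reviews.fileTicket(request);
    return false;
  }
}
```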

Workflow management
Since Nuage is a management portal for a spectrum of complex services, some operations can take from a few minutes up to several days to complete. Workflows are triggered either by user requests or by cron jobs, and they often perform CPU- and I/O-intensive workloads. This means that we need a distributed, asynchronous workflow management solution. Nuage uses the Helix Task Framework, in which a workflow is represented by a Directed Acyclic Graph (DAG) of jobs. Jobs, in turn, can contain one or more tasks, where a task is the most granular runnable entity. The framework provides load distribution, a configurable limit on the number of concurrent tasks per instance, retries with delayed execution, task prioritization, and instance grouping to assign tasks to a designated group of instances.
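On the execution side, each participant implements tasks and registers a factory per command name so the framework can instantiate and schedule them across instances. A minimal sketch using the Helix Task Framework API (the task itself is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.helix.HelixManager;
import org.apache.helix.participant.StateMachineEngine;
import org.apache.helix.task.Task;
import org.apache.helix.task.TaskFactory;
import org.apache.helix.task.TaskResult;
import org.apache.helix.task.TaskStateModelFactory;

// Hypothetical task: the most granular runnable entity in a job.
final class NotifyConsumersTask implements Task {
  @Override
  public TaskResult run() {
    // ... notify downstream consumers; the framework handles retries ...
    return new TaskResult(TaskResult.Status.COMPLETED, "notified");
  }

  @Override
  public void cancel() {
    // stop promptly if the job is aborted or times out
  }
}

final class TaskRegistration {
  static void register(HelixManager manager) {
    // map command names (as used in JobConfig) to task factories
    Map<String, TaskFactory> factories = new HashMap<>();
    factories.put("NotifyConsumers", context -> new NotifyConsumersTask());

    StateMachineEngine engine = manager.getStateMachineEngine();
    engine.registerStateModelFactory("Task",
        new TaskStateModelFactory(manager, factories));
  }
}
```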

Nuage also supports a special kind of workflow that converts any service call into a reviewable approval process. This satisfies additional security checks and allows for human intervention in cases where it is mandated by LinkedIn’s in-house security team.

Quota enforcement
Most services at LinkedIn have pre-defined SLAs and are provisioned with limited capacity. In order to function within these pre-defined SLAs, quota enforcement is widely adopted by most complex services. Most applications at LinkedIn are built with the Rest.li framework, an open source REST framework for building scalable RESTful backend servers. Nuage enables any application to enforce quotas on its callers through a quota throttling system. The quota values are configured via Nuage by product teams and are persisted in a datastore. The quota enforcement system is built as a library that is readily consumed and run by routers/application servers that need quota enforcement. The library caches the relevant quotas, fetching the initial values from the storage layer during bootstrap. For live updates, Nuage writes to Zookeeper and the library is notified via a callback: the library listens for changes in the relevant Zookeeper nodes, which lets it refresh the target quota almost instantaneously. Throttling is supported across different dimensions, like aggregation windows and percentiles, and enforcement can be based on either the cost of a request or simply on the count. Adoption within LinkedIn has been huge: most engineering teams have clearly defined SLAs and manage their inter-service calls through the quota enforcement system powered by Nuage.
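A rough sketch of that refresh path, using the plain Apache ZooKeeper client (the node layout, wire format, and parse helper are assumptions for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of the quota library's cache: load quotas at
// bootstrap, then keep them fresh via ZooKeeper watches so live updates
// written by Nuage take effect almost immediately.
final class QuotaCache implements Watcher {
  private final ZooKeeper zk;
  private final String quotaNode;
  private final Map<String, Long> quotas = new ConcurrentHashMap<>();

  QuotaCache(ZooKeeper zk, String quotaNode) throws Exception {
    this.zk = zk;
    this.quotaNode = quotaNode;
    refresh();  // initial load; the real library bootstraps from the store
  }

  private void refresh() throws Exception {
    // Re-register the watch on every read; ZooKeeper watches are one-shot.
    byte[] data = zk.getData(quotaNode, this, null);
    // A real library would also evict deleted entries.
    quotas.putAll(parse(data));
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDataChanged) {
      try {
        refresh();  // live update from Nuage: re-read and re-watch
      } catch (Exception e) {
        // in practice: retry with backoff, alert on repeated failures
      }
    }
  }

  long quotaFor(String caller) {
    return quotas.getOrDefault(caller, 0L);
  }

  private static Map<String, Long> parse(byte[] data) {
    return Map.of();  // assumed wire format; real schemas vary
  }
}
```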

Resource usage profiling and prediction
Understanding the current usage of a database/cluster empowers the management layer in several ways. It plays a key role in capacity management, provisioning new resources, identifying over-provisioned and under-provisioned resources, generating cost-to-serve reports for services, calculating headroom and recommending capacity increases or decreases, tracking availability, and regularly reporting overall usage to service SREs. In order to achieve this, Nuage collects the raw usage data, decorates it with mathematical models tailored for various needs, and then stores the information in a way that allows for fast access. This space has a lot of challenges, including profiling accuracy, aggregating and coordinating usage across multi-tenant clusters, and avoiding staleness in cached data. Within LinkedIn, Resource Intelligence is a fairly new initiative focusing on the cost and capacity intelligence required to meet our rapid business expansion in a cost-effective way. Nuage plays a significant role here by automatically triggering remedial and cost-cutting actions based on the usage data. These actions include automatically adjusting quotas based on usage patterns, retiring unused resources, and evaluating resource owners’ capacity needs in terms of business value.
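Returning to headroom, a simplified calculation over raw usage samples might look like the following; the percentile choice and capacity units are illustrative assumptions:

```java
import java.util.Arrays;

// Hypothetical headroom calculation: compare a high percentile of
// observed usage against provisioned capacity to flag over- or
// under-provisioned resources and suggest adjustments.
final class Headroom {
  // Fraction of provisioned capacity left at the p95 usage level;
  // assumes a non-empty sample array in the same units as `provisioned`.
  static double headroom(double[] usageSamples, double provisioned) {
    double[] sorted = usageSamples.clone();
    Arrays.sort(sorted);
    double p95 = sorted[(int) Math.floor(0.95 * (sorted.length - 1))];
    return (provisioned - p95) / provisioned;
  }
}
```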

Cluster metadata management
Nuage manages cluster metadata for a fleet of clusters across different services, where each cluster is geographically distributed, replicated, fault-tolerant, scalable, and highly available. To achieve these design goals, most services manage their clusters using Apache Helix, a generic cluster management solution. Nuage manages many cluster attributes, like the list of all advertised availability zones (AZs), the possible data flows among AZs, the types of supported workloads (small, medium, bursty), the number of hosts, and the maximum available capacity per AZ. These configurations are dynamic in nature and are controlled by the underlying services. Nuage stays up to date by listening to cluster changes, as these parameters directly affect many decisions like provisioning, reporting, alerting, usage tracking, and the visualization of data flows between AZs.
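Staying up to date can be done with Helix’s change listeners; a minimal sketch, assuming an already-connected HelixManager:

```java
import java.util.List;
import org.apache.helix.HelixManager;
import org.apache.helix.NotificationContext;
import org.apache.helix.api.listeners.ExternalViewChangeListener;
import org.apache.helix.model.ExternalView;

// Minimal sketch: react to cluster changes via Helix callbacks so that
// provisioning, reporting, and alerting always see current metadata.
final class ClusterMetadataListener implements ExternalViewChangeListener {
  @Override
  public void onExternalViewChange(List<ExternalView> externalViews,
                                   NotificationContext context) {
    // refresh cached cluster attributes (AZs, capacity, workload types)
  }

  static void register(HelixManager manager) throws Exception {
    manager.addExternalViewChangeListener(new ClusterMetadataListener());
  }
}
```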

Future extension

We are still shifting Nuage from being reactive to proactive, relying on automated, data-driven decisions going forward. This requires collecting real-time statistics from different systems within LinkedIn, aggregating them, and experimenting with many algorithms to make informed decisions on provisioning, capacity management, headroom calculation for a resource, sending warning notifications, and detecting and recovering from failures. As the adoption of Nuage increases, we also want to guarantee availability as it becomes critical to the business. We are considering new strategies to operate seamlessly even with many dependent component failures. This requires the management layer to automatically initiate self-healing workflows to recover from most failures and to learn to operate with partial system failures.

Acknowledgements

Special thanks to the stellar team for their tireless contributions: Vishal Gupta, Terry Fu, Ji Ma, Yifang Liu, Changran Wei, Yinlong Su, Darby Perez, Tyler Corley, and Micah Stubbs. Thanks to the leadership team for their continued support and investment in boosting productivity across all engineering teams: Mohamed Battisha, Eric Kim, Parin Shah, Josh Walker, and Swee Lim.

A huge shout-out to all our partner teams who helped us in extending the platform across many services: Kafka, Espresso, Ambry, Quotas, Graph, Samza, Pinot, Venice, DataVault, and Search.