Nuage: Making Data Systems Management Scalable

Meng Zhang

Staff Software Engineer at LinkedIn

August 7, 2018

With more than a half billion members on LinkedIn, we have had to create new ways to scale our infrastructure and support the tremendous growth in data. We’re always looking for the best ways to manage our growth, from our internally-built systems to leveraging technology from Azure for external cloud integrations.

One of the ways we manage this data influx is through Nuage—French for “cloud”—which is LinkedIn’s cloud management team. They are responsible for many large distributed data systems, and the Nuage platform itself is architected as a distributed system. In this post, we’ll introduce a few LinkedIn data platforms and how Nuage makes system management and scalability easy.

Before Nuage, when application developers would develop and release a new product, a lot of human effort was needed to adequately provision, configure, and continuously manage the underlying data infrastructure. This process involved close cooperation and communication between the product team and infrastructure team, which often took a long time and could cause significant delays for a product release. Moreover, engineers from data infrastructure teams had to spend a lot of time handling these joint projects, as well as monitoring all these users’ platforms. This work was not only error-prone but also took time, making it hard for the engineers to focus on platform development itself.

Nuage came out of the vision to make data platform provisioning and management scalable. It is a service that exposes underlying database functionalities through its user interface and Rest APIs. For application developers, Nuage dramatically reduces the need to interact with data platform teams directly. It provides developers with a self-service portal with all the features needed to manage the data platform resources they own, such as CRUD operations, promotion (deploying a database from staging to a production environment), metadata management, schema management, and quota management. Figure 1, below, shows how application developers can use Nuage to create a Kafka topic easily.

Figure 1. Nuage page for creating a new Kafka topic

Meanwhile, for service owners (including platform developers and SREs), Nuage makes data platforms easier to operate, monitor, and debug. Through Nuage, these engineers can conveniently see their resource usages, adjust quotas, and perform capacity predictions. Figure 2 shows how platform admins can view Espresso database resource usage collected by Nuage. Nuage also monitors the health status of underlying platforms and sends out availability alerts if something unusual happens. Moreover, Nuage has comprehensive logs for both application developers and service owners to perform diagnosis and troubleshooting.

Figure 2. Nuage page for displaying Espresso resource usage

In short, Nuage saves a lot of hassle for both application developers and platform service owners. It makes our data platforms much more “invisible” to our application developers and increases the operability of data platforms for our service owners.

Under Nuage: Large-scale distributed systems

Nuage manages the majority of LinkedIn’s data platforms, including storage services like Espresso and Venice and streaming services like Kafka and Brooklin. These data platforms are designed to be horizontally scalable. Nuage integrates with these platforms in different ways based on their unique features and requirements. Figure 3 depicts the overall architecture of Nuage interacting with other LinkedIn entities.

Figure 3. Nuage architecture and integration with LinkedIn Data Infrastructure

Kafka
Kafka is a distributed messaging system used for real-time data activities, like application communication, metrics collection, and log aggregation. It was developed at LinkedIn in 2010 and is the heart of LinkedIn’s data pipeline. The idea of Kafka is simple, which is critical for its scalability. It is a publish-subscribe system, with an arbitrary number of producers and consumers.

Producers generate messages to a Kafka topic, while consumers retrieve messages from this topic. The messages are partitioned and guaranteed to be delivered in order within one partition. Kafka does not throttle the producers’ writing speed so that scalability is not impaired. Instead, to solve the problem of the discrepancy between producer and consumers’ speeds, it provides a commit log and enables each consumer to read at its own pace.

Nuage manages almost all the Kafka topics at LinkedIn. Application developers can create or update their topics through Nuage with necessary information, like Kafka pipeline type, partition number, topic owner, and retention time. Nuage also provides them with the ability to view the metric dashboard, configuration information, topic schema, and cluster information.

Venice
Venice is a distributed key-value storage system. It handles derived data at LinkedIn and leverages Lambda architecture to reduce the problem of batch processing data. It is much more favorable to horizontal scalability compared to a traditional relational database because it uses partitioning and replication. Furthermore, Venice uses efficient caching so that customers do not experience latency in the distributed environment. Venice leverages Helix for its cluster management, which makes it very easy to operate and scale the cluster.

Nuage not only enables users to perform create, read, update, and delete (CRUD) operations of a Venice store but also allows them to evolve their schemas and promote their store. Nuage also provides the resource-monitoring and quota-management functions for the Venice administrators. Application developers can request quotas, including both traffic quotas and disk quotas, from Nuage, and Venice administrators may approve or reject their requests through Nuage. Nuage also continuously monitors resource usage for all the stores. If any overutilization or underutilization happens, Nuage will send alerts to the store owners asking them to make the corresponding adjustment.

Espresso
Espresso is LinkedIn’s distributed document database. It currently hosts lots of important LinkedIn applications, such as member profile, messaging (LinkedIn's member-to-member messaging system), and portions of the homepage and mobile applications, among others. The purpose of Espresso is to fill in the gap between traditionally-used relational database management systems (RDBMSs) and current key-value stores.

Espresso is much more flexible than a traditional relational database for many reasons. For example, it supports easy schema revolution and data center failover. It also incorporates richer features than simple key-value stores, has a change capture stream, and has cross-colo replication for maintaining data consistency.

Espresso provides a hierarchical data model, such as database to table to document. Nuage allows a user to manage all these data layers and their corresponding schemas for his or her own Espresso database. Nuage also provides health monitoring, job management, and quota management for Espresso service owners. Espresso also relies on Helix to provide partition management and fault tolerance. Nuage is working to incorporate the Helix information of a data platform, which will further facilitate the service owner’s monitoring and troubleshooting.

How Nuage handles scalability challenges

As the number of systems it manages grows, Nuage itself also faces a lot of scalability challenges in managing so many large-scale distributed systems with such large footprints of metadata. We seriously consider scalability and incorporate it into all our design decisions to make Nuage a more scalable service.

Caching, caching
As the metadata from the data platforms that Nuage manages grows inconceivably large, it is unrealistic for Nuage to have to consult underlying platforms to serve every request. Instead, Nuage extensively uses caching to reduce latency and improve the user experience.

For example, if a user opens a Nuage page to view all the databases he or she owns, Nuage can render the page in almost no time thanks to caching, as compared to a round trip to the underlying platform, which could significantly hurt the user’s experience. Furthermore, Nuage persists some additional information about the data platforms in its own metadata database, including database owners and descriptions. These fields are for the convenience of application developers and do not need to be passed to the underlying data platform if they get updated.

Distributed for load balancing
Similar to the underlying data infrastructure, Nuage is a distributed system with its frontend and backend services running distributed across multiple data centers at LinkedIn. The customers’ requests are routed to the service machines to which they are closest. Requests are load balanced among a distributed set of Nuage instances across data centers. Distribution avoids single point failure and improves the system’s availability. It is a must-have to support future growth.

Asynchronous to avoid waiting time
Nuage is enriched with an asynchronous approach to better support operations involving some of its data platforms, such as search service, and some operations that usually take a long time to complete, like the database promotion operation. An asynchronous approach dramatically reduces idle cycles and improves the throughput. We are currently trying to modify more operations to leverage an asynchronous framework, as a synchronous approach generally slows down scalability.

Stateless frontend
Nuage is equipped with a stateless frontend as a thin client. Its frontend is not aware of any state change, and for the same user input, it will always provide the same output. This stateless feature makes the frontend easy to test and also enables the system to scale, as users’ requests can be routed to any web server without feeling any difference.

Easy to monitor and diagnose
Scalability not only means how much traffic the system can handle but also encompasses how easy it is to monitor behavior and make a diagnosis when something goes wrong as the load increases. Nuage automatically sends email alerts to Nuage administrators upon certain failures, like if a database provisioning operation fails. This way, we can detect urgent issues in time. Debugging a distributed system for a diagnosis is always difficult and time-consuming. In Nuage, we have set up ELK and do the request tagging so that we can easily find the log to investigate the root of an issue.

Capacity management and prediction
Nuage manages several clusters for every individual data platform. With continuous resource profiling, such as disk usage and resource (such as read quota, write quota) utilization, Nuage has a good understanding of the cluster resource usage history and trends. Therefore, Nuage can provide periodic capacity predictions to the platform teams to give them advanced warning that they need to increase their system’s capacity by either expanding the cluster or moving some databases to a different cluster.

We are also working on enabling Nuage to automatically expand system capacity through underlying resource management tooling, which will further improve the auto-management of our platforms. Capacity prediction helps form the growth plan for underlying database platforms and enables them to scale more smoothly.

Toward future growth at LinkedIn

We expect tremendous data growth in the next few years for LinkedIn. Nuage is playing an essential role in this trend, and what we should prioritize is making Nuage even more scalable and automated. A vital goal in our roadmap is the self-integration framework. Many teams at LinkedIn would like to integrate with Nuage. However, the integration work still requires many collaborations between Nuage and the platform team, with lots of time and effort from engineers.

To scale Nuage, we are making the integration steps more accessible and easier to plug in. Our team is actively working on providing an SDK for all platform developers so that they can plug their platform into Nuage by themselves. They will have the choice to onboard any frontend and backend features. Nuage, as a team, will shift the balance of its duty from onboarding new systems to developing new and added value features.

Acknowledgments

Nuage would not be possible without the hard work from our team members, Terry Fu, Vishal Gupta, Nishant Lakshmikanth, Yifang Liu, Ji Ma, Hunter Perrin, Changran Wei, Lydia Weng, and Vivo Xu, and the leadership and support from our managers, Mohamed Battisha and Eric Kim.

I would like to express my sincere appreciation to all of them. I would also like to extend a special thank to Lei Xia for his great feedback on this blog. Finally, a huge thank you to all data platform teams—Espresso, Kafka, Venice, Samza, Ambry, and others—for their continuous cooperation and support.

Topics: Data Infrastructure