Graph Systems

How LIquid Connects Everything So Our Members Can Do Anything

Imagine a tool that can store and connect all the information you need to make decisions and solve problems. Most people would say it’s nice to think about, but not yet possible. The good news is this tool already exists - and it’s called a graph database. At LinkedIn, technologies like graph databases are essential to powering today's platform, while being flexible enough to scale for our future needs. In earlier blog posts, we defined what a graph database is and also shared how to index graph data for fast, constant-time access. One of the biggest (and most important) questions that we haven’t answered is how we utilize graph databases to deliver tangible benefits for our members. 

In this post, we will explain how LIquid, LinkedIn’s graph database, delivers value for our members and customers by helping them quickly find what they're interested in, like getting advice on skills, finding specialized talent for an important project, and even connecting with a coach, all in real time. 

Background

Graphs are a view of the world in which the most important aspects are the relationships between entities like members, schools, companies, and skills, along with the paths from one entity to another through those relationships. In a graph database, the connections between pieces of data are already built in, so you can easily explore the relationships between different pieces of information and navigate complex networks. For example, you receive suggestions to connect with “People You May Know” based on friends-of-friends and shared interests (companies, skills, groups, etc.), or get a list of “Jobs you may be interested in” aligned with your skills and interests.

Figure 1: People You May Know (PYMK) suggestions on the My Network tab

A knowledge graph is like a big web of information in which the way things are connected helps us understand and reason about the world. It includes connections between people, skills, jobs, experiences, and events, and each entity and each connection carries a label that gives it a specific meaning. Queries, or questions, examine the connections, labels, and relationships within the graph, and the resulting data is used to answer those questions. LinkedIn’s domain-specific knowledge graph, the Economic Graph, is our digital representation of the global economy that we use to answer questions about, and provide insights into, the dynamics of the global workforce and job market.
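To make the idea concrete, here is a minimal sketch (in Python, not LIquid’s actual data model or query language) of a labeled graph stored as (source, label, destination) triples, with a simple question answered over it. All entity and label names are made up for illustration.

```python
# Illustrative only: a toy labeled graph stored as (source, label, destination)
# triples, and a simple "question" answered by scanning those edges.
edges = {
    ("alice", "worked_at", "acme"),
    ("bob", "worked_at", "acme"),
    ("alice", "has_skill", "sql"),
    ("bob", "has_skill", "sql"),
    ("carol", "attended", "devconf"),
}

def who_has_skill(skill):
    """Return every entity connected to `skill` by a `has_skill` edge."""
    return {src for (src, label, dst) in edges
            if label == "has_skill" and dst == skill}

print(who_has_skill("sql"))  # prints {'alice', 'bob'} (set order may vary)
```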

The Ability To Connect Everything

By hosting LinkedIn’s Economic Graph, LIquid automates the indexing of and real-time access to all connections to other members, schools, skills, companies, positions, jobs, events, groups, and more. This knowledge graph is massive, with 270 billion edges and growing; as more people join and make new connections, the number of edges increases. Currently, we handle a workload of 2 million queries per second, and we expect that number to double in the next 18 months as more people join and use LinkedIn.
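Our earlier post on indexing covers the real data structures; the toy sketch below only illustrates the general idea of an adjacency index keyed by (source, label), so that finding all of a member’s connections is a single lookup rather than a scan. None of this reflects LIquid’s internal layout.

```python
from collections import defaultdict

# Minimal sketch of an adjacency index: out-edges grouped by (source, label) so
# that "all of a node's connections" is one hash lookup, not a scan of every edge.
edges = [
    ("alice", "connected_to", "bob"),
    ("alice", "connected_to", "carol"),
    ("bob",   "connected_to", "dana"),
]

index = defaultdict(set)
for src, label, dst in edges:
    index[(src, label)].add(dst)

def neighbors(node, label="connected_to"):
    """Expected constant-time lookup of every `label` edge leaving `node`."""
    return index[(node, label)]

print(neighbors("alice"))  # {'bob', 'carol'}
```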

One way that the Economic Graph provides value for our members is through second-degree connections (the colleague of an old school friend or the new boss of a co-worker from a prior job), the paths to get to them, and their surroundings. 

Figure 2: PYMK triangle-closing queries

Being able to do this at our scale and in real time is not a simple task. If someone has 1,000 connections and even just 10% of those people have 1,000 connections of their own, the number of second-degree connections reaches 100,000 very fast. If anyone in this cohort has over 10,000 connections, that makes the selection of important connections even harder. 
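As a rough illustration of that fan-out, here is a sketch of a two-hop expansion; the `neighbors` function is assumed to be any lookup that returns a member’s first-degree connections, such as the index sketch above.

```python
# Toy illustration of second-degree fan-out. `neighbors` can be any function that
# maps a member to their first-degree connections (e.g., the index lookup above).
def second_degree(member, neighbors):
    """Everyone exactly two hops away, excluding the member and direct connections."""
    first = set(neighbors(member))
    second = set()
    for friend in first:
        second.update(neighbors(friend))
    return second - first - {member}

# With 1,000 first-degree connections, 10% of whom have 1,000 connections of
# their own, this candidate set already approaches 100,000 members.
```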

To handle this growth in real-time queries and ensure uninterrupted access for our members, LIquid’s inherent design allows it to scale up to ten times its current size. This means it can easily accommodate both organic growth and new semantic domains for our 930+ million members. It also automatically expands to accommodate the size of the graph and volume of activity, providing 99.99% availability for our members.

LIquid provides a comprehensive foundation for our developer community to enhance the member experience by letting them use a composable, declarative query language based on Datalog to ask questions about the data they want to see, backed by technology that finds that data quickly and efficiently. A composable language allows developers to build on existing features (called modules), and a declarative language allows developers to focus on expressing what they want while LIquid automates efficient access. 
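The following sketch is not LIquid’s Datalog syntax; it is only a Python illustration of what “composable” and “declarative” mean here: a query is plain data (edge patterns with variables), reusable fragments can be joined into larger queries, and the caller never spells out how the engine should evaluate them.

```python
# Hypothetical sketch of a declarative, composable query: patterns are data,
# and this tiny evaluator finds every variable binding that satisfies them.
def match(patterns, edges, binding=None):
    """Yield every variable binding that satisfies all edge patterns in order."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    (s, label, d), rest = patterns[0], patterns[1:]
    for src, lbl, dst in edges:
        if lbl != label:
            continue
        b = dict(binding)
        ok = True
        for var, val in ((s, src), (d, dst)):
            if var.startswith("?"):
                if b.setdefault(var, val) != val:
                    ok = False
            elif var != val:
                ok = False
        if ok:
            yield from match(rest, edges, b)

# A reusable "module": one connection hop.
connection = [("?a", "connected_to", "?b")]
# Composed friend-of-friend query; we state *what* we want, not how to fetch it.
friend_of_friend = connection + [("?b", "connected_to", "?c")]

sample = {("amy", "connected_to", "bo"), ("bo", "connected_to", "cal")}
print(list(match(friend_of_friend, sample)))
# [{'?a': 'amy', '?b': 'bo', '?c': 'cal'}]
```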

The increase in velocity afforded by modular, composable, and declarative queries in LIquid allows us to bring better experiences to our members and customers faster, because our developers can make quick and easy changes to a dataset and to the queries traversing it. Previously, making changes to these databases was a major bottleneck, with adjustments and updates taking multiple months. With LIquid serving as an index, our developers can now add new schema and data within two weeks.

The Power To Do Everything

LinkedIn’s “People You May Know” (PYMK) feature has long helped our members form connections with other members and expand their networks. To gain efficiencies and drive better results for our members, we migrated our PYMK recommendation system from a legacy system named GAIA to LIquid. Earlier versions of this recommendation system were computed offline, and the information was limited in scope and only updated periodically. To deliver deeper insights and more relevant recommendations for our members, the team needed a different technical solution, one that could handle the expected scale-up in volume, queries, and members without the cost of the real-time system increasing exponentially. 

Figure 3: PYMK architecture using LIquid for graph traversal

Figure 3 illustrates the architecture of the system. A second ranking function is applied to the hundreds of candidates generated through LIquid queries on the Economic Graph. This ranking function uses machine-learned features from Venice and analytics insights from Pinot to score and select the top candidates. A filtering step then prepares this ranked list for rendering and final scoring. 
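A hypothetical outline of that serving path is sketched below; the function names, feature shapes, and the `top_k` cutoff are assumptions, with Venice and Pinot lookups represented only as injected callables.

```python
# Hypothetical shape of the serving path in Figure 3: candidates come from LIquid
# graph queries, are scored with features (e.g., Venice) and analytics signals
# (e.g., Pinot), then filtered. All names and shapes here are illustrative.
def recommend(member_id, fetch_candidates, fetch_features, score, top_k=25):
    """Candidate generation -> feature lookup -> ranking -> filtering."""
    candidates = fetch_candidates(member_id)          # LIquid graph traversal
    features = fetch_features(member_id, candidates)  # dict keyed by candidate
    ranked = sorted(candidates, key=lambda c: score(features[c]), reverse=True)
    return [c for c in ranked[:top_k] if c != member_id]  # simple filtering step

# Example wiring with stub functions:
# recommend("m1",
#           lambda m: ["m2", "m3"],
#           lambda m, cs: {c: {"mutual": 3} for c in cs},
#           lambda f: f["mutual"])
```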

This architecture employs LIquid to answer queries that are complex and completely new, with both short latencies and acceptable hardware cost. In the initial phase, a lot of engineering effort went into optimizing LIquid to meet the required speed and cost, as well as into measuring the user impact. We achieved parity with GAIA’s hardware and compute cost when we launched. Since then, we have continued to improve on that baseline and have gained further hardware efficiencies even as the workloads have scaled up. 

LIquid introduced new database indexing techniques that made online querying of the data possible. As a result, query results reflect very recent data: a connection that is only seconds old can now be used to create new recommendations. The use of declarative queries also allows more diverse, targeted, and explainable results, e.g., people you may know because you worked at the same companies, attended the same events, or share the same skills. 
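One way to picture explainability is to keep, for each candidate, the edge type that produced them; the labels and reason strings below are purely illustrative and reuse the (source, label) index shape from the earlier sketch.

```python
# Sketch of "explainable" recommendations: remember which shared edge type
# produced each candidate so the UI can say *why* someone is suggested.
def candidates_with_reasons(member, index):
    """Map candidate -> human-readable reason, keeping the first reason found."""
    reasons = {}
    shared_edge_reasons = [
        ("connected_to", "connection of a connection"),
        ("worked_at", "worked at the same company"),
        ("attended", "attended the same event"),
    ]
    for label, reason in shared_edge_reasons:
        for middle in index.get((member, label), set()):
            for (src, lbl), dsts in index.items():
                if lbl == label and middle in dsts and src != member:
                    reasons.setdefault(src, reason)
    return reasons
```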

Rather than being limited to just a few queries, the client team can now run many queries with LIquid and develop a comprehensive, explainable set of them.

LIquid also helped our developers speed up their experimentation process, turning many features that relied on empirical insight into data-driven, high-velocity experiments. In the past, it could take months for an offline experiment to be ramped up and its impact measured. With LIquid, our developers can start an A/B experiment within hours simply by changing the query parameters. The system generates only the necessary data, minimizing the required compute resources, and sensitive parameters can easily be adjusted to optimize the results. 
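A hedged sketch of what “an experiment is just query parameters” can look like: two variants differ only in the parameters passed to the same query, and members are bucketed deterministically. The variant names and parameters are hypothetical.

```python
import hashlib

# Illustrative only: an A/B experiment expressed purely as query parameters,
# so a new variant needs no new offline pipeline. All values are made up.
VARIANTS = {
    "control":   {"max_hops": 2, "edge_types": ["connected_to"]},
    "treatment": {"max_hops": 2, "edge_types": ["connected_to", "attended"]},
}

def pymk_query_params(member_id: str):
    """Deterministically bucket a member into a variant and return its parameters."""
    bucket = int(hashlib.sha256(member_id.encode()).hexdigest(), 16) % 2
    variant = "treatment" if bucket else "control"
    return variant, VARIANTS[variant]
```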

Our implementation of LIquid for PYMK has delivered some exceptional outcomes. Query throughput has increased from 120 QPS to 18,000 QPS, average latency has dropped from over 1 second to under 50 ms, and CPU utilization has decreased by more than 3x.

Moreover, we improved our member engagement metrics. Starting with just two queries, we have since added another dozen for PYMK that now provide real-time, relevant insights for our members. This success has led other platform applications to adopt LIquid queries, including Social Seeking Notification and the Hiring In Your Network Job Collection.

What next?

Connecting everything requires continuous growth in the Economic Graph footprint. LIquid’s current architecture is homogeneous, meaning that all data is treated equally. As the graph grows, inefficiencies become visible in two dimensions. First, some data is important but is touched by only a small number of queries. Second, some traffic to the data does not require the same stringent real-time guarantees. Exploiting this heterogeneity is how we will handle the next few orders of magnitude, a 10x or 100x increase in both footprint size and query volume. In database research, these approaches are called tiered storage and workload optimization, respectively. Auto-tuning storage and workloads will allow dynamic optimization of these cost-saving features.
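As a loose illustration of the tiered-storage idea, a placement policy might assign data to tiers based on how often it is queried and whether it needs real-time access; the thresholds and tier names below are invented for the example.

```python
# Illustrative placement policy for tiered storage; thresholds are made up and
# do not reflect LIquid's actual design.
def assign_tier(queries_per_day: int, needs_real_time: bool) -> str:
    if needs_real_time or queries_per_day > 1_000_000:
        return "hot"   # in-memory, replicated for low latency
    if queries_per_day > 10_000:
        return "warm"  # SSD-backed
    return "cold"      # cheaper bulk storage, batch access
```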

The roadmap to enable members to do anything requires increased sophistication in answering questions for our members. This can improve along two main axes. First, increasing the complexity of queries and the variety of data sources added to the Economic Graph will enable new features to be developed and surfaced. Second, enriching the data will improve our ability to reason over it. This can be achieved by creating derived data (either through deterministic algorithms or probabilistic machine-learned methods) or by improving reasoning through richer semantics in the Knowledge Graph (KG) schema. We plan to focus both on high-performance graph compute and analytics and on building a KG ecosystem to enable our developers to further enhance the member experience. 

LIquid is now an essential component in delivering value to our members, providing real-time access to relevant data for three other knowledge systems at LinkedIn. This successful implementation has inspired other teams within LinkedIn to adopt it as a graph index and has drawn interest from sister teams at Microsoft.