Galene Search Infrastructure
Search is one of the most intensely studied problems in software engineering. It brings together information retrieval, machine learning, distributed systems, and other fundamental areas of computer science. And search is core to LinkedIn. Our 300M+ members use our search product to find people, jobs, companies, groups and other professional content. Our goal is to provide deeply personalized search results based on each member’s identity and relationships.
LinkedIn built its early search engines on Lucene. As we grew, we evolved the search stack by adding layers on top of Lucene. Our approach to scaling the system was reactive and often narrowly focused: we stacked new components onto our architecture, each solving a particular problem, without thinking holistically about the overall system’s needs. This incremental evolution eventually hit a wall, requiring us to spend a lot of time keeping systems running and performing scalability hacks to stretch the limits of the system.
Around a year ago, we decided to completely redesign our platform, given our growth needs and our direction towards realizing the world’s first economic graph. The result was Galene, our new search architecture, which has since been implemented and is successfully powering multiple search products at LinkedIn. Galene has helped us improve our development culture and forced us to incorporate new development processes. For example, the ability to build new indices every week with changes in the offline algorithms requires us to adopt a more agile testing and release process. Galene has also helped us clearly separate infrastructure tasks from relevance tasks. For example, relevance engineers no longer have to worry about writing multi-threaded code, performing RPCs, or scaling the system.
Over the years, LinkedIn has built a series of graph databases to serve real-time queries to what has come to be known as “the economic graph.” Taking shape under the leadership of Scott Meyer and Srinath Shankar, LIquid is the fourth such system. Unlike its predecessors, which were specific to LinkedIn’s then-current schema, LIquid is a general-purpose implementation of the relational model that supports complex multi-join queries of the economic graph.
A graph database is an implementation of the relational model with the following 4 properties:
- All relations are first-class, graph edges.
- Navigation along a graph edge is constant-time.
- Schema evolution is constant-time.
- Query results are a subgraph.
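To make these properties more concrete, here is a minimal Python sketch of a toy in-memory edge store. It is an illustration only and not LIquid's implementation: the class name `EdgeStore`, its methods, and the sample data are all hypothetical. The sketch assumes a hash-map index keyed by (subject, predicate), which gives expected constant-time access to the edges leaving a node, and it shows that a new kind of relation needs no schema migration.

```python
from collections import defaultdict


class EdgeStore:
    """Toy in-memory edge store (hypothetical; not LIquid's actual design)."""

    def __init__(self):
        # (subject, predicate) -> list of objects reachable along that edge type
        self._out = defaultdict(list)

    def add(self, subject, predicate, obj):
        # Every relation is stored the same way: as an edge.
        self._out[(subject, predicate)].append(obj)

    def follow(self, subject, predicate):
        # A single hash lookup: expected O(1) to reach the edge list.
        return self._out.get((subject, predicate), [])

    def edges(self, subject, predicate):
        # Return results as edges (a subgraph), not as bare values.
        return [(subject, predicate, o) for o in self.follow(subject, predicate)]


store = EdgeStore()
store.add("alice", "works_at", "acme")
store.add("alice", "knows", "bob")
# Schema evolution: a brand-new predicate is just another key, no migration.
store.add("bob", "speaks", "italian")
```

Calling `store.edges("alice", "knows")` returns the matching edges themselves, which is the sense in which a result is a subgraph rather than a table of values.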
LIquid is a distributed, in-memory graph database. LIquid’s kernel query language is Datalog operating over a graph expressed as primitive edges of the form “Edge(subject, predicate, object)”. Unlike standard Datalog, which returns sets of variable bindings (basically relations), LIquid’s Datalog returns the subgraph of primitive edges, encoded as a JSON tree by default.
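The difference between returning variable bindings and returning a subgraph can be sketched in Python. This is a toy evaluation of one hypothetical two-hop rule over (subject, predicate, object) tuples, not LIquid's query engine; the function names, the sample edges, and the JSON layout are all assumptions made for illustration.

```python
import json

# Hypothetical sample graph expressed as primitive edges.
EDGES = {
    ("alice", "knows", "bob"),
    ("bob", "works_at", "acme"),
    ("alice", "knows", "carol"),
    ("carol", "works_at", "initech"),
}


def employers_of_friends(edges, person):
    """Evaluate, roughly: Edge(person, "knows", f), Edge(f, "works_at", c)."""
    bindings = set()  # classic Datalog result: tuples of variable values
    subgraph = set()  # graph-style result: the primitive edges that matched
    for (s1, p1, f) in edges:
        if s1 != person or p1 != "knows":
            continue
        for (s2, p2, c) in edges:
            if s2 == f and p2 == "works_at":
                bindings.add((f, c))
                subgraph.add((s1, p1, f))
                subgraph.add((s2, p2, c))
    return bindings, subgraph


def to_tree(subgraph, root):
    """Nest edges under their subject; assumes the subgraph has no cycles."""
    node = {"id": root}
    for (s, p, o) in sorted(subgraph):
        if s == root:
            node.setdefault(p, []).append(to_tree(subgraph, o))
    return node


bindings, subgraph = employers_of_friends(EDGES, "alice")
tree_json = json.dumps(to_tree(subgraph, "alice"), indent=2)
```

Here `bindings` holds pairs such as `("bob", "acme")`, while `subgraph` holds the four edges that produced them, and `tree_json` nests those edges under `alice`. The nested loop is a naive join chosen for readability; a real engine would use indexes.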