LIquid: The soul of a new graph database, Part 2
September 16, 2020
Co-authors: Scott Meyer, Andrew Carter, and Andrew Rodriguez
Editor’s note: This is the second part of a two-part blog series. Part 1 described how graph data relates to the in-memory graph of structures commonly used as models (as in “model-view-controller”) by applications, how graph data can be stored in a conventional relational database, and why conventional relational databases don’t work very well on this sort of data. Rather than scrapping the relational model, in this installment, we’ll show how relational graph data can be processed efficiently.
Graphs as relational data
In the last post, we started by discussing the design of LIquid in a straightforward way, beginning from the application programmer’s view, a graph of objects in memory, and solving problems as we encountered them. However, having arrived at a point where we have relational edge data (the graph log) and an inverted index (the objects), we might want to look at a conventional relational approach to this problem. In fact, representing a graph as an adjacency list of edges in a tabular relational system is easy:
There are obvious benefits to adopting this restricted form of the standard tabular relational model, including:
Use of the well-studied relational model: It is difficult to argue that set theory and first-order logic should be considered harmful or something to avoid. The relational model has had enormous practical success, and for very good reasons.
No NULLs: Edges either exist or do not. The system is not forced to assign meaning to something like Edge("x", "age", NULL), so there’s no need for built-in trinary logic. The application must explicitly say what it intends (0, MAXINT, …) when nulls arise due to disjunction in queries.
No normalization problems: This data is in fifth normal form. There’s no way for the user to express denormalization.
Easy data portability: Today, it is difficult to move data from one database to another. There's no standard encoding for a set of tables. For single tables, in our specific case (edges expressed as triples of strings), data exchange is easy: CSV and spreadsheets are common tabular interchange formats.
This form of adjacency-list graph encoding is simple, well-known and has been available for decades. So again, “Why aren’t we done?” Basically, two reasons:
Poor performance: Traditional relational database systems, based on a static tree of relational algebra operations applied to sorted storage, perform horribly on this schema. For a static query planner based on table-level statistics, self joins are intensely problematic.
SQL: SQL (an implementation relational algebra) is extremely awkward to use for building applications. It is verbose, composes awkwardly, and naturally returns only results which can be expressed as a single table. Colloquially, this awkwardness is known as the “object-relational impedance mismatch” or worse, if the speaker has suffered through a recent adverse encounter with it.
Our workload, and a very common application workload generally, is browsing: inspecting a small portion of a very large graph. While the subgraph displayed on a browser page is tiny relative to the size of the overall graph, it has an intricate structure because human beings like to know what is interesting about the object they are looking at. The declarative query that retrieves the subgraph necessary to display a page will commonly have tens or hundreds of edge-constraints, self-joins with the edges table in this data model. Unavoidably, the bulk of query evaluation for such queries turns into random probing of the edges table.
If the table is implemented by a sorted structure, some form of B-tree is nearly universal in tabular relational systems, random probing is O(log(N)), and N and the constant factors involved are large.
Concretely, at terabyte scale, an immutable B-tree will probably have at least three levels, with only the first being relatively hot in the L1 cache. A single probe will require two probably-uncached binary searches, easily dozens of L3 cache misses to materialize a single edge. Mutability will make this performance worse. In contrast, the index structures that we describe in “Nanosecond Indexing of Graph Data with Hash Maps and VLists” (presentation) will do a probe for a single edge in either 2 or 3 L3 misses.
The fact that social graph data is frequently skewed, with the largest fan-outs being many orders of magnitude larger than the average, exacerbates the performance difference between sorted and hash storage. In a sorted index, finding the size of a result set without actually materializing it will require a second binary search. In practice, it is difficult to decide when the second binary search is worth doing. In a hash-based index, the size of a set can be stored in the top-level hash table, available in 1 L3 miss. Detecting skew in advance of experiencing its adverse effects is cheap and easy.
Shifting our attention from storage to query evaluation, we find that relational algebra is inefficient, both as a declarative specification and as execution machinery. Consider the following simple graph query about skills and employers in someone's second degree network:
As notation, this is cumbersome. Syntax alone is a solvable problem. However, notice the redundancy in the output: pairs like (Java, Oracle) or (a1, b1) occur multiple times, even though there is just one underlying edge. This spurious cross product in the result table is a fundamental problem. Both the data and the result desired by the user are graph-shaped. Representing a graph as a table of tuples of variable bindings forces cross products, two of them in this very simple example. In theory, this is just a protocol problem brought on by SQL’s use of a single table for query results—one we could fix by returning matching rows from multiple tables instead of just one. However, in practice, tabular relational systems really use relational algebra to evaluate queries, and thus spurious cross products are actually an efficiency concern. As one of our co-authors, Andrew Carter, once quipped, “It is difficult to have MxN stuff without doing MxN work.”
Sadly, this isn’t the end of the bad news for relational algebra. Graph-shaped queries—those with circular constraints among the variables—turn out to be important, with triangular or diamond-shaped constraints being a common task. Recent work on “Worst-case Optimal Join Algorithms” has shown that “any algorithm based on the traditional select-project-join style plans typically employed in an RDBMS are asymptotically slower than the optimal for some queries,” with a simple triangular join being the leading example.
LIquid uses a cost-based dynamic query evaluator, based on constraint satisfaction applied to a graph of edge-constraints. In this context, “dynamic” means that the query plan is not known when evaluation starts. In a traditional SQL query evaluator, the query plan (a tree of relational algebra operators) is completely determined before evaluation starts. In contrast, LIquid’s evaluator establishes path consistency of the result subgraph by repeatedly satisfying the cheapest remaining constraint. For cost computations, the evaluator relies heavily on constant-time access to set sizes in the index. We’ll describe the query evaluator in more detail in an upcoming paper; for now, let’s just say that we think that the performance problems traditionally associated with a relational graph data model are tractable.
This leaves us needing only a relational query language with a succinct, composable syntax. Like this, perhaps:
The above is the same query, composed from two sub-queries.
While Datalog has a dearth of industrial-strength implementations, it is extremely well studied, commonly used in the study of relational databases. To adapt Datalog for our purposes, we need only require that Edge-relations be the only form which can be asserted. All rules must ultimately reduce to Edges.
The shape of query results can be selected by the user: the traditional table of rule-head variable bindings as in SQL, or the actual edges used to satisfy the query. Aside from possibly forcing cross products after the constraints have been satisfied, choosing tabular results has no impact on query evaluation. However, if what the user needs is a complex, graph-shaped result, such as might naturally be produced by a complex, composed query, then an efficient way to consume it is important. Tabular results and spurious cross products are the crux of the object-relational impedance mismatch. Returning query results as a subgraph (a set of edges) allows arbitrarily complex queries to be answered in a single, consistent operation.
Now we have an in-memory relational database with hash-based indexes and Datalog as a query language, with a dynamic query planner based on constraint satisfaction. Are we done yet? Not quite. Relative to object-graphs or SQL, we lack schema. All values are just strings. Worse, we have no way to insist that certain relationships be unique—for example, that an entity should have just one name. Without enforced uniqueness, there’s no way to map a graph into a normalized table. Consider:
Generally, people have one name and one birthday. For this query, a single table is an excellent return value. However, if the graph allows multiple name or date of birth edges on an entity, we will have to expand the cross product and the user will have to re-normalize manually.
Tables are common and useful, and users reasonably require this mapping. What we need to repair this default is a way to describe the ends of an edge: the scalar type (how to interpret the literal string that identifies the vertex) and cardinality constraint (how many incident edges) for subject and object. As with the SQL catalog, which describes tables, we encode this description of edges as graph data so that the graph is self-describing:
liquid/type and liquid/cardinality are “bootstrap predicates,” known to the database implementation and used to define the schema. Now, we could describe a predicate as follows:
We say “could” because it is a bit awkward to ask the user to define extra identities for the subject and object metadata, “name_sub” and “name_obj”. This will be fixed presently.
Scalar types such as liquid/node, liquid/string, or liquid/int determine parsing rules with which the corresponding identity must comply for the write to succeed. Identities can and do serve multiple purposes:
As in the above example, the user is expected to disambiguate in the query if need be. The database implementation is free to exploit type information, for example by sorted or quantized indexing or immediate storage of small integers.
While many relationships are naturally represented as a subject-predicate-object triple: mother, name, author, there are also many which are not, like an employment tenure (person, company, start date), or an actor’s film performance (actor, role, film). For ternary relationships, we could force them into the existing Edge by making one of the vertices do double duty as a predicate: “Harrison Ford Han-Solo’d in Star Wars,” but this is a one-off hack that doesn't generalize to n-ary relationships.
A graph data model should allow arbitrary n-ary “compound edges” to behave just like simple edges, as unique relationships which either exist or do not. Given that edges can be used blithely, it is tempting to start on a solution by building n-ary relationships out of edges:
With pointer-like index performance, the extra hop should not be a performance problem. However, the need to fabricate an identity of the hub node, "x", is a problem. In particular, if the user simply creates arbitrary hub identities, the FilmPerf assertion does not behave like an Edge assertion. Duplicate FilmPerf compound edges could occur. However, if we choose a systematic encoding for the hub node, say something like "actor:Harrison Ford,film:Star Wars,role:Han Solo", FilmPerf assertions will have the same identity semantics as edges. To allow the database implementation to do this encoding for us, we introduce compound graph schema: a systematic way of expressing relative identities. A compound is specified by some meta-structure which indicates that a set of predicates is used to generate a relative identity:
The subject side of the predicate always refers to the relative identity, the “compound hub.” The object side is specified by the user. For convenience, a compound definition implicitly generates a corresponding rule with a reserved @-suffix. Here’s what the FilmPerf compound actually looks like:
Notice that we can use the generated relative identity, cid, just like any other node. Here, we indicate that this film performance was a breakthrough. In tabular relational terms, a compound edge behaves exactly like a compound primary key.
Since compounds are known to the database implementation, the database is free to optimize compound relationships. In fact, a simple generalization of the index structures described in the start of this writing will allow compound edges and attributes which refer to them to be accessed as fast as atomic Edges. The database implementation can optimize away the extra hop. Effectively, we get the expressiveness and performance of a property graph without the need for the user to define or operate on properties specially.
Instances of FilmPerf behave “like vertexes” in that edges can refer to them, and “like edges” in that they can be deleted. Sometimes we need compound relationships that only behave “like vertexes.” We just want a scalar value with some queryable internal structure:
Such vertex-compounds allow us to encode arbitrary tree structures directly in graph data, which is fully accessible to the same Datalog query machinery used on an ordinary graph.
Generally, compounds define “relative identity,” a notion which turns out to be surprisingly useful. In particular, we use it to define “the subject side” and “the object side” of an edge in predicate schema. The real definition of DefPred looks something like this:
This allows us to have distinct relative identities for a predicate: “the subject side,” sm, and “the object side,” om, without bothering the user to come up with unique identifiers for them.
LIquid’s nearest neighbors
Let’s define what we’ve built. A “graph database” is an implementation of the relational model with the following properties:
All relations are equal and first-class. While a useful form of index for predictable access patterns, employing tables as the physical data model results in inequities as soon as there is more than one table. Tuples expressed in a single physical table are vastly more performant than those expressed by joining two or more tables. Lastly, the lack of domain support in tabular relational systems makes the user's intentions with respect to possible joins opaque. In a graph database, the existence of an edge is exactly expressive of user intent.
Edge traversal is fast and constant-time. If everything is stored as edges, we’re going to be doing a lot of joining, specifically self-joins. Every query with small intermediate results will almost certainly devolve into random access.
Query results are a subgraph. Application models are complex and applications need to be able to retrieve arbitrarily complex models in a single query with a single consistency guarantee.
Schema-evolution is constant-time. If schema-evolution (really, the addition of new predicates) takes non-constant time, then the database implementation must be pretending to operate with edges while maintaining structs or tables under the covers.
There are many things in the world that call themselves graph databases, roughly divided into three families:
RDF, SPARQL: triple stores
Property graphs: Neo4J, Facebook’s Tao, various Gremlin implementations
SQL and SQL extensions such as SQL Server’s graph support
Aside from #3, none of them have clear relationships to the relational model. None of them support n-ary compound edges. Query evaluation is either relational algebra-based, or imperative (traversal-based).
RDF is quite close to LIquid’s Edge in spirit, but has introduced a great deal of complexity. Triples are asymmetric. Actually, they are quads. There is a language tag that is difficult to square with both unicode string encoding, which is multilingual, and no type system (there are several) that can express cardinality constraints like: “one name in any language.” SPARQL seems to have inherited most of SQL’s ailments.
Property graphs introduce a second schema for properties and force the user to choose between representing data as edges or as properties of nodes or edges. A node with properties is basically an object and suffers from all of the drawbacks of OODBs. The obvious encoding of a property graph as a table of vertices (with properties) and a table of edges (with properties) will force two lookups for any query involving properties of vertices and edges.
While many graph database implementations are effectively “in RAM,” none explicitly optimize for random access with hash-based indexing, or exploit fast random access to set sizes in query evaluation. Nothing targets complex queries or supports the query composition necessary to do so easily.
We have defined what a graph database is and described the implementation of one that supports the Datalog query language. We have also suggested how to index graph data for fast, constant-time access and how to construct a dynamic query planner based on constraint satisfaction. The graph data model is simple and universal, and features a self-describing schema which allows us to specify relationship cardinality and value domains. And, this schema allows us to define n-ary compound edges, which we can use to implement property-graphs. So, “Why aren’t we done?”
Yeah actually, we’re pretty much done.
Did we mention that LIquid is fun to use?
Thanks to team LIquid at LinkedIn for many years of hard work making this database a reality. In particular: Bogdan Arsintescu, Roman Averbukh, Andrew Carter, Juan Colmenares, Ionut Constandache, Peter Li, Scott Meyer, Andrew Rodriguez, Siddharth Shah, Yiming Yang, and Jiajun Yao.