SQL and coSQL are not diabolically opposed, but deeply connected via beautiful math theory

Erik Meijer, Gavin Bierman

While we don't often think of it this way, the RAM is actually a key-value store where

Languages such as C# and Java make no distinction between r-values and l-values, unlike C or C++, where the distinction is explicit.

In C, the pointer dereference operator ***p** retrieves the value stored at address **p** in the implicit global store.

In any modern object-oriented language we can model **products** using the following declaration, which for each **product** has scalar properties for title, author, publication date, and number of pages, and which contains two nested collections — one for keywords and another for ratings:

Product { string Title ; string Author ; int Year ; int Pages ; [string] Keywords ; [string] Ratings ; }

Given this declaration, we can use object initializers to create a product and insert it into a collection using collection initializers:

var _1579124585 = new Product { Title : "The Right Stuff", Author : "Tom Wolfe", Year : 1979, Pages : 320, Keywords : new [] { "Book", "Hardcover", "American" }, Ratings : new [] { "****", "4 stars" }, } var Products = new[] { _1579124585 };

Now let's redo this example using the relational model.

First of all, we must normalize our nested Product class into three flat tables, for Products, Keywords, and Ratings respectively, as shown below.

Products { int ID; string Title; string Author; int Year; int Pages; } Keywords { int ID; string Keyword; int ProductID; } Ratings { int ID; string Rating; int ProductID; }

As usual in the relational world, we do not model the individual collections of keywords and ratings for each product as separate entities, but instead we directly associate multiple keywords and ratings to a particular product. This shortcut works only for one-to-many relationships. The standard practice for many-to-many relationships (multivalued functions) requires intersection tables containing nothing but pairs of foreign keys to link the related rows.

Perhaps surprisingly for a declarative language, SQL does not have expressions that denote tables or rows directly. Instead we must fill the three tables in an imperative style using loosely typed DML statements:

Products.Insert (1579124585, "The Right Stuff", "Tom Wolfe", 1979, 320); Keywords.Insert (4711, "Book" , 1579124585); Ratings.Insert (787, "****", 1579124585); Keywords.Insert (1843, "Hardcover", 1579124585); Ratings.Insert (747, "4 stars", 1579124585); Keywords.Insert (2012, "American", 1579124585);

An important consequence of normalizing a single type into separate tables is that the database must maintain referential integrity to ensure that:

For example, we cannot delete a row in the Products table without also deleting the related rows in the Keywords and Ratings tables.

Referential integrity implies a closed-world assumption where
transactions on the database are serialized by (conceptually) suspending the world synchronously,
applying the required changes, and
resuming the world again when referential integrity has been restored successfully, rolling back otherwise

On the one hand, it simplifies the life of developers via ACID (atomicity, consistency, isolation, durability) transactions (although in practice, for efficiency, one must often deal with much weaker isolation levels) and allows for impressive (statistics-based) query optimization.

The closed-world assumption, however, is in direct contradiction with distribution and scale-out. The bigger the world, the harder it is to keep closed.

Returning to our example, we present the naïve query to find the titles and keywords of all products that have four stars, expressed directly in terms of the relational model. It creates the cross-product of all possible combinations:

from product in Products from rating in Ratings from keyword in Keywords where product.ID == rating.ProductId && product.ID == keyword.ProductID && rating == "****" select product.Title, keyword.Keyword

The result of this query is the row set. Disappointingly, this row set is not itself normalized.

In fact, to return the normalized representation of our object-graph query, we need to perform two queries (within a single transaction) against the database: one to return the title and its primary key, and a second query that selects the related keywords.

What we observe here is SQL's lack of compositionality — the ability arbitrarily to combine complex values from simpler values without falling outside the system.

By looking at the grammar definition for table rows, we can immediately see that SQL lacks compositionality; since there is no recursion, rows can contain only scalar values:

row ::= new { ..., name = scalar, ... }

value ::= new { ..., name = value, ... } | scalar

SQL is rife with noncompositional features. For example, the semantics of ` NULL`

is a big mess: why does adding the number ` 13 `

to a ` NULL`

value, ` 13+NULL `

, return ` NULL`

, but summing the same two values, ` SUM(13, NULL)`

, returns ` 13`

?

Also, even though query optimizers in modern SQL implementations are remarkably powerful, the original query will probably run in cubic time when implemented via three nested loops that iterate over every value in the three tables. A seasoned SQL programmer would instead use explicit join syntax to ensure that the query is as efficient as our object-based query:

from product in Products join rating in Ratings on product.ID equals rating.ProductId where rating == "****" select product into FourStarProducts from fourstarproduct in FourStarProducts join keyword in Keywords on product.ID equals keyword.ProductID select new { product.Title, keyword.Keyword };

It is not only the programmer who needs to recover the original structure of the code. The database implementer must also jump through hoops to make queries execute efficiently by building indexes that avoid the potential cubic effect.

For one-to-many relationships, indexes are nothing more than nested collections resulting from precomputing joins between tables to quickly find all the rows whose foreign keys point to a row with a particular primary key. Since the relational model is not closed under composition, however, the notion of index has to be defined outside the model.

Two natural indexes correspond to the relationships between Products and Ratings and Products and Keywords, respectively. For each product in the Product table, the ratings index contains the collection of all related ratings:

from rating in Ratings where rating.ProductID == product.ID select rating;

from keyword in Keywords where keyword.ProductID == product.ID select keyword;

One of the advantages touted for normalization over hierarchical data is the ability to perform ad-hoc queries — that is, to join tables on conditions not envisioned in the original data model. For example, we could try to find all pairs of products where the length of the title of one product is the same as the length of the author's name in the other using the following query:

from p1 in Products from p2 in Products where p1.Title.Length == p2.Author.Length select new { p1, p2 };

The ability to create indexes makes a closed-world assumption. For example, if we modify the previous ad-hoc query to find all pairs of Web pages where one page has a URL referencing the other, it should be obvious that building an index for this join is quite a hard task when you do not have the whole Web available inside a single database:

from p1 in WWW from p2 in WWW where p2.Contains (p1.URL) select new { p1, p2 };

Summarizing what we have learned so far, we see that

Let's start by slightly simplifying the example. We do so by removing the object identity of the Ratings and Keywords collections to reflect more directly how they are modeled in the relational world. We inline the Keywords and Ratings items directly into the parent Product, as if we had value-based collections.

We notice two interesting facts:

At this point, it appears that there is a strong correspondence between these two representations. they both consist of:

Is there some precise way of describing such a situation?

Fortunately, there is an entire area of mathematics designed exactly for this: CATEGORY THEORY

Category theory arose from studies of mathematical structures and an attempt to relate classes of structures by considering the relations between them. Thus, category theory expresses mathematical concepts in terms of objects, arrows between objects, and the composition of arrows, along with some axioms that the composition of arrows should satisfy.

A computational view of category theory is that it is a highly stylized, compact functional language. Small as it is, it's enough to represent all of mathematics. For computer scientists, category theory has proved to be a rich source of techniques that are readily applicable to many real-world problems.

The first powerful concept from category theory that we will use is DUALITY.

Formally, given a category **C** of objects and arrows, we obtain the dual category **coC** by reversing all the arrows in **C**. If a statement **T** is true in **C**, then its dual statement **coT** is true in the dual category **coC**. In the context of this article, we can read "opposite statement" for "dual statement".

In other words, the noSQL is the dual of the SQL — noSQL is really coSQL.

The implication of this duality is that coSQL and SQL are not in conflict, like good and evil. Instead they are two opposites that coexist in harmony and can transmute into each other like yin and yang. Interestingly, in Chinese philosophy yin symbolizes open and hence corresponds to the open world of coSQL, and yang symbolizes closed and hence corresponds to the closed world of SQL.

Because SQL and coSQL are mathematically dual, we can reason precisely about the tradeoffs between the two instead of relying on rhetoric and anecdotal evidence.

If we really take the duality to heart, we may also choose to (but don't have to) fine-tune our model for key-value stores to reflect the duality between values and computations, and that between synchronous ACID and asynchronous BASE (Basically Available, Soft state, Eventually consistent).

Looking up a value given its address or key in a key-value store can involve a computation, which in a truly open world has potential latency and may fail. Perhaps an even best example of a computation-driven key-value store with long latency and high probability of failure is the Web (always able to handle a 404), with URIs as keys, "resources" as values, and the HTTP verbs as a primitive query/data-manipulation language. On the other hand, in a C-like key-value memory model, we usually make the simplifying assumption that a key lookup in memory takes constant time and does not fail.

Traversing a relationship in the closed world of the relational model involves comparing two values for equality, which is guaranteed to succeed because of referential integrity; and vice versa, referential consistency dictates that relationships are value-based. Otherwise, we could never be sure that referential consistency actually holds.

Note that comparing keys by value requires that objects in the SQL category are strongly typed, at least enough to identify primary and foreign keys; and dually, since we do not need to know anything about the value of a coSQL object to find it using its key, objects in the coSQL world can be loosely typed.

Our abstract model of the SQL category did not impose any restrictions on the structure of rows; we assumed only that we could determine a primary or foreign key to relate two rows.

In the typical relational model we would further impose the constraint that rows consist of flat sequences of scalar values (the so-called First Normal Form, or 1-NF).

If we dualize relations in 1-NF, then we get a key-value model where values consist of either scalars or keys or collections of scalars or keys.

If we assume that rows in the foreign-/primary-key model are simply blobs and keys are strings, then the dual key-value model is exactly the HTML5 key-value storage model:

interface Storage { readonly attribute unsigned long length; getter DOMString key (in unsigned long index); getter any getItem (in DOMString key); setter creator void setItem (in DOMString key, in any data); deleter void removeItem (in DOMString key); void clear (); }

To talk about queries, we need to be more precise about what we mean by collections of values. Pure relational algebra is based on sets of rows, but actual relational databases use multisets (bags) or ordered multisets (permutations). To model collections abstractly, we look at sets/bags/permutations of rows and apply the category theory dictum: "What is the interface that these various collections of rows implement?" and "How do we generalize queries based on such an interface?"

First, let us stick with simple set collections. When we write a SQL query such as SELECT F(a,b) FROM as AS a, bs AS b WHERE P(a,b) the SQL compiler translates that pretty syntax into a relational-algebra expression in terms of selection, projection, joins, and Cartesian product.

There is no reason to restrict the relational algebra operators to work over just sets (or bags, or permutations) of rows. Instead, we can define similar operators on any collection of values of arbitrary type T.

Rather incredibly, an interface of this shape is well known in category theory. It is called a monad, where the type constructor is a functor of the monad; the constructor is the unit of the monad; SelectMany is the bind of the monad; and ∅ and ∪ are the neutral element and addition, respectively. For the rest of us, they are just the signatures for methods defined on an abstract interface for collections.

This is no theoretical curiosity. We can play the same syntactic tricks that SQL does with relational algebra, but using monads instead. Such monad comprehensions have been recognized as a versatile query language by both functional programmers and database researchers.

For a large subset of coSQL queries, the shape of the query closely follows the shape of the data. Mathematics dictates that coSQL queries are performed using MapReduce.

Practical implementations of MapReduce usually slightly generalize Bird's lemma to use SelectMany instead of Select so that the map phase can return a collection instead of a single value, and insert an intermediate GroupBy as a way to "write" equivalence classes of values from the map phase into the key-value store for subsequent processing in the reduce phase, and then aggregate over each subcollection.

In an open world where collections are distributed across the network, it is much harder for a query optimizer to perform a global optimization taking into account latency, errors, etc. Hence, most coSQL databases rely on explicit programmatic queries of a certain pattern such as MapReduce that can be executed reliably on the target physical machine configuration or cluster.

In contrast to common belief, the question of big versus small data is orthogonal to the question of SQL versus coSQL. While the coSQL model naturally supports extreme sharding, the fact that it does not require strong typing and normalization makes it attractive for "small" data as well. On the other hand, it is possible to scale SQL databases by careful partitioning.