SQL and coSQL are not diametrically opposed, but deeply connected via beautiful mathematical theory

Erik Meijer, Gavin Bierman

SQL and coSQL


While we don't often think of it this way, RAM is actually a key-value store where

  • keys are addresses (l-values), and
  • values are the data stored at some address in memory (r-values).

    Languages such as C# and Java make no distinction between r-values and l-values, unlike C or C++, where the distinction is explicit. In C, the pointer dereference operator *p retrieves the value stored at address p in the implicit global store.
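To make the analogy concrete, here is a minimal Python sketch (Python rather than C, purely for illustration) that models memory as a dictionary from addresses (l-values) to stored data (r-values); dereferencing is just dictionary lookup:

```python
# A toy model of RAM as a key-value store:
# keys are addresses (l-values), values are the stored data (r-values).
memory = {}

def store(address, value):
    # like "*p = value" in C: write an r-value at an l-value
    memory[address] = value

def dereference(address):
    # like "*p" in C: read the value stored at address p
    return memory[address]

store(0x10, 42)
store(0x18, "hello")
```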

    In any modern object-oriented language we can model products using the following declaration, which for each product has scalar properties for title, author, publication date, and number of pages, and which contains two nested collections — one for keywords and another for ratings:

    Product
        {   string     Title    ;
            string     Author   ;
            int        Year     ;
            int        Pages    ;
            [string]   Keywords ;
            [string]   Ratings  ;
        }
    

    Given this declaration, we can use object initializers to create a product and insert it into a collection using collection initializers:

    var _1579124585 = new Product
    {
       Title    = "The Right Stuff",
       Author   = "Tom Wolfe",
       Year     = 1979,
       Pages    = 320,
       Keywords = new[] { "Book", "Hardcover", "American" },
       Ratings  = new[] { "****", "4 stars" },
    };
    
    var Products = new[] { _1579124585 };
    
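For readers who prefer a runnable sketch, the same nested product can be written in Python (an illustrative stand-in for the C#-style pseudocode above), with the Keywords and Ratings collections nested directly inside the parent value:

```python
# The product as a nested object graph: both collections are
# nested inside the parent, mirroring the class declaration.
product_1579124585 = {
    "Title":    "The Right Stuff",
    "Author":   "Tom Wolfe",
    "Year":     1979,
    "Pages":    320,
    "Keywords": ["Book", "Hardcover", "American"],
    "Ratings":  ["****", "4 stars"],
}

products = [product_1579124585]
```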
    Note that the two nested collections for the Keywords and Ratings properties are both represented by actual objects with their own identity.

    Now let's redo this example using the relational model.

    First of all, we must normalize our nested Product class into three flat tables, for Products, Keywords, and Ratings respectively, as shown below.

    Products
    {              
       int      ID;             
       string   Title;  
       string   Author;  
       int      Year; 
       int      Pages;                    
    }   
       
    Keywords
    {                
      int       ID;      
      string    Keyword;  
      int       ProductID; 
    }
    
    Ratings
    {
      int       ID;
      string    Rating;
      int       ProductID;
    }
    
    Each record in the relational world requires a new primary key property (here all called ID). Furthermore, the Keywords and Ratings tables require an additional foreign-key property ProductID that encodes the one-to-many association between Products and Keywords and Ratings. Later we will decorate these class declarations with additional metadata to reflect the underlying database tables.

    As usual in the relational world, we do not model the individual collections of keywords and ratings for each product as separate entities, but instead we directly associate multiple keywords and ratings to a particular product. This shortcut works only for one-to-many relationships. The standard practice for many-to-many relationships (multivalued functions) requires intersection tables containing nothing but pairs of foreign keys to link the related rows.

    Perhaps surprisingly for a declarative language, SQL does not have expressions that denote tables or rows directly. Instead we must fill the three tables in an imperative style using loosely typed DML statements:

    Products.Insert (1579124585, "The Right Stuff", "Tom Wolfe", 1979, 320);
    Keywords.Insert (4711, "Book" , 1579124585);                
    Ratings.Insert  (787, "****", 1579124585);
    Keywords.Insert (1843, "Hardcover", 1579124585);
    Ratings.Insert  (747, "4 stars", 1579124585); 
    Keywords.Insert (2012, "American", 1579124585);
    
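The same normalization can be sketched in Python: each table becomes a flat list of rows, and the imperative Insert statements become appends. (The table and column names follow the declarations above; the list-of-dicts representation is just an illustrative choice, not how a real database stores rows.)

```python
# Flat, normalized tables as lists of rows (dicts).
products_table = []
keywords_table = []
ratings_table  = []

def insert(table, row):
    # imperative DML: append a row to a table
    table.append(row)

insert(products_table, {"ID": 1579124585, "Title": "The Right Stuff",
                        "Author": "Tom Wolfe", "Year": 1979, "Pages": 320})
insert(keywords_table, {"ID": 4711, "Keyword": "Book",      "ProductID": 1579124585})
insert(ratings_table,  {"ID": 787,  "Rating": "****",       "ProductID": 1579124585})
insert(keywords_table, {"ID": 1843, "Keyword": "Hardcover", "ProductID": 1579124585})
insert(ratings_table,  {"ID": 747,  "Rating": "4 stars",    "ProductID": 1579124585})
insert(keywords_table, {"ID": 2012, "Keyword": "American",  "ProductID": 1579124585})
```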

    An important consequence of normalizing a single type into separate tables is that the database must maintain referential integrity to ensure that:

  • the foreign-/primary-key relationships across related tables remain synchronized across mutations;
  • the primary key of every row remains unique within its table; and
  • foreign keys always point to a valid primary key.

    For example, we cannot delete a row in the Products table without also deleting the related rows in the Keywords and Ratings tables.

    Referential integrity implies a closed-world assumption, where transactions on the database are serialized by (conceptually)

  • suspending the world synchronously,
  • applying the required changes, and
  • resuming the world again when referential integrity has been restored successfully, rolling back otherwise.

    On the one hand, this simplifies the life of developers via ACID (atomicity, consistency, isolation, durability) transactions (although in practice, for efficiency, one must often deal with much weaker isolation levels) and allows for impressive (statistics-based) query optimization.

    The closed-world assumption, however, is in direct contradiction with distribution and scale-out. The bigger the world, the harder it is to keep closed.

    Returning to our example, we present the naïve query to find the titles and keywords of all products that have four stars, expressed directly in terms of the relational model. It creates the cross-product of all possible combinations:

       from product in Products
       from rating  in Ratings
       from keyword in Keywords
       where product.ID == rating.ProductID
          && product.ID == keyword.ProductID
          && rating.Rating == "****"
       select new { product.Title, keyword.Keyword };
    
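Executed naively, this query is a triple nested loop over the cross-product of the three tables. A self-contained Python sketch (with tiny hand-made tables standing in for the database) makes that cubic shape explicit:

```python
# Tiny stand-in tables; in a real database these would be the
# normalized Products/Ratings/Keywords tables.
products = [{"ID": 1579124585, "Title": "The Right Stuff"}]
ratings  = [{"Rating": "****",    "ProductID": 1579124585},
            {"Rating": "4 stars", "ProductID": 1579124585}]
keywords = [{"Keyword": "Book",      "ProductID": 1579124585},
            {"Keyword": "Hardcover", "ProductID": 1579124585},
            {"Keyword": "American",  "ProductID": 1579124585}]

# The naive query: three nested generators over the full
# cross-product, then a filter on keys and rating.
result = [(p["Title"], k["Keyword"])
          for p in products
          for r in ratings
          for k in keywords
          if p["ID"] == r["ProductID"]
          and p["ID"] == k["ProductID"]
          and r["Rating"] == "****"]
```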

    The result of this query is a flat set of title/keyword rows. Disappointingly, this row set is not itself normalized.

    In fact, to return the normalized representation of our object-graph query, we need to perform two queries (within a single transaction) against the database: one to return the title and its primary key, and a second query that selects the related keywords.

    What we observe here is SQL's lack of compositionality — the ability to combine complex values arbitrarily from simpler values without falling outside the system.

    By looking at the grammar definition for table rows, we can immediately see that SQL lacks compositionality; since there is no recursion, rows can contain only scalar values:

       row ::= new { ..., name = scalar, ... }
    
    Compare this with the definition in the object-graph model, where a value can contain arbitrary values, including other rows (or nested collections):
       value ::= new { ..., name = value, ... } | scalar
    

    SQL is rife with noncompositional features. For example, the semantics of NULL is a big mess: why does adding the number 13 to a NULL value, 13+NULL , return NULL, but summing the same two values, SUM(13, NULL), returns 13?
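The asymmetry can be mimicked in Python, treating None as NULL: scalar addition propagates the null, while the aggregate simply skips it. This is only an illustrative model of SQL's behavior, not SQL itself:

```python
# A model of SQL's treatment of NULL, with None standing in for NULL.
def add(x, y):
    # Scalar arithmetic: NULL propagates, so 13 + NULL is NULL.
    if x is None or y is None:
        return None
    return x + y

def sql_sum(values):
    # Aggregation: NULLs are silently skipped, so SUM(13, NULL) is 13.
    return sum(v for v in values if v is not None)
```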

    Also, even though query optimizers in modern SQL implementations are remarkably powerful, the original query will probably run in cubic time when implemented via three nested loops that iterate over every value in the three tables. A seasoned SQL programmer would instead use explicit join syntax to ensure that the query is as efficient as our object-based query:

         from product in Products join rating in Ratings
           on product.ID equals rating.ProductID
         where rating.Rating == "****"
         select product into fourstarproduct
         join keyword in Keywords
           on fourstarproduct.ID equals keyword.ProductID
         select new { fourstarproduct.Title, keyword.Keyword };
    
    Depending on the encoding of the nesting of the result of a join using flat result sets, the SQL programmer must choose among various flavors of INNER, OUTER, LEFT, and RIGHT joins.

    It is not only the programmer who needs to recover the original structure of the code. The database implementer must also jump through hoops to make queries execute efficiently by building indexes that avoid the potential cubic effect.

    For one-to-many relationships, indexes are nothing more than nested collections resulting from precomputing joins between tables to quickly find all the rows whose foreign keys point to a row with a particular primary key. Since the relational model is not closed under composition, however, the notion of index has to be defined outside the model.

    Two natural indexes correspond to the relationships between Products and Ratings and Products and Keywords, respectively. For each product in the Product table, the ratings index contains the collection of all related ratings:

    from rating in Ratings 
      where rating.ProductID == product.ID
        select rating;
    
    Similarly, for each product in the Product table, the keywords index contains the collection of all keywords related to that product:
      from keyword in Keywords 
        where keyword.ProductID == product.ID
          select keyword;
    
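In Python, precomputing such an index amounts to grouping the child rows by their foreign key, so that finding all ratings for a product becomes a single dictionary lookup instead of a table scan. (The table shape is the illustrative one used throughout; `defaultdict` is just a convenience.)

```python
from collections import defaultdict

ratings = [{"ID": 787, "Rating": "****",    "ProductID": 1579124585},
           {"ID": 747, "Rating": "4 stars", "ProductID": 1579124585}]

# Precompute the join: group child rows by their foreign key.
ratings_index = defaultdict(list)
for r in ratings:
    ratings_index[r["ProductID"]].append(r)

# Looking up all ratings for a product is now one lookup,
# recovering the nested collection of the original object graph.
product_ratings = ratings_index[1579124585]
```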
    If we visualize the indexes as additional columns on the Products table, the reversal of the original relationships between the tables becomes apparent. Each row in the Products table now has a collection of foreign keys pointing to the Keywords and Ratings tables much like the original object graph.

    One of the advantages touted for normalization over hierarchical data is the ability to perform ad-hoc queries — that is, to join tables on conditions not envisioned in the original data model. For example, we could try to find all pairs of products where the length of the title of one product is the same as the length of the author's name in the other using the following query:

    from p1 in Products
      from p2 in Products
        where p1.Title.Length == p2.Author.Length
          select new { p1, p2 };
    
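As a runnable sketch (with a two-row toy table), the ad-hoc join is again a nested comprehension; without an index there is no way around examining all pairs:

```python
# A toy Products table; the rows are invented for illustration.
products = [{"ID": 1, "Title": "The Right Stuff", "Author": "Tom Wolfe"},
            {"ID": 2, "Title": "Tom Wolfe",       "Author": "The Right Stuff"}]

# Ad-hoc join: pair products where one title's length equals the
# other product's author-name length. Quadratic without an index.
pairs = [(p1["ID"], p2["ID"])
         for p1 in products
         for p2 in products
         if len(p1["Title"]) == len(p2["Author"])]
```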
    Without an index, however, this query requires a full table scan and hence takes time quadratic in the size of the Products table.

    The ability to create indexes makes a closed-world assumption. For example, if we modify the previous ad-hoc query to find all pairs of Web pages where one page has a URL referencing the other, it should be obvious that building an index for this join is quite a hard task when you do not have the whole Web available inside a single database:

      from p1 in WWW
      from p2 in WWW
        where p2.Contains (p1.URL)
          select new { p1, p2 };
    

    Summarizing what we have learned so far, we see that in order to use a relational database, starting with a natural hierarchical object model,

  • the designer needs to normalize the data model into multiple types that no longer reflect the original intent;
  • the application developer must reencode the original hierarchical structure by decorating the normalized data with extra metadata; and, finally,
  • the database implementer has to speed up queries over the normalized data by building indexes that essentially re-create the original nested structure of the data.

    Let's start by slightly simplifying the example. We do so by removing the object identity of the Ratings and Keywords collections to reflect more directly how they are modeled in the relational world. We inline the Keywords and Ratings items directly into the parent Product, as if we had value-based collections.

    We notice two interesting facts:

  • In the object-graph model, object identity is intensional — that is, object identity is not part of the values themselves but is determined by their keys in the store.
  • In the relational model, object identity is extensional — that is, object identity is part of the value itself, in the form of a primary key.

    At this point, it appears that there is a strong correspondence between these two representations. They both consist of

  • a collection of elements (objects or rows) and
  • a collection of arrows between the elements,

    the only difference being the direction of the arrows.

    Is there some precise way of describing such a situation?

    Fortunately, there is an entire area of mathematics designed exactly for this: CATEGORY THEORY


    Category theory arose from studies of mathematical structures and an attempt to relate classes of structures by considering the relations between them. Thus, category theory expresses mathematical concepts in terms of objects, arrows between objects, and the composition of arrows, along with some axioms that the composition of arrows should satisfy.

    A computational view of category theory is that it is a highly stylized, compact functional language. Small as it is, it's enough to represent all of mathematics. For computer scientists, category theory has proved to be a rich source of techniques that are readily applicable to many real-world problems.

    The first powerful concept from category theory that we will use is DUALITY.

    Formally, given a category C of objects and arrows, we obtain the dual category coC by reversing all the arrows in C. If a statement T is true in C, then its dual statement coT is true in the dual category coC. In the context of this article, we can read "opposite statement" for "dual statement".

    In the SQL category, child nodes point to parent nodes when the foreign key of a child node equals the primary key of the parent node.

    In the coSQL category, the arrows are reversed. Parent nodes point to child nodes when the child pointer in the parent equals the address of the child node in the store.

    In other words, noSQL is the dual of SQL — noSQL is really coSQL.

    The implication of this duality is that coSQL and SQL are not in conflict, like good and evil. Instead they are two opposites that coexist in harmony and can transmute into each other like yin and yang. Interestingly, in Chinese philosophy yin symbolizes open and hence corresponds to the open world of coSQL, and yang symbolizes closed and hence corresponds to the closed world of SQL.

    Because SQL and coSQL are mathematically dual, we can reason precisely about the tradeoffs between the two instead of relying on rhetoric and anecdotal evidence.

    If we really take the duality to heart, we may also choose to (but don't have to) fine-tune our model for key-value stores to reflect the duality between values and computations, and that between synchronous ACID and asynchronous BASE (Basically Available, Soft state, Eventually consistent).

    Looking up a value given its address or key in a key-value store can involve a computation, which in a truly open world has potential latency and may fail. Perhaps the best example of a computation-driven key-value store with long latency and high probability of failure is the Web itself (where every client must be prepared to handle a 404), with URIs as keys, "resources" as values, and the HTTP verbs as a primitive query/data-manipulation language. On the other hand, in a C-like key-value memory model, we usually make the simplifying assumption that a key lookup in memory takes constant time and does not fail.

    Traversing a relationship in the closed world of the relational model involves comparing two values for equality, which is guaranteed to succeed because of referential integrity; and vice versa, referential integrity dictates that relationships are value-based — otherwise, we could never be sure that referential integrity actually holds.

    Note that comparing keys by value requires that objects in the SQL category are strongly typed, at least enough to identify primary and foreign keys; and dually, since we do not need to know anything about the value of a coSQL object to find it using its key, objects in the coSQL world can be loosely typed.

    Our abstract model of the SQL category did not impose any restrictions on the structure of rows; we assumed only that we could determine a primary or foreign key to relate two rows.

    In the typical relational model we would further impose the constraint that rows consist of flat sequences of scalar values (the so-called First Normal Form, or 1-NF).

    If we dualize relations in 1-NF, then we get a key-value model where values consist of either scalars or keys or collections of scalars or keys.

    If we assume that rows in the foreign-/primary-key model are simply blobs and keys are strings, then the dual key-value model is exactly the HTML5 key-value storage model:

    interface Storage
    {
      readonly attribute unsigned long     length;
      getter                               DOMString key (in unsigned long index);
      getter any                           getItem (in DOMString key);
      setter creator                       void setItem (in DOMString key, in any data);
      deleter                              void removeItem (in DOMString key);
      void                                 clear ();
    }
    
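As a sanity check on this duality, the Storage interface is easy to model on top of an ordinary Python dictionary — keys are strings and values are untyped, exactly the loosely typed coSQL shape. (This is a sketch of the interface's shape, not a spec-complete implementation.)

```python
class Storage:
    """A dict-backed sketch of the HTML5 key-value storage model."""

    def __init__(self):
        self._data = {}

    @property
    def length(self):
        return len(self._data)

    def key(self, index):
        # the key at a given position (insertion order in this sketch)
        return list(self._data)[index]

    def getItem(self, key):
        return self._data.get(key)   # untyped "any" values; None if absent

    def setItem(self, key, data):
        self._data[key] = data

    def removeItem(self, key):
        self._data.pop(key, None)

    def clear(self):
        self._data.clear()

s = Storage()
s.setItem("title", "The Right Stuff")
```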


    A little more category theory

    So far we have discussed the basic data models for SQL and coSQL, but we have not yet touched upon queries. By applying a little more category theory we can show how a single abstraction, monads and monad comprehensions, can be used as a unified query language for both SQL and coSQL.

    To talk about queries, we need to be more precise about what we mean by collections of values. Pure relational algebra is based on sets of rows, but actual relational databases use multisets (bags) or ordered multisets (permutations). To model collections abstractly, we look at sets/bags/permutations of rows and apply the category theory dictum: "What is the interface that these various collections of rows implement?" and "How do we generalize queries based on such an interface?"

    First, let us stick with simple set collections. When we write a SQL query such as SELECT F(a,b) FROM as AS a, bs AS b WHERE P(a,b), the SQL compiler translates that pretty syntax into a relational-algebra expression in terms of selection, projection, joins, and Cartesian product.

    There is no reason to restrict the relational algebra operators to work over just sets (or bags, or permutations) of rows. Instead, we can define similar operators on any collection of values of arbitrary type T.

    Rather incredibly, an interface of this shape is well known in category theory. It is called a monad: the collection type constructor is the functor of the monad; the singleton constructor is the unit of the monad; SelectMany is the bind of the monad; and ∅ and ∪ are the neutral element and addition, respectively. For the rest of us, these are just the signatures of methods defined on an abstract interface for collections.
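The interface alluded to here can be written out as a handful of Python functions over lists (lists standing in for sets/bags/permutations). This is a hedged sketch; the names mirror the LINQ-style operators the text mentions:

```python
# The list monad, spelled out as the abstract collection interface.
def unit(x):
    # singleton constructor: the "unit" of the monad
    return [x]

def select_many(xs, f):
    # SelectMany, the "bind" of the monad: map f over xs, flatten one level
    return [y for x in xs for y in f(x)]

empty = []                       # the neutral element (∅)

def union(xs, ys):               # addition of collections (∪)
    return xs + ys

# A query as a monad comprehension: all products x * y.
query = select_many([1, 2],
                    lambda x: select_many([10, 20],
                                          lambda y: unit(x * y)))
```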

    This is no theoretical curiosity. We can play the same syntactic tricks that SQL does with relational algebra, but using monads instead. Such monad comprehensions have been recognized as a versatile query language by both functional programmers and database researchers.

    Scalability and distribution

    In contrast to most treatments of noSQL, we did not mention scalability as a defining characteristic of coSQL. The openness of the coSQL model eases scalable implementations across large numbers of physically distributed machines.

    For a large subset of coSQL queries, the shape of the query closely follows the shape of the data. Mathematics dictates that such coSQL queries can be performed using MapReduce.

    Practical implementations of MapReduce usually generalize Bird's lemma slightly: they use SelectMany instead of Select so that the map phase can return a collection instead of a single value; they insert an intermediate GroupBy as a way to "write" equivalence classes of values from the map phase into the key-value store for subsequent processing in the reduce phase; and they then aggregate over each subcollection.
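That map-group-reduce pipeline can be sketched in a few lines of Python: SelectMany plays the map phase, a GroupBy collects intermediate key-value pairs into equivalence classes, and an aggregate reduces each subcollection. (Word counting is the customary stand-in example here, not one taken from the article.)

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    # Map phase (SelectMany): each input may yield many (key, value) pairs.
    pairs = [kv for item in inputs for kv in mapper(item)]
    # GroupBy phase: "write" equivalence classes into a key-value store.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: aggregate over each subcollection.
    return {key: reducer(values) for key, values in groups.items()}

# The customary word-count example.
counts = map_reduce(["to be", "or not to be"],
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=sum)
```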

    In an open world where collections are distributed across the network, it is much harder for a query optimizer to perform a global optimization taking into account latency, errors, etc. Hence, most coSQL databases rely on explicit programmatic queries of a certain pattern such as MapReduce that can be executed reliably on the target physical machine configuration or cluster.

    In contrast to common belief, the question of big versus small data is orthogonal to the question of SQL versus coSQL. While the coSQL model naturally supports extreme sharding, the fact that it does not require strong typing and normalization makes it attractive for "small" data as well. On the other hand, it is possible to scale SQL databases by careful partitioning.