# Abstract nonsense lurking beneath SQL

## Harlan Ichikawa

This is part of the “machine learning for disillusioned mathematicians” series. Let’s get straight to it.

## Visualizing tables

A cartoon of a table with only one column is a line with a bunch of dots. Each dot represents a “row” of the table. Similarly, a table with two columns is cartoonified as a bunch of line segments between two lines. Here, each line segment corresponds to a row of the table. Naturally, a table with three columns is visualized as a bunch of triangles And for four column tables, we use tetrahedra, but I can’t draw it because I’m lazy. $N$-column tables are visualized as a bunch of $N$-simplices, and I don’t need to draw it because it’s sooooooo easy to visualize an $N$-simplex if you are an inter-dimensional being.

### nullible fields?

If a field can take on a null value, we get slightly different visuals. For example, a three column table where the third field is nullible is visualized as What one should note is that there are both triangles and line segments (as opposed to purely triangles as in the non-nullible case). So nullible fields allow for mixed shapes.

## Visualizing Unions and Joins

This is a cartoon of what UNION looks like This is a cartoon of an INNER JOIN An Outer Join And finally, a cross Join Perhaps you think that was abstract nonsense?

### … Hold my beer.

My spidey-sense tells me that INNER JOIN is a limit of the diagram

Moreover, OUTER JOIN is a co-limit of the same diagram.

Finally, SQL’s CROSS JOIN is a limit of the diagram

$\bullet \quad \bullet$.

I’ll take my beer back now.

## I’m going to puke

I apologize. With only 4 lines of text and two diagrams, I probably ejected 90% of my potential readership. That was clearly an unhealthy amount of abstraction.

There really is no reason to write/read a blog post on SQL joins because they are such a simple concepts. The writer of such an article would need to be so stupid, so ignorant, that he/she would probably stoop as low as redefining the most basic concept, like “tables”.

### The notion of a table

These cartoons suggest that a table is abstractly represented as some sort of collection of $N$-simplices (possibly of varying dimension). However, it’s nut just a set. It’s a multiset. We can formally represent a table as a multiset equipped with a collection of morphisms called “columns”. Lets unpack this.

### Multi-what?

A set is just a bag of distinct things. For example $\{1,2,3\}$ is a set. If you take the set $\{1,2,3\}$ and insert a $1$ again, you still get $\{1,2,3\}$, as opposed to something that has two copies of $1$. A set has no notion of multiplicity.

Enhancing sets with multiplicity yields the concept of a multiset as a pair $(S,m)$ where $S$ is a set and $m: S \to \mathbb{N}$ denotes the multiplicity of each element. Here $\mathbb{N} = \{ 1, 2, 3, \dots \}$ denotes the natural numbers.

I like to think of multisets as the toy models of measures. Just like measure, the natural action of a map on a multiset is a push-forward. A morphism from $(X,m)$ to $(Y,n)$ is a (set) map $f:X \to Y$ such that the diagram commutes. The collection of multisets along with these arrows form a category I will call ${\rm Multiset}$.

Tables in a relational database are multisets, as opposed to just plain old sets, because it’s possible to have duplicate rows in a table.

### Example: a two column table

Now there are some deranged freaks who insist that a table is something that looks like

f_0 f_1
1 true
2 true
2 true
3 false

I call these sick and twisted individuals “normal people”. However, it’s perfectly valid to represent this so-called “table” as the multiset $(S, m)$ where $S = \{(1,true),(2,true),(3,false)\}$ and

The columns are the maps $f_0(x,y) = x$ and $f_1(x,y) = y$ for $(x,y) \in S$. So we see a table is a multiset.

### Example: nullible fields

You can represent a table with a nullible field such as

f_0 f_1
1 true
2 true
2 true
3 false
2 NULL

by considering the set $S = \{(1,true),(2,true),(3,false), 2 \}$, and the same multiplicity function as the previous section, but extended by defining $m(2)=1$. In any case, NULL is naturally accounted for by multisets, and NULL entries are merely an artifact that occasionally arises when we display a multiset as a (normal person’s) table.

### The DISTINCT monad

The forgetful functor $F: (S,m) \in {\rm Multiset} \mapsto S \in {\rm Set}$ has an adjoint, $G:{\rm Set} \to {\rm Multiset}$ given by $G(S) = (S,1_S)$ where $1_S: S \to \mathbb{N}$ is just the constant function with value $1$. We can define the monad ${\rm DISTINCT} := G \circ F$. This is the content of the DISTINCT keyword in SQL.

### Cross joins

Cross joins are just another word for a product. Once you realize a table is just a multiset, a cross join is truly the natural generalization of cartesian products within ${\rm Multiset}$. Given multisets (tables) $(X,m)$ and $(Y,n)$ the cartesian product (cross join), denoted $(X,m) \times (Y,n)$, is the multiset (table) $(X \times Y, m \cdot n)$ where $(m\cdot n)(x,y) := m(x)n(y)$.

### Unions

Given sets $X$ and $Y$, you know what the union is. Given multisets $(X,m)$ and $(Y,n)$ the union is the multiset $(X \cup Y , m+n)$. This translates directly to the SQL notion of UNION

### Natural (inner) Join

Here is an SQL query

SELECT *
FROM t1 JOIN t2
ON t1.col1 = t2.col2


This is one example of joining tables. Here is another (which assumes t1 has only one column)

SELECT *
FROM t1 JOIN t2
ON LOWER(t1)=t2.col2


Basically, an inner join requires that you have two tables $t_1,t_2$, along with multiset morphisms $f:t_1 \to b_1$ and $g:t_2 \to b_2$. Then the inner join of $t_1$ and $t_2$ is the largest table which makes the diagram commute.

### Full outer joins

The story for outer joins is similar, except you swap the direction of some of the arrows. An outer join is the minimal table which makes the diagram commute. In order to display the table (in the normal people sense) we will probably need to have NULL entries. However, the definition of outer-join does not even need a notion of NULL. It’s naturally incorporated.

### Final thoughts

I wish I could take this further, but I am sleepy. After some googling I stumbled upon this stack exchange posting which seems like a decent rabbit hole to climb into. Unsurprisingly, it appears David Spivak has dug into this stuff a lot, and one of his papers mentioned in the stackexchange post is called Simplicial Databases. If that’s the title, the mental picture of databases is probably similar to the caveman paintings at the beginning of this post.

There are a lot of things that feel expressible in categorical terms that I puzzled over quite a bit but never figured out

• Left Joins
• Group by
• Aggregation
• Schemas

In particular, I feel as if GROUP BY eats multiset morphisms and outputs multisets of multisets, which then collapse back down to a “first order” multiset after an aggregation. It feels like there is a monad lurking, but I never managed to make this anything more than a feeling. If any of you readers have anything to add, can you reach out to me?

### Shoutout

I used tikz-cd editor a lot in the writing of this post. Super awesome tool.

Theme "Swiss" designed by broccollini