This is part of the “machine learning for disillusioned mathematicians” series. Let’s get straight to it.

## Visualizing tables

A cartoon of a table with only one column is a line with a bunch of dots.

Each dot represents a “row” of the table. Similarly, a table with two columns is cartoonified as a bunch of line segments between two lines.

Here, each line segment corresponds to a row of the table. Naturally, a table with three columns is visualized as a bunch of triangles

And for four column tables, we use tetrahedra, but I can’t draw it because I’m lazy. -column tables are visualized as a bunch of -simplices, and I don’t need to draw it because it’s sooooooo easy to visualize an -simplex if you are an inter-dimensional being.

### nullible fields?

If a field can take on a null value, we get slightly different visuals. For example, a three column table where the third field is nullible is visualized as

What one should note is that there are both triangles *and* line segments (as opposed to purely triangles as in the non-nullible case). So nullible fields allow for mixed shapes.

## Visualizing Unions and Joins

This is a cartoon of what `UNION`

looks like

This is a cartoon of an `INNER JOIN`

An Outer Join

And finally, a cross Join

Perhaps you think that was abstract nonsense?

### … Hold my beer.

My spidey-sense tells me that `INNER JOIN`

is a limit of the diagram

Moreover, `OUTER JOIN`

is a co-limit of the same diagram.

Finally, SQL’s `CROSS JOIN`

is a limit of the diagram

.

I’ll take my beer back now.

## I’m going to puke

I apologize. With only 4 lines of text and two diagrams, I probably ejected 90% of my potential readership. That was clearly an unhealthy amount of abstraction.

There really is no reason to write/read a blog post on SQL joins because they are such a simple concepts. The writer of such an article would need to be so stupid, so ignorant, that he/she would probably stoop as low as redefining the most basic concept, like “tables”.

### The notion of a table

These cartoons suggest that a table is abstractly represented as some sort of collection of -simplices (possibly of varying dimension). However, it’s nut just a set. It’s a multiset. **We can formally represent a table as a multiset equipped with a collection of morphisms called “columns”.** Lets unpack this.

### Multi-what?

A set is just a bag of distinct things. For example is a set. If you take the set and insert a again, you still get , as opposed to something that has two copies of . *A set has no notion of multiplicity.*

Enhancing sets with multiplicity yields the concept of a **multiset** as a pair where is a set and denotes the multiplicity of each element. Here denotes the natural numbers.

I like to think of multisets as the toy models of measures. Just like measure, the natural action of a map on a multiset is a push-forward. A morphism from to is a (set) map such that the diagram

commutes. The collection of multisets along with these arrows form a category I will call .

Tables in a relational database are multisets, as opposed to just plain old sets, because it’s possible to have duplicate rows in a table.

### Example: a two column table

Now there are some deranged freaks who insist that a table is something that looks like

f_0 | f_1 |
---|---|

1 | `true` |

2 | `true` |

2 | `true` |

3 | `false` |

I call these sick and twisted individuals “normal people”. However, it’s perfectly valid to represent this so-called “table” as the multiset where and

The columns are the maps and for .
So we see a table *is* a multiset.

### Example: nullible fields

You can represent a table with a nullible field such as

f_0 | f_1 |
---|---|

1 | `true` |

2 | `true` |

2 | `true` |

3 | `false` |

2 | `NULL` |

by considering the set , and the same multiplicity function as the previous section, but extended by defining . In any case, `NULL`

is naturally accounted for by multisets, and `NULL`

entries are merely an artifact that occasionally arises when we display a multiset as a (normal person’s) table.

### The `DISTINCT`

monad

The forgetful functor has an adjoint, given by where is just the constant function with value . We can define the monad . This is the content of the `DISTINCT`

keyword in SQL.

### Cross joins

Cross joins are just another word for a product. Once you realize a table is just a multiset, a cross join is truly the natural generalization of cartesian products within . Given multisets (tables) and the cartesian product (cross join), denoted , is the multiset (table) where .

### Unions

Given sets and , you know what the union is. Given multisets and the union is the multiset . This translates directly to the SQL notion of `UNION`

### Natural (inner) Join

Here is an SQL query

```
SELECT *
FROM t1 JOIN t2
ON t1.col1 = t2.col2
```

This is one example of joining tables. Here is another (which assumes t1 has only one column)

```
SELECT *
FROM t1 JOIN t2
ON LOWER(t1)=t2.col2
```

Basically, an inner join requires that you have two tables , along with multiset morphisms and . Then the inner join of and is the largest table which makes the diagram

commute.

### Full outer joins

The story for outer joins is similar, except you swap the direction of some of the arrows. An outer join is the minimal table which makes the diagram

commute. In order to display the table (in the normal people sense) we will probably need to have `NULL`

entries. However, the definition of outer-join does not even need a notion of `NULL`

. It’s naturally incorporated.

### Final thoughts

I wish I could take this further, but I am sleepy. After some googling I stumbled upon this stack exchange posting which seems like a decent rabbit hole to climb into. Unsurprisingly, it appears David Spivak has dug into this stuff a lot, and one of his papers mentioned in the stackexchange post is called Simplicial Databases. If that’s the title, the mental picture of databases is probably similar to the caveman paintings at the beginning of this post.

There are a lot of things that feel expressible in categorical terms that I puzzled over quite a bit but never figured out

- Left Joins
- Group by
- Aggregation
- Schemas

In particular, I feel as if `GROUP BY`

eats multiset morphisms and outputs multisets of multisets, which then collapse back down to a “first order” multiset after an aggregation. It feels like there is a monad lurking, but I never managed to make this anything more than a feeling. If any of you readers have anything to add, can you reach out to me?

### Shoutout

I used tikz-cd editor a lot in the writing of this post. Super awesome tool.