This is part of the “machine learning for disillusioned mathematicians” series. Let’s get straight to it.
A cartoon of a table with only one column is a line with a bunch of dots.
Each dot represents a “row” of the table. Similarly, a table with two columns is cartoonified as a bunch of line segments between two lines.
Here, each line segment corresponds to a row of the table. Naturally, a table with three columns is visualized as a bunch of triangles
And for four column tables, we use tetrahedra, but I can’t draw it because I’m lazy. -column tables are visualized as a bunch of -simplices, and I don’t need to draw it because it’s sooooooo easy to visualize an -simplex if you are an inter-dimensional being.
If a field can take on a null value, we get slightly different visuals. For example, a three column table where the third field is nullible is visualized as
What one should note is that there are both triangle and line segments (as opposed to purely triangles as in the non-nullible case). So nullible fields allow for mixed shapes.
Visualizing Unions and Joins
This is a cartoon of what
UNION looks like
This is a cartoon of an
An Outer Join
And finally, a cross Join
Perhaps you think that was abstract nonsense?
… Hold my beer.
My spidey-sense tells me that
INNER JOIN is a limit of the diagram
OUTER JOIN is a co-limit of the same diagram.
CROSS JOIN is a limit of the diagram
I’ll take my beer back now.
I’m going to puke
I apologize. With only 4 lines of text and two diagrams, I probably ejected 90% of my potential readership. That was clearly an unhealthy amount of abstraction.
There really is no reason to write/read a blog post on SQL joins because they are such a simple concepts. The writer of such an article would need to be so stupid, so ignorant, that he/she would probably stoop as low as redefining the most basic concept, like “tables”.
The notion of a table
These cartoons suggest that a table is abtractly represented as some sort of collection of -simplices (possibly of varying dimension). However, it’s nut just a set. It’s a multiset. We can formally represent a table as a multiset equipped with a collection of morphisms called “columns”. Lets unpack this.
A set is just a bag of distinct things. For example is a set. If you take the set and insert a again, you still get , as opposed to something that has two copies of . A set has no notion of multiplicity.
Enhancing sets with multiplicity yields the concept of a multiset as a pair where is a set and denotes the multiplicity of each element. Here denotes the natural numbers.
I like to think of multisets as the toy models of measures. Just like measure, the natural action of a map on a multiset is a push-forward. A morphism from to is a (set) map such that the diagram
commutes. The collection of multisets along with these arrows form the category .
Tables in a relational database are instances of multisets, as opposed to just plain old sets, because it’s possible to have duplicate rows in a table.
Example: a two column table
Now there are some deranged freaks who insist that a table is something that looks like
I call these sick and twisted individuals “normal people”. However, it’s perfectly valid to represent this so-called “table” as the multiset where and
The columns are the maps and for . So we see a table is a multiset.
Example: nullible fields
You can represent a table with a nullible field such as
by considering the set , and the same multiplicity function as the previous section, but extended by defining . In any case,
NULL is naturally accounted for by multisets, and
NULL entries are merely an artifact that occasionally arises when we display a multiset as a (normal person’s) table.
The forgetful functor has an adjoint, given by where is just the constant function with value . We can define the monad . This is the content of the
DISTINCT keyword in SQL.
Cross joins are just another word for a product. Once you realize a table is just a multiset, a cross join is truly the natural generalization of cartesian products within . Given multisets (tables) and the cartesian product (cross join), denoted , is the multiset (table) where .
Given sets and , you know what the union is. Given multisets and the union is the multiset . This translates directly to the SQL notion of
Natural (inner) Join
Here is an SQL query
SELECT * FROM t1 JOIN t2 ON t1.col1 = t2.col2
This is one example of joining tables. Here is another (which assumes t1 has only one column)
SELECT * FROM t1 JOIN t2 ON LOWER(t1)=t2.col2
Basically, an inner join requires that you have two tables , along with multiset morphisms and . Then the inner join of and is the largest table which makes the diagram
Full outer joins
The story for outer joins is similar, except you swap the direction of some of the arrows. An outer join is the minimal table which makes the diagram
commute. In order to display the table (in the normal people sense) we will probably need to have
NULL entries. However, the definition of outer-join does not even need a notion of
NULL. It’s naturally incorporated.
I wish I could take this further, but I am sleepy. After some googling I stumbled upon this stack exchange posting which seems like a decent rabbit hole to climb into. Unsurprisingly, it appears David Spivak has dug into this stuff a lot, and one of his papers mentioned in the stackexchange post is called Simplicial Databases. If that’s the title, the mental picture of databases is probably similar to the caveman paintings at the beginning of this post.
There are a lot of things that feel expressible in categorical terms that I puzzled over quite a bit but never figured out
- Left Joins
- Group by
In particular, I feel as if
GROUP BY eats multiset morphisms and outputs multisets of multisets, which then collapse back down to a “first order” multiset after an aggregation. It feels like there is a monad lurking, but I never managed to make this anything more than a feeling. If any of you readers have anything to add, can you reach out to me?
I used tikz-cd editor a lot in the writing of this post. Super awesome tool.