My own journey from academia to data science was hard and annoying. The hard part was becoming a decent coder. The annoying part was the math. Not because the math was difficult, but because it was often written in such a boring way. I think the authors typically have practicality in mind; the math was merely a tool for them, and any abstraction needed immediate justification when used. That makes sense for the large readership of engineers and statisticians. However, it left me unsatisfied, and I suspect I am not alone. There is a small collection of people who actually enjoy abstraction for its own sake. We might be afraid to admit it, because we were trained to hide predilections that risk our chances at grant money, but this doesn’t change the truth of the matter. Now that I’m out of academia I feel safer admitting that I find abstraction fun, and an end in and of itself. It is useful for the speed with which it allows one to transport logic from one domain into another. But it’s also fun. My hope is that this series will make ML more fun for a neglected audience with similar feelings.
If you’re a starving and idealistic mathematician, as I once was, and you are coming to the depressing realization that the ivory tower is a shadow of its past self, and now merely a zombie for promoting STEM initiatives and thus a puppet of private industry, come with me and step down into our cave.
Plato’s cave
Plato’s allegory of the cave is roughly summarized as follows: A bunch of people are chained inside a cave and forced to watch shadow puppets dance on the wall until they think the shadows are reality. This happens to us every time we go to the movie theatre. If the movie is any good, we forget that we are staring at a blank wall with light projected on it. Once we leave the theatre, it’s not clear that we’ve really left the theatre. Perhaps our perception of reality is but a shadow of truth.
The notion of a shadow is mathematically represented as a map projecting the platonic forms into a space of observables. If reality were a sequence of statistically independent events, then a model for such a reality could be easily mathematized. Our senses could be represented by a collection of random variables $X_i$ for $i = 1, \dots, n$. All of data science is concerned with studying models based on finite collections of samples in this model of Plato’s cave. For example, what is the relationship (if any) between senses $X_i$ and $X_j$?
Relations
Recall, a relation is a subset, $R \subset X \times Y$, which we can think of as a generalization of a function. A function $f : X \to Y$ is represented as a relation by its graph, $\mathrm{graph}(f) = \{ (x, f(x)) : x \in X \}$. In fact we can make this a functor $\mathrm{graph} : \mathbf{Set} \to \mathbf{Rel}$.
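On finite sets these definitions are concrete enough to code up. Here is a minimal Python sketch (the helper names `graph` and `is_graph_of_function` are my own, made up for illustration) of embedding a function into the world of relations via its graph, and of checking whether a given relation is the graph of some function:

```python
def graph(f, domain):
    """Embed a function into Rel by taking its graph: the set of (x, f(x)) pairs."""
    return {(x, f(x)) for x in domain}

def is_graph_of_function(relation, domain):
    """A relation is a graph iff every x in the domain relates to exactly one y."""
    return all(sum(1 for (a, _) in relation if a == x) == 1 for x in domain)

square = graph(lambda x: x * x, {-1, 0, 1})
print(sorted(square))                               # [(-1, 1), (0, 0), (1, 1)]
print(is_graph_of_function(square, {-1, 0, 1}))     # True
print(is_graph_of_function({(0, 0), (0, 1)}, {0}))  # False: 0 relates to two outputs
```

The last check fails because $0$ relates to two outputs at once, which is exactly the sense in which a general relation carries more (and messier) information than a function.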
At a high level we have two categories here, and one is a strict subcategory of the other: the category $\mathbf{Set}$, where the objects are sets and the arrows are maps, and the category $\mathbf{Rel}$, where the objects are sets and the arrows are relations.
Note that the embedding functor, $\mathrm{graph} : \mathbf{Set} \to \mathbf{Rel}$, is not invertible. Generally speaking, an arbitrary relation is not the graph of a function… sad face emoji. In supervised machine learning we’d like to do something like constructing a function from a relation… crying face emoji!!
Stop crying!
So clearly the goal stated above is generically infeasible. Fortunately, our real goal is not quite so dumb. It’s almost that dumb though.
A labeled dataset consists of pairs $(x_i, y_i) \in X \times Y$, but it is NOT merely a finite set of pairs. Elements can be listed more than once, and so a labeled dataset is a finite multiset over $X \times Y$. Multisets are a foundation for the field of probability. Therefore, we should probably be doing something probabilistic.
The goal of supervised machine learning is to construct a probabilistic function (a posterior) using a dataset sampled from a probabilistic relation (a joint distribution).
I’ll explain what these probabilistic functions and relations are next.
The probabilistic category
Let’s start by setting up some notation. If $\Sigma$ is a $\sigma$-algebra, we let $\Pr(\Sigma)$ denote the set of probability distributions over $\Sigma$.
We could define a probabilistic category where the objects are probability distributions and the arrows are conditional distributions. A conditional distribution is usually written with two arguments, like $\Pr(y \mid x)$, but it’s only a probability distribution over the first argument, $y$. We can curry this function and view $\Pr(\cdot \mid \cdot)$ as a map, $x \mapsto \Pr(\cdot \mid x)$, from $X$ into $\Pr(\Sigma_Y)$. For a specific $\mu_X \in \Pr(\Sigma_X)$, the conditional distribution, $\Pr(y \mid x)$, can be viewed as an arrow from $\mu_X$ to the distribution

$$\mu_Y(B) = \int_X \Pr(B \mid x) \, d\mu_X(x),$$

and two conditional distributions, $\Pr(y \mid x)$ and $\Pr(z \mid y)$, can be composed via the law of total probability

$$\Pr(C \mid x) = \int_Y \Pr(C \mid y) \, \Pr(dy \mid x).$$
This defines the category $\mathbf{Prob}$, where

- the objects are probability distributions,
- the arrows are conditional probability distributions,
- and the composition is given by the law of total probability.

We will think of $\mathbf{Prob}$ as a probabilistic version of $\mathbf{Set}$.^{1}
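When $X$, $Y$, and $Z$ are finite sets, an arrow of this category is just a row-stochastic matrix, and the law of total probability is matrix multiplication. A minimal NumPy sketch (the matrices are toy values, made up for illustration):

```python
import numpy as np

# A conditional distribution Pr(y | x) on finite sets is a row-stochastic
# matrix K[x, y]: each row is a probability distribution over y.
K_yx = np.array([[0.9, 0.1],    # Pr(y | x = 0)
                 [0.2, 0.8]])   # Pr(y | x = 1)
K_zy = np.array([[0.7, 0.3],    # Pr(z | y = 0)
                 [0.5, 0.5]])   # Pr(z | y = 1)

# Composition via the law of total probability:
# Pr(z | x) = sum_y Pr(z | y) Pr(y | x), i.e. a matrix product.
K_zx = K_yx @ K_zy
print(K_zx.sum(axis=1))  # rows still sum to 1 (up to float error)

# An arrow also pushes a distribution on X forward to one on Y:
mu_x = np.array([0.5, 0.5])
mu_y = mu_x @ K_yx
print(mu_y)
```

The fact that a product of row-stochastic matrices is row-stochastic is exactly the statement that composing two arrows of the category yields another arrow.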
Relations in the probabilistic category
There is a probabilistic analog to $\mathbf{Rel}$. A stochastic relation from $\mu_X \in \Pr(\Sigma_X)$ to $\mu_Y \in \Pr(\Sigma_Y)$ is an element $\mu_{XY} \in \Pr(\Sigma_X \otimes \Sigma_Y)$ whose marginals are $\mu_X$ and $\mu_Y$ respectively. We’ll call this category $\mathbf{ProbRel}$.^{2}
Just as there is a functor $\mathrm{graph}$ which embeds $\mathbf{Set}$ into $\mathbf{Rel}$, there is a functor, $\mathrm{graph}_{\Pr}$, which embeds $\mathbf{Prob}$ into $\mathbf{ProbRel}$:

- for an object $\mu$ in $\mathbf{Prob}$, $\mathrm{graph}_{\Pr}(\mu) = \mu$.
- for an arrow $\Pr(y \mid x) : \mu_X \to \mu_Y$ in the category $\mathbf{Prob}$, $\mathrm{graph}_{\Pr}(\Pr(y \mid x))$ is the joint distribution on $\Sigma_X \otimes \Sigma_Y$ given by

$$\mu_{XY}(A \times B) = \int_A \Pr(B \mid x) \, d\mu_X(x).$$
Something to note: $\mathrm{graph}_{\Pr}$ is invertible!
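In the finite case, $\mathrm{graph}_{\Pr}$ and its inverse are each a one-liner: the functor multiplies the marginal into the conditional, and conditioning (marginalize, then divide) recovers both pieces. A sketch with made-up toy numbers:

```python
import numpy as np

mu_x = np.array([0.25, 0.75])   # a distribution on a 2-point X
K = np.array([[0.9, 0.1],
              [0.2, 0.8]])      # Pr(y | x), one row per x

# graph_Pr: the joint distribution Pr(x, y) = mu_X(x) * Pr(y | x)
joint = mu_x[:, None] * K

# Inverting graph_Pr: marginalize to recover mu_X, then divide to recover K.
mu_x_recovered = joint.sum(axis=1)
K_recovered = joint / mu_x_recovered[:, None]

print(np.allclose(mu_x_recovered, mu_x), np.allclose(K_recovered, K))  # True True
```

The division step is only safe where the marginal is positive, which foreshadows exactly the trouble we run into with continuous data later on.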
The goal of Bayesian supervised machine learning is to associate a posterior distribution to a labeled dataset, $D = \{ (x_i, y_i) \}_{i=1}^N$, sampled from an (unknown) joint distribution $\mu_{XY}$. In other words, given samples from a probabilistic relation, $\mu_{XY}$, we’d like to find a probabilistic function $\Pr(y \mid x)$ such that $\mathrm{graph}_{\Pr}(\Pr(y \mid x)) = \mu_{XY}$. Thank god $\mathrm{graph}_{\Pr}$ is invertible.
The ideal solution
If we knew the probabilistic relation $\mu_{XY}$ we could produce the desired posterior. Let $\pi_X : X \times Y \to X$ and $\pi_Y : X \times Y \to Y$ denote the canonical projections, and let $\mu_X = (\pi_X)_* \mu_{XY}$ be the marginal. We would define $\Pr(\cdot \mid \cdot)$ to be the posterior probability distribution

$$\Pr(B \mid A) = \frac{\mu_{XY}(A \times B)}{\mu_X(A)}$$

for events $A \in \Sigma_X$ and $B \in \Sigma_Y$.
However, since we do not know $\mu_{XY}$, we can’t compute $\Pr(B \mid A)$. So much for idealism.
Empiricism
David Hume (Scottish National Portrait Gallery, Edinburgh)
We might not know $\mu_{XY}$, but we do have a labeled dataset, $D = \{ (x_i, y_i) \}_{i=1}^N$, sampled from $\mu_{XY}$. We should be able to approximate $\mu_{XY}$ using the empirical distribution

$$\hat\mu_N = \frac{1}{N} \sum_{i=1}^N \delta_{(x_i, y_i)},$$

where $\delta_{(x, y)}$ is the Dirac measure centered at $(x, y)$. In fact, for the $\sigma$-algebras observed in practice, we have $\hat\mu_N \to \mu_{XY}$ weakly as $N \to \infty$.
We could then do the same computation as before to get the posterior distribution

$$\hat{\Pr}(B \mid A) = \frac{\hat\mu_N(A \times B)}{\hat\mu_N(A \times Y)}.$$

In fact $\hat{\Pr}(B \mid A) \to \Pr(B \mid A)$ weakly as $N \to \infty$. This is a very conservative data-driven approach which people like David Hume (pictured above) may have advocated. His advice for anything not based upon direct observation was to “commit it then to the flames: for it can contain nothing but sophistry and illusion.”
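For discrete data the empirical posterior is just counting. A minimal Python sketch (the toy dataset and the helper name `empirical_posterior` are made up for illustration):

```python
from collections import Counter

# A labeled dataset: observations x paired with binary labels y.
dataset = [("sunny", 1), ("sunny", 1), ("sunny", 0), ("rainy", 0)]

joint_counts = Counter(dataset)          # empirical joint distribution (up to 1/N)
x_counts = Counter(x for x, _ in dataset)  # empirical marginal over x

def empirical_posterior(y, x):
    """hat-Pr(y | x) = #(x, y) / #(x); only defined for x seen in the data."""
    return joint_counts[(x, y)] / x_counts[x]

print(empirical_posterior(1, "sunny"))   # 0.666... = 2/3
```

Querying an $x$ never seen in the dataset divides by zero, which is Hume’s empiricism hitting its limits and a preview of the trouble below.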
So we’re done right?
Critique of Pure Reason
The empiricist approach Kant work.
Immanuel Kant (portrait by an unidentified painter)
I’m so punny. Feel free to take a moment, and allow yourself to stop laughing.
The solution, $\hat{\Pr}$, is (typically) a really bad solution. Sure, it converges (sort of). However, if $X$ is large, convergence will be slow. This is especially true if either $x$ or $y$ is a continuous random variable, so that singletons have measure zero. For example, if $x$ is a real-valued random variable, then $\hat{\Pr}(B \mid x)$ is ill-defined for any value of $x$ outside of the dataset. Such a failure will occur almost surely for any distribution of $x$’s whose support has nonempty interior. This phenomenon is known as overfitting; we’d say our solution has high variance, and the resulting errors are known as generalization errors because our analysis fails to generalize to anything outside our dataset.
The war on variance
Ensuring that our classifier does not fail outside of $D$ requires that we adopt an assumption. You need to take a risk in order to talk about data that has never been seen.
Typically one assumes the solution exists in some submanifold, $M$, of the space of posteriors, and then we choose the “best” element of $M$ as our solution.
Best?
We define “best” using some cost function $C(P, \mu)$, defined for posteriors $P \in M$ and joint distributions $\mu$, which satisfies

$$\operatorname*{arg\,min}_{P \in M} C(P, \mu_{XY}) = \Pr(\cdot \mid \cdot) \quad \text{whenever } \Pr(\cdot \mid \cdot) \in M,$$

and the continuity property

$$C(P, \hat\mu_N) \to C(P, \mu) \quad \text{whenever } \hat\mu_N \to \mu \text{ weakly}.$$

For example, the negative log-likelihood function

$$C(P, \mu) = -\int_{X \times Y} \log P(y \mid x) \, d\mu(x, y)$$

would do nicely. In this case

$$C(P, \hat\mu_N) = -\frac{1}{N} \sum_{i=1}^N \log P(y_i \mid x_i).$$

Then we can set

$$\hat{P} = \operatorname*{arg\,min}_{P \in M} C(P, \hat\mu_N).$$

The cost function presented here yields an estimate known as the maximum likelihood estimate (MLE).
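As a sanity check of this recipe in the simplest possible case, take $M$ to be the Bernoulli posteriors that ignore $x$ entirely, parametrized by $p = \Pr(y = 1)$. Minimizing the negative log-likelihood over the empirical distribution recovers the sample mean (toy data; a brute-force grid search for transparency):

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 1, 1, 0])  # toy labels

def nll(p):
    # negative log-likelihood of the dataset under Bernoulli(p)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# brute-force the minimizer over a fine grid of candidate p's
grid = np.linspace(0.001, 0.999, 999)
p_hat = grid[np.argmin([nll(p) for p in grid])]
print(p_hat, y.mean())  # both are approximately 0.625
```

Here the MLE lands exactly on the empirical frequency, as the calculus-of-variations argument for the negative log-likelihood predicts.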
If $\Pr(\cdot \mid \cdot) \in M$, then $\hat{P}$ will converge to $\Pr(\cdot \mid \cdot)$ as the number of samples goes to infinity. Unfortunately, it is unlikely that $\Pr(\cdot \mid \cdot) \in M$, so this convergence is also unlikely. This phenomenon is known as bias.
There are more creative things one can do, like making $M$ really large and then assuming a prior over $M$ to combat variance. This is known as maximum a posteriori estimation (MAP estimation), but you will still suffer from bias. The only way to ensure bias is a complete non-issue is to let $M$ be the space of all posteriors and assume a uniform prior. But then you die from variance. F***, you just can’t win!
The bias-variance tradeoff
Data science is more of an art than a science. The art is in choosing $M$ to be large enough that you don’t get crushed by bias, but small enough that you don’t get crushed by variance.
Fortunately these phenomena strike hard in different regimes. For small datasets, convergence at infinity is a pipe dream and bias is the least of your worries; you should worry about variance. Conversely, for large datasets you can reasonably expect to get close to the true value of the posterior, and worrying about bias starts to make sense.
I could tell you how to test for bias and variance, but that’s out of the scope of this blog post. Let’s see an example.
Example: A robot that denies people loans
A loan application consists of a variety of numbers:

- loan amount
- credit score
- historical default rate
We can represent an application as a single vector, and view the process of receiving such applications as a vector-valued random variable, $x \in \mathbb{R}^n$. As time progresses we will see some of these loans get paid back, or default. Let $y$ be the binary-valued random variable which is $1$ for default and $0$ otherwise. Our goal is to find a posterior $\Pr(y \mid x)$, where $\Sigma_X$ is just the Borel $\sigma$-algebra on $\mathbb{R}^n$ and $\Sigma_Y$ is just the standard $\sigma$-algebra on $2$.^{3} Note that $\Pr(\Sigma_Y) \cong [0, 1]$, because any distribution on $2$ objects is represented by a biased coin (i.e. a Bernoulli distribution), and such coins are parametrized by the probability that they turn up heads.
Once we have a dataset $D = \{ (x_i, y_i) \}_{i=1}^N$, we can start to construct $M$. First, consider the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$. This function has the graph depicted below.
The primary reason this is a useful function is because it maps $\mathbb{R}$ to $(0, 1)$. Now we just need a map from $\mathbb{R}^n$ to $\mathbb{R}$ (because $x$ is $\mathbb{R}^n$-valued). I’m tired, so we will just consider linear mappings, or maybe affine mappings (linear + offset). Therefore, let $M$ consist of posteriors of the form

$$\Pr(y = 1 \mid x) = \sigma(w \cdot x + b)$$
for some $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$. This implies $\Pr(y = 0 \mid x) = 1 - \sigma(w \cdot x + b)$, since $y$ only takes the values $0$ and $1$. We see that $M$ is parametrized by $(w, b)$, and so $M \cong \mathbb{R}^n \times \mathbb{R}$.
Given our dataset we consider the (approximate) cross-entropy

$$C(w, b) = -\frac{1}{N} \sum_{i=1}^N \Big( y_i \log \sigma(w \cdot x_i + b) + (1 - y_i) \log\big(1 - \sigma(w \cdot x_i + b)\big) \Big).$$
And we solve for $(\hat{w}, \hat{b}) = \operatorname*{arg\,min}_{w, b} C(w, b)$, which is known as the maximum likelihood estimate.
The function $C$ is convex, and so finding $(\hat{w}, \hat{b})$ is not a problem. This procedure is known as logistic regression.
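A minimal from-scratch sketch of logistic regression, minimizing the cross-entropy above by gradient descent (synthetic toy data; in practice you would reach for a library such as scikit-learn):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))                                  # one feature, e.g. loan amount
y = (x[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)   # noisy default label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(1), 0.0
for _ in range(2000):
    p = sigmoid(x @ w + b)                # Pr(y = 1 | x) under the current (w, b)
    w -= 0.5 * (x.T @ (p - y)) / len(y)   # gradient of the cross-entropy in w
    b -= 0.5 * np.mean(p - y)             # ... and in b

accuracy = np.mean((sigmoid(x @ w + b) > 0.5) == (y > 0.5))
print(accuracy)  # most points classified correctly on this nearly separable data
```

Because $C$ is convex, plain gradient descent finds the global minimizer; the only tuning here is the step size and iteration count.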
What next?
It should be obvious that supervised learning is generically useful. Via supervised learning our society could make robots for all sorts of things:

- Robots for parsing legal documents
- Robots for recommending products
- Robots for identifying faces in photos
- Robots to detect suspicious objects at the airport
- Robots for hiring and firing people at a company
- Robots for firing at people on the battlefield
- Robots for rendering legal verdicts such as life sentences, or death sentences
… so that’s cool.
Footnotes

The category $\mathbf{Prob}$ is related to the category of probabilistic relations. If you’re a fan of categorical approaches to mathematics, I recommend this article. You might also enjoy my blog series on the topic: part I, part II, and part III ↩

You can derive $\mathbf{Rel}$ as the image of the category of spans in $\mathbf{Set}$ under the operation of projecting each span to its associated product. Same goes for $\mathbf{ProbRel}$ with respect to spans in $\mathbf{Prob}$. ↩

I like this notation more than something like $\{0, 1\}$. In any course on the foundations of math you would learn to define the natural numbers as sets. In particular $0 = \emptyset$, and then we define the successor operation $S(n) = n \cup \{n\}$. In particular, $2 = S(S(0)) = \{0, 1\}$. ↩