Prologue
This will be the first in a three-part series where I grapple with the foundations of probability theory. Caution: If you find commutative diagrams to be impenetrable, then you might not get much out of this series.
The status quo
Does this sentence make sense?
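Given functions $f: \mathbb{C} \to \mathbb{R}$ and $g: \mathbb{R} \to \mathbb{R}$, find the numbers $x$ such that $f(g = x) > 0$.
Rhetorical question. It’s nonsense. What does $f(g = x)$ even mean!? $f$ is a map which eats complex numbers, yet we are feeding it $g = x$. WTF.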
Now consider the following
Given the probability space $(\Omega, \mathcal{F}, \Pr)$ and the random variable $X: \Omega \to \mathbb{R}$, find the numbers $x$ such that $\Pr(X = x) > 0$.
Okay. So that sentence made sense, right? The sentence is commanding that we find all the numbers $x$ which $X$ has a positive probability of hitting.
What is weird, to me at least, is that these two text boxes are virtually identical. At least at the level of mathematical abstractions. We’ve merely substituted $f$ with $\Pr$ and $g$ with $X$. Moreover, the same issue arises. $\Pr$ is a function on $\mathcal{F}$, so how are we feeding it a real-valued map, $X$?
I think the reason we understand the second box, but not the first is because the second box has contextual clues which hint at the intended meaning. Probability theory relies heavily on contextual clues to compensate for awful notation.
In terms of communications theory, the signal is highly redundant and therefore robust to noise. What sucks is that all the noise is in the mathematical ink. This is the opposite of where it should be. If anything, the prose should be conveying the high-level description, and the math should be the hyperprecise component which fills in the details.
Devil’s advocate: If we understand the second box, then what’s the problem? The point of words and notation is to convey semantics, by hook or by crook.
Well Keanu Reeves, you do have a point. And I actually agree in principle. However, all too often this sloppy notation actually catapults the semantics into a fog. When I was first presented with random variables $X, Y$, and then asked to consider the posterior distribution $\Pr(x \mid y)$, this is roughly the conversation I had with myself.
Is this actually a distribution? Over what space, is it the values of $X$? Yeah… makes sense. But it has two arguments, $x$ and $y$. Wait… are $x$ and $y$ arguments? I thought they were random variables. Oh, I get it! This is just shorthand notation for the probability distribution over $X$, once we know what $Y$ is…. Wait?? So it is a distribution.
 me being confused (500 BC)
Bottom line, Keanu, I think you should stop advocating for Satan.
The problem
The first issue is really about the nature of the expression “$\Pr$”. Is it a function or not? Your textbook will say that it is a function. Unfortunately, it’s the only “function” in the discipline of mathematics whose domain appears to change from one line of text to the next. First it’s $\Pr(A)$, then $\Pr(X = x)$ or worse $\Pr(x)$ (which is a notational abuse of $\Pr(X = x)$, which is incidentally pretty good notation for reasons which will be described). Finally, you run into $\Pr(x \mid y)$? A function has a single domain! This function takes $A$’s, booleans like $X = x$, and then it takes in $x$ and $y$ at the same time.
Fixes
Fix 1: Stop lying
Let’s get one thing straight. $\Pr$ is a function. I’m compelled to say this because the following sentiment is the current state of pedagogy:
$\Pr$ should really not be taken so seriously. It really should be seen as casual shorthand for “probability of”
 Anonymous friend of mine who teaches statistics
This trick is pretty useful for getting through the textbook without pulling your hair out. Perhaps this is because this is how the textbook writer thinks of $\Pr$. Nonetheless, this approach only works because it provides comfort in a convenient lie. If $(\Omega, \mathcal{F}, \Pr)$ is a probability space, then $\Pr$ is a function from the σ-algebra, $\mathcal{F}$, into the unit interval $[0, 1]$. Stop lying. Lying sucks.
Fix 2: Stop sucking (start pushing)
If $(\Omega, \mathcal{F}, \Pr)$ is a probability space, and $X: \Omega \to \mathbb{R}$ is a random variable, then an expression like $\Pr(B)$ with $B \subset \mathbb{R}$ makes no sense (unless $\mathcal{F}$ is the Borel σ-algebra on $\mathbb{R}$). It’s only upon looking at the examples that it becomes clear that what is really meant is the pushforward. The map $X$ transforms the probability space $(\Omega, \mathcal{F}, \Pr)$ into a new one living on $X(\Omega)$. Here $X(\Omega)$ is just the image of $X$, and $X_*\Pr$ is the unique probability measure on $X(\Omega)$ defined by $(X_*\Pr)(B) = \Pr(X^{-1}(B))$. We call $X_*\Pr$ the pushforward of $\Pr$ by $X$. In measure theory, this notation is standard (see pushforward measure). I don’t see a compelling reason for probability theory and statistics to resist it.
Just to be clear, $X_*\Pr$ is a probability function on $X(\Omega)$. This means that $X_*\Pr$ takes measurable subsets of $X(\Omega)$ and outputs real numbers between 0 and 1.
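To make the pushforward concrete, here is a minimal Python sketch on a finite sample space. Everything in it is an illustrative assumption rather than anything from a library: the two-coin-flip space, the random variable `number_of_heads`, and the helper names `pushforward` and `measure`. On a finite space a measure is determined by its point masses, so a dictionary is enough.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical sample space: two fair coin flips, each outcome has mass 1/4.
pr = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}

def number_of_heads(w):
    """A random variable X on the sample space, with values in {0, 1, 2}."""
    return w.count("H")

def pushforward(pr, X):
    """Point masses of the pushforward: the mass at x is Pr(X^{-1}({x}))."""
    out = defaultdict(Fraction)  # Fraction() is 0
    for w, p in pr.items():
        out[X(w)] += p
    return dict(out)

def measure(masses, B):
    """Evaluate the measure with the given point masses on a subset B."""
    return sum((p for x, p in masses.items() if x in B), Fraction(0))

heads_law = pushforward(pr, number_of_heads)
print(heads_law)                   # masses on the image {0, 1, 2}
print(measure(heads_law, {1, 2}))  # probability of at least one head
```

Note that `heads_law` takes subsets of the image of the random variable, not subsets of the original sample space, which is exactly the distinction the pushforward notation keeps visible.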
Joint probabilities
Consider the case of two random variables on the same probability space, $X, Y: \Omega \to \mathbb{R}$. The joint probability “$\Pr(x, y)$” is a pushforward as well. Consider the random variable $(X, Y): \Omega \to \mathbb{R}^2$. Then the joint probability is $(X, Y)_*\Pr$. No need to obfuscate when it’s only a few extra pen strokes.
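The same idea can be sketched in Python for a joint probability: pair the two random variables into one map and push the measure forward along it. The sample space, the two random variables, and the helper names `pair` and `pushforward` are all assumptions made up for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical sample space: two fair coin flips.
pr = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}

def first_flip(w):
    """Random variable X: the result of the first flip."""
    return w[0]

def number_of_heads(w):
    """Random variable Y: the total number of heads."""
    return w.count("H")

def pair(X, Y):
    """The single random variable w -> (X(w), Y(w))."""
    return lambda w: (X(w), Y(w))

def pushforward(pr, Z):
    """Point masses of the pushforward of pr along Z."""
    out = defaultdict(Fraction)
    for w, p in pr.items():
        out[Z(w)] += p
    return dict(out)

# The joint probability is just the pushforward along the paired map.
joint = pushforward(pr, pair(first_flip, number_of_heads))
print(joint)
```

Pairs that can never occur, such as a first flip of heads together with zero heads in total, simply carry no mass and do not appear in `joint`.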
Fix 3: Say what you mean
Typically, this use of lowercase $x$ for a “realization” of a random variable $X$ is just a sloppy way to avoid dealing with the fact that $X$ is a function from one space to another. However, writing $\Pr(x)$ makes no sense when you’ve not defined $x$. Calling it a “realization” does not define it. All you get is an alias to an undefined symbol. The correct expression is $\Pr(X^{-1}(x))$ (recall, $X: \Omega \to \mathbb{R}$ is a random variable, so its preimages are members of $\mathcal{F}$, which is the domain of $\Pr$). However, I can see a good defense for writing $\Pr(X = x)$ instead of $\Pr(X^{-1}(x))$. It’s customary in the rest of mathematics to write “$X = x$” as a shorthand for the set $\{\omega \in \Omega : X(\omega) = x\}$. The expression $\Pr(X = x)$ is consistent with this standard.
Similarly, I think $\Pr(X \in B)$ is a pretty good replacement for the more verbose $\Pr(X^{-1}(B))$. Nonetheless, it’s good to be aware you’re making these abuses. Such shorthands should be deliberate, not incidental or subconscious. Math without precision is basically really bad poetry (i.e. unstructured BS).
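As a small illustration of what the shorthand unpacks to, the Python sketch below computes a preimage explicitly and only then applies the probability function to it. The fair die, the random variable `parity`, and the helpers `preimage` and `prob` are hypothetical names chosen for this example.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair die.
pr = {w: Fraction(1, 6) for w in range(1, 7)}

def parity(w):
    """A random variable on the die with values 'even' or 'odd'."""
    return "even" if w % 2 == 0 else "odd"

def preimage(X, sample_space, B):
    """The event {w : X(w) in B}, a subset of the sample space."""
    return {w for w in sample_space if X(w) in B}

def prob(pr, event):
    """The probability function: it eats events, i.e. sets of outcomes."""
    return sum((pr[w] for w in event), Fraction(0))

# "Pr(X = even)" is shorthand for Pr applied to this preimage.
event = preimage(parity, pr, {"even"})
print(event, prob(pr, event))
```

The only thing `prob` ever receives is a set of outcomes; the random variable enters solely through `preimage`.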
Fix 4: The nuclear option
This last fix is probably a bad idea, but it’s my favorite… It’s the best idea. In fact I’m going to dedicate a section to it.
The nuclear option
“Dr Strangelove or: How I Learned to Stop Worrying and Love the Bomb” (1964)
I think reconsidering the foundations is worthwhile. Recall, a probability space is a triple $(\Omega, \mathcal{F}, \Pr)$ where $\Omega$ is a set, $\mathcal{F}$ is a σ-algebra and $\Pr$ is a probability measure. Lastly, we should note that a measurable map is nothing but a σ-algebra morphism.
When I put my algebra hat on, the following perturbs me. A measurable map ought to transform the probability space $(\Omega, \mathcal{F}, \Pr)$ into a probability space over the other σ-algebra. However, it is not generally the case that the natural candidate (the original measure composed with the induced morphism) is a probability distribution.
For example, let $A \in \mathcal{F}$ (i.e. a measurable subset of $\Omega$), then we can consider the inclusion map $\iota: A \hookrightarrow \Omega$, which then lifts to an obvious algebra morphism, embedding the σ-algebra $\mathcal{F}_A$ of measurable subsets of $A$ into $\mathcal{F}$. However, the pullback $\iota^*\Pr$, i.e. $\Pr$ restricted to $\mathcal{F}_A$, is not a probability distribution on $\mathcal{F}_A$ because $\Pr(A) \neq 1$ (at least not generally speaking). It’s too bad. I really wanted the pullback to be the posterior $\Pr(\cdot \mid A)$. (For those in the know, my gripe is that probability densities, defined in this way, do not form a category.)
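A quick numerical check of this complaint, as a Python sketch on a finite space: restricting a probability measure to a measurable subset leaves a total mass below 1. The fair die, the subset `A`, and the helper `restrict` are assumptions made for the illustration.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair die.
pr = {w: Fraction(1, 6) for w in range(1, 7)}

# A measurable subset of the sample space.
A = {1, 2}

def restrict(mu, A):
    """Pull the measure back along the inclusion of A: keep only masses in A."""
    return {w: m for w, m in mu.items() if w in A}

restricted = restrict(pr, A)
total = sum(restricted.values(), Fraction(0))
print(total)  # the total mass is Pr(A), not 1

# The traditional fix is to divide by Pr(A) by hand to get the posterior.
posterior = {w: m / total for w, m in restricted.items()}
print(posterior)
```

The division in the last step is the manual normalization that the rest of the section tries to make unnecessary.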
So what to do… what to do…
Let’s just redefine what a probability measure is, so that it automagically normalizes itself. To begin, let’s formalize this normalization business.
Given two measures $\mu$ and $\nu$, we will say they are probabilistically equivalent if $\mu = c\,\nu$ for some scalar $c > 0$. In words, two measures are probabilistically equivalent if they are histograms of the same probability distribution. Noting that a single probability distribution has an entire equivalence class of measures associated with it, we can turn this thinking on its head^{1} and define a probability distribution as an equivalence class.
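On a finite set, this equivalence can be tested directly, as in the Python sketch below. The function name `probabilistically_equivalent` and the coin-toss numbers are hypothetical; the check is simply that the ratio of masses is the same positive constant wherever either measure is nonzero.

```python
from fractions import Fraction

def probabilistically_equivalent(mu, nu):
    """True if mu = c * nu for a single scalar c > 0 (finite, nonnegative masses)."""
    ratios = set()
    for k in set(mu) | set(nu):
        m, n = mu.get(k, 0), nu.get(k, 0)
        if m == 0 and n == 0:
            continue          # both vanish here: no constraint on c
        if m == 0 or n == 0:
            return False      # supports differ, so no c > 0 can work
        ratios.add(Fraction(m) / Fraction(n))
    # Exactly one ratio means one scalar c; the zero measure lies on no ray.
    return len(ratios) == 1

# A histogram of 100 hypothetical coin tosses and the distribution it represents.
histogram = {"H": 30, "T": 70}
distribution = {"H": Fraction(3, 10), "T": Fraction(7, 10)}
fair = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

print(probabilistically_equivalent(histogram, distribution))
print(probabilistically_equivalent(histogram, fair))
```

Exact rational arithmetic is used so that the ratio comparison is not disturbed by floating-point rounding.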
What exactly are these equivalence classes? They are simply rays contained within the space of measures. Recall, a ray of a vector space (or a cone in this case) is just a semi-infinite line segment. There is a natural map from any vector space to the space of rays through the origin (for each nonzero vector, just take the ray through the origin which passes through it). The reason this ray business is relevant is because of the way rays transform. Transforming one ray into another has no effect on its “size”, a notion which is not defined. They are purely directional entities, and there is never a need to take a (possibly infinite) norm for the sake of normalization. So we arrive at the following redefinition of a probability measure.
Definition: Let $\mathcal{F}$ be a σ-algebra. A neoprobability is a ray within $\mathcal{M}(\mathcal{F})$, the space of measures.
Okay, I realize that definition might look weird (unless you’re into quantum stuff, maybe).
First, just to check we aren’t leaving our sanity behind, I’ll tie this new definition to the traditional definition. Given a traditional probability space $(\Omega, \mathcal{F}, \Pr)$ the corresponding ray is simply given by $\pi(\Pr)$, where $\pi$ is the quotient map from nonzero measures to rays. In other words, the rays we are concerned with are exactly the rays given by the traditional probability density, when viewed as elements of $\mathcal{M}(\mathcal{F})$.
Now let’s see how things behave under the embedding $\iota: \mathcal{F}_A \hookrightarrow \mathcal{F}$. Note that $\iota$ acts naturally on $\mathcal{M}(\mathcal{F})$ by pullback, and we can naturally project this into a map on the space of rays, via the quotient projection, $\pi$, à la the commutative diagram:
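A sketch of such a square in LaTeX (using the `CD` environment from `amscd`) is given here under assumed notation: $\mathcal{M}(\cdot)$ for the space of measures on a σ-algebra, $\iota^*$ for the pullback along the embedding, $\pi$ for the quotient onto rays, and $\operatorname{Rays}(\cdot)$ for the space of rays. These symbols are assumptions, and the bottom arrow is only defined on rays whose restriction is nonzero.

```latex
\begin{CD}
\mathcal{M}(\mathcal{F}) @>{\iota^*}>> \mathcal{M}(\mathcal{F}_A) \\
@V{\pi}VV @VV{\pi}V \\
\operatorname{Rays}\bigl(\mathcal{M}(\mathcal{F})\bigr) @>>> \operatorname{Rays}\bigl(\mathcal{M}(\mathcal{F}_A)\bigr)
\end{CD}
```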
In particular, given a probability, $\Pr$, over $\mathcal{F}$, the ray in the bottom right corner of the diagram is that of the “posterior” $\Pr(\cdot \mid A)$. In other words, conditional probabilities arise naturally (via the restriction maps) when we use this new definition!
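The claim can be checked on a finite space with a short Python sketch: restrict any representative of a ray to a subset, and the resulting ray is the same no matter which representative was used. To compare rays, the sketch picks the unit-mass representative of each. The fair die, the subset `A`, and the helper names are assumptions for illustration, and the subset is assumed to have positive mass.

```python
from fractions import Fraction

def restrict(mu, A):
    """Pull a measure back along the inclusion of A: keep only masses in A."""
    return {w: m for w, m in mu.items() if w in A}

def unit_mass_representative(mu):
    """Pick the representative of the ray through mu that has total mass 1."""
    total = sum(mu.values())
    if total == 0:
        raise ValueError("the zero measure does not lie on any ray")
    return {w: Fraction(m) / total for w, m in mu.items()}

# Two representatives of the same ray: a probability measure and a histogram.
pr = {w: Fraction(1, 6) for w in range(1, 7)}
counts = {w: 50 for w in range(1, 7)}

A = {2, 4, 6}
from_pr = unit_mass_representative(restrict(pr, A))
from_counts = unit_mass_representative(restrict(counts, A))
print(from_pr)
print(from_pr == from_counts)
```

The normalization here is only used to print a comparable representative; the restriction step itself never needs it, which is the point of working with rays.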
Similar findings
 Marginals are pushforwards on cartesian products.
 Intersection of measurable sets, “$B \mapsto A \cap B$”, is a σ-algebra morphism.
 Posteriors are arrows of a category and Bayes’ theorem is nearly a tautology (this is part 3).
Okay, I can see your face. I’ll stop. Hope you guys check out part 2.
Footnotes:

1. In my opinion, this is actually putting our feet on the ground. However, I understand that it might not feel natural if you’ve spent your life walking on your hands.