The New Flesh

Categorification of Markov Decision Processes

Harlan Ichikawa

In earlier posts I re-discovered the category of stochastic relations. To me this is a very useful category to get comfortable with if you’re a machine learning practitioner with a taste for mathematical abstraction. In this post we will see how this category, literally, reduces the definition of markov decision processes to cartoon sketches of arrows and dots.


The main take-aways of this post are that within the category of stochastic relations, :

  • discrete time markov models
  • hidden markov models
  • markov decision processes
  • and partially observeable markov decision processes

are given by the diagrams

markov, hmm, mdp, pomdp

and the categories for each of these items is (at least in spirit) a functor category of the form where is replaced by each of the diagrams above.

TL;DRs are so useful! You understood everything just said, and now you don’t need to read this post. It’s a wonder I wasted time writing it. Toodeloo!

The category of stochastic relations

The category of stochastic relations, , is basically the category of random variables. The objects of are measureable spaces. The morphisms are what engineers might call “stochastic maps” and what mathematicians call conditional probability distributions. The composition of two conditional probability distributions is (at least hueristically) given by

Obviously you replace integrals with sums in the case where is discrete. The discovery of this category is generally credited to an unpublished note by Lawvere in 1962.1

Notable limits within

The terminal object of is the trivial measureable space, . A morphism, is equivalent to a probability distribution over , i.e. .

Within the category of sets, , the limit of the diagram is given by the graph, . Within the same construction holds. In this case we have a stochastic map, , and the limit is the measureable space

which is a measureable sub-space of .

Determinism and measureable maps

The category of measureable spaces is a subcategory of . The objects of both categories are identical, and for each measureable map, , we can define the stochastic map, by

for a measureable set and . We call such morphisms deterministic. There is a universal property which characterizes a the deterministic maps … they are the least fuzzy… Okay, to be the honest I’ve not tightened all the screws here, but here goes.

A natural (re)definition of determistic

The deterministic maps are special within because they preserve some notion of locality. So, I’ll make an attempt to nail down what locality means in . Firstly, Each measureable space, is naturally equipped with a collection of (measureable) inclusion maps, for each measureable . Perhaps you can question if it’s really so natural. I’d say it is because I believe in the following

Conjecture: The monomorphisms of are (up to isomorphism) the inclusion maps which embed measureable subsets into measureable spaces.

Certainly the inclusion maps are monic. I believe the converse holds as well (anything monic is isomorphic to one of these inclusion maps). Dually, I also believe that the epimorphism are the deterministic surjections.

Corollary: A stochastic relation, , is deterministic if and only if for each inclusion map, , the pushout consists of an inclusion map and a stochastic relation, so that we get the commutative square

The category of Markov models

A discrete time Markov process is one where we witness a random sequence of objects existing withing some measureable space , where the dependency between subsequent observations satisfies . In fact, the entire systems is characterized by the conditional probability function , sometimes referred to as the transition function. In other words, a discrete time Markov process is one whose diagram in is given by


These models form a category, denoted , in the obvious way. The objects are endomorphisms, , and an arrow from to is given by a similarity transformation, i.e. a morphism such that . In Lawvere’s 1962 note1, which is credited for kicking off the categorification of probability theory, making explicit was (at least ostensibly for funding purposes) a prime motive. In that pre-moon-landing era note, Lawvere identified , where was viewed as a monoidal category. Now days we might write something more like

The category of hidden Markov models

Traditionally, a hidden markov model consists of a markov process, as well as a stochastic observation of the state. Formally, this is given by two maps, the stochastic transition map which sends and the stochastic observation, . More precisely, a hidden Markov model is nothing but a diagram


We can form a category, , in the obvious way. The objects of are hidden markov models, and an arrow in from to is a pair of morphisms, and of such that and . In other-words we have a functor category, , where is the diagram depicted above.

At this point a pattern has been established for this blog post. We will continue in our caveman like fashion to

  • look at
  • diagram
  • define functor-category

Let’s do this next for Markov Decision processes.

Markov decision process

A markov decision process is, roughly speaking, a controllable markov process with rewards. The goal is to pull levers in the correct way in order to maximize a cumulative reward. In this section we will discover that all Markov decision processes can be represented as a diagram within the category . Before doing that we’ll go over the traditional definition. The traditional definition is needlessly restrictive, but let’s go through it for the sake of tradition.


The traditional definition

Many textbooks define an MDP with states and actions as a triple where , expresses that if you’re in state and apply action , then you will end up in a random state drawn from . The reward function, , is something that rewards us when the transition is executed. Lastly, is a real number between 0 and 1, called the “discount rate”.

“Solving” and MDP means finding a policy, which maximizes

where denotes the expected value of the reward given at time .

A superior definition

The traditional definition is needlessly restrictive, and is frequently violated in practice. In many cases the set of actions depends on the state. For example, consider an -state system, where all the states are numbered through , and the actions are to increase/decrease the state by one. Then there are two possible actions on states through , but only one possible action at the boundaries and . Perhaps this is gnit-picky, but it’s easy to fix.

Here’s the fix. We assume the existence of an epimorphism , and this map should be deterministic as well, however I conjecture that follows automatically from being epic. The action space for a given state is given by the pre-image . The transition map is a stochastic morphism . Thus, the kinematics of an MDP are given by the diagram

This picture is not quite complete. It’s not an MDP unless there is a reward function. However, I have another gnit-pick regarding the traditional definition of the reward function. Traditionally, the reward function is defined over the entirety of . This is excessive. The reward function need only be defined on transitions generated by . A more precise definition is that the reward function is a stochastic relation from the (stochastic) graph of to the reals.2

In summary, an MDP is given by a triple where is deterministic-epic and the triple satisfies the diagram


The category of

We can an MDP onto another MDP using a morphisms , an affine orientation preserving transformation , and a morphism such that

, ,

In this sense, the triple is an arrow of to . This has the feel of a functor category, i.e. where is the commutative diagram of an MDP depicted in the last sub-section.3

Solutions to MDPs

The solution to an MDP is a policy, some way of choosing an action given the state. A policy, under this definition, is a section of . i.e. a stochastic relation such that

Given a policy, we can form the composition . Thus a policy transforms an MDP into a markov process. A solution of an MDP is a policy that maximizes the value function

Now some frantic hand-waving. I’m pretty sure the mapping which takes an MDP and gives you it’s solution yields two functors. One which maps an MDP to the optimal policy , and another which maps to the markov process . I suspect, this is possible by viewing as an totally ordered set, which induces a total ordering on the policies, and so we can take limits (in the categorical sense) and cobble together a functor. The solutions are not always unique :(. If we consider the subcategory of MDPs with unique solutions, or cleverly “quotient away” enough redundancies, I suspect this is possible.

Partially observable markov decision processes

A partially observable Markov decision process (POMDP) is to an MDP as a hidden Markov model is to a Markov model. The traditional definition does little more than append a stochastic map of observations to the standard definition of an MDP. A little more care should be applied in my opinion though. I think it’s reasonable to assume that the set of actions one can take in a given state be observable (if you can’t see them, they wouldn’t be actionable).

As a result, we assume a (deterministic) epimorphism, . A pomdp consists of a tuple where

  • relates actions to observations
  • is the observation map
  • denotes the observation space
  • denotes the state space
  • is the transition map
  • is the reward function
  • is the discount factor.

The is just the pull-back of along . As a set,

and is equipped with the two natural projections “” and “”. All these maps allow us to diagram a generic POMDP as


The category of POMDPs

Again, we can form a category of POMDPs by imagining the relevant commutative squares. A POMDP where

can be mapped to another POMDP by maps

which generates a new POMDP as one which makes the squares commute

, , ,

so again we may define the category as something like as functor category, where is the diagram associated to POMDP’s depicted earlier.

(Not) Solving POMDPs

Just like the solution to an MDP is a policy that tells one what action to take, a solution to a POMDP is going to be a policy as well. What is a policy in this context? Nobody knows!!!! The answer to this question continues to be debated within the reinforcement learning community. Some believe a policy should be a section of , i.e. a map from . Personally, that sounds ridiculous. In the theory of control of LTI systems, you look at the history of observations to make an informed decision. So some advocate for a map from histories of ’s to actions. I have my own ideas, but they’ll have to wait for another blog post.


In this post basically nothing useful was accomplished. Have a nice day.

Full disclosure: This post is a primer for the next post, where actually useful work will be done. Solving POMDPs (and even defining the solution space) continues to perplex ppl, and this abstract nonsense could be helpful in clearing the crud and proposing an eleagant and “simple” solution. There are standard techniques to solve MDPs. If we can build a faithful functor , we can solve POMDPs using the same techniques. This will be attempted in the next post.


  1. Lawvere, “The Category of Probabilistic Mappings”, 1962 link  2

  2. There is a caveat. The reward function maps to the reals viewed as a measureable vector-space with orientation. This is because we take expected values of sums of rewards (thus we need vectoriness) and we maximize cumulative rewards (thus we need to know which way is heaven) 

  3. I can’t equate with because our diagram has extra information in it (the diagram is more than a cartoon of a small category). For example, the epicness of and the fact that reward function maps to viewed as an oriented vector-space. 

Theme "Swiss" designed by broccollini