Introduction
There’s an incredibly cute technique for sampling from the stationary distribution of a Markov chain, when it exists. This quick article gives a Haskell-flavoured overview. I discovered this trick in a post by the inimitable Dan Piponi.
Some Haskell throat-clearing.
{-# LANGUAGE LambdaCase #-}
module Main where

import System.Random.MWC.Probability

-- between 0 and 1
udbl :: Prob IO Double
udbl = uniform

replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n fa = sequenceA (replicate n fa)
Jumping between editors
Let’s start with a problem: we’re interested in modelling some aspects of a developer, namely how we tend to cycle through various editors randomly, only sometimes settling down on one forever.
data Editor = Emacs | Vim | VsCode deriving (Eq, Show, Ord, Enum, Bounded)
This is going to be our state space: S = \{ \text{Emacs}, \text{Vim}, \text{VsCode} \}.
A Markov chain over such a state space looks something like this:
Putting the unrealistic probabilities aside for a second: the states are the circles, and the “probabilistic jumps” are the arrows, with the labels expressing the probability of following that jump if the current state is the source of the arrow. For our process to be a Markov chain, it needs to have the so-called Markov property, meaning that the jumps can only depend on the current state, not any of the states we visited previously during our “jumping-around”.
Note: the numbers must be such that the probabilities of all the possible jumps one can do from any given state sum up to 1.
For each state s \in S, we effectively have a probability distribution over S of where we can go next. Since we only have 3 states, we can for example throw the 3 probabilities for jumping out of Emacs in a vector: \mathbf{x}_{\text{emacs}} = \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix}. We can even “concatenate” the vectors for Emacs, Vim and VsCode so as to get a 3x3 matrix:

P = \begin{pmatrix} 1 & 0.2 & 0.5\\ 0 & 0.7 & 0.2\\ 0 & 0.1 & 0.3 \end{pmatrix}
The first column shows the probabilities of staying with Emacs or going to Vim and VsCode respectively, if we are currently using Emacs. The same logic applies to the second and third columns, with respectively Vim and VsCode as the current editor. Similarly, the first row will always indicate the probability of reaching Emacs, the second row Vim, the third row VsCode.

Finally, each column adds up to 1. Such matrices are called stochastic matrices and have quite a number of interesting properties and applications - more on Wikipedia.
If we have a probability distribution over our editors, as a vector, e.g.

\mathbf{x} = \begin{pmatrix}p_{\text{emacs}} \\ p_{\text{vim}} \\ p_{\text{vscode}}\end{pmatrix}

saying that at some point in time we have a probability p_{\text{emacs}} of being in the Emacs state and so on, then we can multiply P by the vector and get an “updated distribution”, reflecting the probabilities of using each editor after one step of our Markov chain, starting with the prior/assumptions on the current state given by \mathbf{x}.
For example, let’s say there’s a developer friend of yours who never quite told you exactly which editor they were using, perhaps fearing mockery or trolling. But based on what you know, you can establish some guesstimates about the likelihood that they use Emacs, Vim or VsCode right now, e.g. you could think there’s a 70% chance they use VsCode, 20% Vim, 10% Emacs.

\mathbf{x}_{\text{friend}} = \begin{pmatrix}0.1 \\ 0.2 \\ 0.7\end{pmatrix}

Then what happens when we put our friend through the Markov chain?

P \cdot \mathbf{x}_{\text{friend}} = \begin{pmatrix} 1 & 0.2 & 0.5\\ 0 & 0.7 & 0.2\\ 0 & 0.1 & 0.3 \end{pmatrix} \begin{pmatrix}0.1 \\ 0.2 \\ 0.7\end{pmatrix} = \begin{pmatrix}0.49 \\ 0.28 \\ 0.23\end{pmatrix}

We can already see “the Emacs effect” kicking in.
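To make this update concrete, here is a minimal sketch (not from the original post) of the matrix-vector product on plain lists, with the hypothetical helper stepDist taking the rows of P:

-- one step of the chain applied to a distribution vector;
-- 'p' holds the rows of P
stepDist :: [[Double]] -> [Double] -> [Double]
stepDist p x = [ sum (zipWith (*) row x) | row <- p ]

-- stepDist [[1, 0.2, 0.5], [0, 0.7, 0.2], [0, 0.1, 0.3]] [0.1, 0.2, 0.7]
-- gives [0.49, 0.28, 0.23] (up to floating-point rounding)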
In Haskell
A natural way to encode this in Haskell, using e.g. Jared Tobin’s mwc-probability package (I highly recommend his blog for anyone interested in functional programming and probabilistic programming), could be:
editorChange :: Editor -> Prob IO Editor
editorChange e = do
  x <- udbl
  return $ case e of
    Emacs  -> Emacs
    Vim    -> if x < 0.7 then Vim else if x < 0.8 then VsCode else Emacs
    VsCode -> if x < 0.3 then VsCode else if x < 0.5 then Vim else Emacs
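If we wanted to watch this process unfold, we could iterate it; stepsFrom is a hypothetical helper, not from the original post:

-- run n steps of the chain forward, starting from a given editor
stepsFrom :: Int -> Editor -> Prob IO Editor
stepsFrom 0 e = return e
stepsFrom n e = editorChange e >>= stepsFrom (n - 1)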
For the sake of this article, we will use a slightly different representation, where we will sample a state transition function f : S \rightarrow S instead of sampling a new state given the current one. Both formulations are equivalent, but the function sampling view will prove useful.
editorChange :: Prob IO (Editor -> Editor)
editorChange = do
  x <- udbl
  return $ \case
    Emacs  -> Emacs
    Vim    -> if x < 0.7 then Vim else if x < 0.8 then VsCode else Emacs
    VsCode -> if x < 0.3 then VsCode else if x < 0.5 then Vim else Emacs
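The two formulations really are interchangeable: applying a sampled transition function to the current state recovers the previous version. A minimal sketch (editorChangeAt is a hypothetical name):

editorChangeAt :: Editor -> Prob IO Editor
editorChangeAt e = fmap ($ e) editorChange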
Where to?
We’re usually very interested in the behaviour of such Markov chains as we perform jumps one after another. Does the process tend towards some sort of limit, a so-called stationary distribution?
Building on our “sampling a state transition function” view from earlier, we can observe that performing n steps amounts to sampling n state transition functions f_1, ..., f_n : S \rightarrow S and composing them: f_n \cdot ... \cdot f_1 : S \rightarrow S, where the transition function for the latest step, f_n, gets applied last.
In our example, we might for instance start on VsCode, go back and forth between Vim and VsCode from there, but eventually jump to Emacs for the first time and stay there, obviously. Many other “paths” are possible, like staying on Vim for 10000 steps only to switch to VsCode for the rest of time on the next one, or just hopping between Vim and VsCode forever, or even staying forever on one of those.
But can we somehow characterize the likelihood of using a certain editor after n steps as n \rightarrow \infty? And more importantly, can we sample from that distribution?
The spectral approach
One way to approach this is to simply solve the following equation for \mathbf{x}:
P\mathbf{x} = \mathbf{x}
where \mathbf{x} = \begin{pmatrix}x_1 \\ x_2 \\ x_3\end{pmatrix} \in \mathbf{R}^3.
Such an \mathbf{x} would be a probability distribution over the state space S that is “left untouched by P”. That is, we’re looking for a distribution over our editors such that running one more step of our editor-jumping process doesn’t change the likelihood of being in Emacs, Vim or VsCode: we’d get the same distribution.
One way to find such a vector is to use your favorite linear algebra tools to find an eigenvector for the eigenvalue of 1: the distribution we’re after is a multiple of such an eigenvector, with all the entries summing up to 1. (There may be more than one stationary distribution though.)
For our example, this would help us find out that the vector \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix} fits the bill.
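Indeed, multiplying P by this vector just selects the first column of P, which is that same vector:

P \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix} = \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix}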
The interpretation of this vector is that if, before doing our editor-jumping step, we know that the current editor is Emacs, then our Markov chain isn’t going to send us anywhere else. (By design, mwahahaha.)
While we can’t leave Emacs if we reach it, there’s a nonzero chance to leave the other editors (including for Emacs), so eventually we’re bound to go to Emacs and stay there. Therefore, the “limit” of our editor-jumping process is a distribution that just gives you Emacs all the time…
Note however that this story is only nice when the state space S is finite, and ideally not too large. If you have, say, 1 million different states, we’re talking 10^6 \times 10^6 matrices of Doubles, which if I’m not mistaken is about 8TB of data if we store all the coefficients. Even if we assume the matrices are sparse, let’s say 10% non-zero values, that’s still going to be 800GB and an insane amount of work.
Forward simulation
In those cases where the exact solving approach above isn’t practical, a plan B is badly needed. Could we instead just run the editor-jumping process enough times and try to observe that it stabilizes around some values after a (possibly large) number of iterations? This sort of iterative approach to solving problems is quite common after all.
Yes we could. Let’s see how.
Alright, so let’s code up a function that runs n steps of the chain by sampling f_1, …, f_n and composing them as described earlier, f_n \cdot ... \cdot f_1:
forwardN :: Monad m => Int -> Prob m (a -> a) -> Prob m (a -> a)
forwardN n sampleF = go 0 id
  where
    go k fk
      | k < n = do
          f <- sampleF
          -- the newly sampled step gets applied last, on top of the run so far
          go (k+1) (f . fk)
      | otherwise = return fk
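For instance, to sample where the chain might land after 100 steps when starting from Vim (a hypothetical usage of the definitions above):

after100 :: Prob IO Editor
after100 = fmap ($ Vim) (forwardN 100 editorChange)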
Perfect. So what n should we give, for our editor example, to get close enough to (or exactly reach) the stationary distribution?
Well… no idea. The best we can do is to pick “a large n” and hope that’s good enough. Yuck. What can we do?
Backward simulation?
Remember this diagram?
What if we flip the arrows between the f_is and therefore compose them the other way around? What happens?
The first thing we can say is that this approach is in some sense equivalent to the forward simulation of n steps: we are drawing n independent functions from the same distribution and composing them, which means that the compositions are drawn from the same distribution.
The difference lies in the order, and how that affects the entire computation. Previously, if we had started with initial state s_0 and performed n steps, landing on state s_n : S, then performing one more step would just amount to drawing an f_{n+1} and computing s_{n+1} = f_{n+1}(s_n). We’re simply running our Markov chain normally, one step further.
But what’s happening here? When we perform one more step in this backward scheme, we add an f_{n+1} to the right of the chain, i.e at the beginning! This new f_{n+1} gets applied to the initial state, and the resulting state will then go into the composition of functions we got for the previous step. We’re effectively restarting or changing the course of the Markov chain at every step! As opposed to extending an ongoing run. Just by flipping the direction of our composition!
backwardN :: Monad m => Int -> Prob m (a -> a) -> Prob m (a -> a)
backwardN n sampleF = go 0 id
  where
    go k fk
      | k < n = do
          f <- sampleF
          -- in forwardN earlier, we were doing:
          --   go (k+1) (f . fk)
          -- so we really do just flip the composition: the newly
          -- sampled step now gets applied first, to the initial state
          go (k+1) (fk . f)
      | otherwise = return fk
Another way to phrase this is: in this new approach, when increasing n by 1 we generate a “new initial state”, f_{n+1}(s_0) : S, with a new random seed, for the n-steps simulation function f_1 \cdot ... \cdot f_n : S \rightarrow S we already have around.
OK, this is fun, but how is it useful? We still have no idea for how long we should go regardless of the direction, and how fast we can approach or reach the stationary distribution. How does “restarting” help at all with any of this?
And this is where my mind was blown…
A look at our random functions S \rightarrow S
We have been talking about those sampled f_i : S \rightarrow S functions a lot, but what do they look like? Here’s the code that generates them, as a reminder.
editorChange :: Prob IO (Editor -> Editor)
editorChange = do
  x <- udbl
  return $ \case
    Emacs  -> Emacs
    Vim    -> if x < 0.7 then Vim else if x < 0.8 then VsCode else Emacs
    VsCode -> if x < 0.3 then VsCode else if x < 0.5 then Vim else Emacs
Depending on the value of x, the uniform Double between 0 and 1 that we sample, we get functions that behave very differently. We can picture all the possible functions editorChange can generate, depending on the interval where x falls, along with the probability of each being sampled.
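Reading the thresholds off the code, and numbering the functions f1 to f5 by the interval x falls in, we get:

- f1, for x in [0, 0.3): the identity, every editor maps to itself (probability 0.3)
- f2, for x in [0.3, 0.5): VsCode goes to Vim, the others stay put (probability 0.2)
- f3, for x in [0.5, 0.7): VsCode goes to Emacs, the others stay put (probability 0.2)
- f4, for x in [0.7, 0.8): Vim goes to VsCode, VsCode goes to Emacs (probability 0.1)
- f5, for x in [0.8, 1): everything goes to Emacs (probability 0.2)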
The first function maps each editor to itself: this is the identity function. The last function maps everything to Emacs. What a lovely function. As per the code in editorChange, we can see that no function even attempts mapping Emacs to anything else.
Alright, so those are our players. What’s up with them?
Coalescence
Let’s picture a sample run with a few steps, and what the composition looks like.
Imagine we start by sampling f1, and then f2. With our backwards simulation scheme, this means we have to apply f2 first. Here is what applying f2 then f1 looks like:
In short:

- if we start on Emacs, we stay there
- if we start on Vim, we stay on Vim
- if we start on VsCode, we end up on Vim
(The states that we cannot reach no matter what initial state we pick are a bit more transparent.)
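To double-check the composition, here is a small sketch with the two sampled functions written out explicitly (f1 and f2 as numbered above):

f1, f2 :: Editor -> Editor
f1 = id                                -- x fell in [0, 0.3)
f2 e = if e == VsCode then Vim else e  -- x fell in [0.3, 0.5)

-- backward scheme: f2, sampled second, is applied first
f1AfterF2 :: Editor -> Editor
f1AfterF2 = f1 . f2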
Now let’s say we sample f5 (which maps everything to Emacs), and “prepend” it to our existing chain of functions, as per our backwards simulation scheme:
This is it. We reached a point where the outcome does not depend on the initial state, meaning we sampled from the limiting distribution. Indeed, whatever function gets prepended in front of f5, its result will get mapped to Emacs.
In such a case, we say that there is coalescence of all the trajectories.
Coalescence doesn’t necessarily require that we sample as brutal a function as f5. We could have gotten it in our example by first sampling f3 and then f4 (therefore yielding the function that first applies f4 and then f3), as shown below:
We are again in a situation where the composition is a constant function (i.e. it doesn’t depend on its argument), which means again that this gives us a sample from the limiting distribution (which, as a reminder, consists in giving probability 1 to Emacs and 0 to the others).
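Again, a small explicit sketch (f3 and f4 as numbered above) can convince us:

f3, f4 :: Editor -> Editor
f3 e = if e == VsCode then Emacs else e  -- x fell in [0.5, 0.7)
f4 e = case e of                         -- x fell in [0.7, 0.8)
  Emacs  -> Emacs
  Vim    -> VsCode
  VsCode -> Emacs

-- f4 is applied first; every editor then lands on Emacs
coalesced :: Editor -> Editor
coalesced = f3 . f4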
In a situation where the limiting distribution is more balanced, we could get other editors coming out as results of this process, merely by combining the right functions to “put the other states out of reach”.
But in this article, we can’t escape Emacs (editorial choice). Once we land there, we stay there. We’re guaranteed to be in Emacs with probability 1 after a sufficient (but finite!) number of steps, however long it takes to sample the right functions to coalesce the trajectories. Prepending new functions doesn’t alter the behaviour of the chain.
Leveraging the trick
Let’s put in code what we discussed in the past couple of sections.
allEqual :: Eq a => [a] -> Bool
allEqual []       = True
allEqual [_]      = True
allEqual (a : as) = all (== a) as

isConstant :: (Bounded a, Enum a, Eq a) => (a -> a) -> Bool
isConstant f = allEqual $ map f (enumFromTo minBound maxBound)

coupleFromPastEB :: (Eq s, Enum s, Bounded s, Monad m) => Prob m (s -> s) -> Prob m (s -> s)
coupleFromPastEB trans = go id (0 :: Int)
  where
    go f k
      | isConstant f = return f
      | otherwise = do
          new_f <- trans
          -- backward scheme again: the newly sampled function is applied first
          go (f . new_f) (k+1)

editorLimit :: Prob IO (Editor -> Editor)
editorLimit = coupleFromPastEB editorChange
That’s it! editorLimit samples from the limiting distribution of our editor-jumping process. And sure enough, we get plenty of Emacs.
main :: IO ()
main = do
  gen <- create
  let n = 100000
  xs <- fmap (map ($ Vim)) . flip sample gen $ replicateA n editorLimit
  putStrLn "is Emacs the best editor?"
  print $ all (==Emacs) xs
ghci> main
is Emacs the best editor?
True
Maths don’t lie.
Why doesn’t this work in the forward simulation?
The coalescence behaviour does not emerge when doing the forward simulation. Indeed, there we append the newly sampled functions at the end of the chain. So if at any point the composition we have is a constant function, appending another function at the end changes the “constant result”: in the general case, the composition has no reason to simply stop moving around and sit at a final value.
When are we guaranteed to see the histories coalesce?
When the Markov chain is ergodic. This means it cannot be periodic, and that all states “communicate” (that is, any state can reach any other after a sufficient number of jumps through other states). More on this on Wikipedia.
For a proof of why ergodicity is sufficient, see the original paper.
Handling large state spaces and more
Wait, we said earlier that the linear algebra computations don’t cut it for big state spaces, but at every step of our backwards simulation process, we have to check whether the composition we have so far is a constant function. That’s not exactly ideal either when S has 10 million elements, is it?
Indeed it is not, but we have one more trick up our sleeve from the original paper, which requires an extra assumption on S but in exchange lets us cut down significantly on the cost of the coalescence check and handle uncountably infinite spaces (like the unit interval [0, 1]).
The additional assumption consists in requiring that S can be equipped with a partial order (S, \leq), along with minimal and maximal elements s_{\text{min}} and s_{\text{max}} for that order, and that our random functions are monotone with respect to that order.
If I oversimplify this to total order + explicit minimal and maximal elements, we could get the following code:
-- the 'trans'ition distribution must hand us monotone functions for this to work
coupleFromPastMonotone :: (Ord s, Monad m) => s -> s -> Prob m (s -> s) -> Prob m (s -> s)
coupleFromPastMonotone emin emax trans = go id
  where
    go f
      | f emin == f emax = return f
      | otherwise = do
          new_f <- trans
          go (f . new_f)
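For a state type that comes with a Bounded instance, one could then wrap this as follows (limitMonotone is a hypothetical helper; it is only valid when the sampled functions really are monotone, which our editorChange functions are not under the derived Ord):

limitMonotone :: (Ord s, Bounded s, Monad m) => Prob m (s -> s) -> Prob m (s -> s)
limitMonotone = coupleFromPastMonotone minBound maxBound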
There is even a way to improve performance further, by doubling the size of the jumps that we make into the past; see the original paper.
Finally, this work inspired a lot of extensions to these ideas, which can work in quite general settings. Some extend the state to a vector instead of a single value, transforming it with random matrices and whatnot, while others support extremely general forms of state spaces and functions.
The best survey of all those methods that I found is Iterated Random Functions by Persi Diaconis and David Freedman.
Conclusion
This cute little technique escalated into a whole range of algorithms to sample from the limiting distribution of a wide array of Markov chains. To me, it stands in stark contrast with the simplicity of what I imagine to be the conversation that birthed it all.
- Hey, any clue what happens if we just flip the arrows?
- No idea, let’s take a look…
Reading list
- The original Propp & Wilson paper
- Iterated Random Functions
- Coalescence of Markov Trajectories
- Dan Piponi’s post (unfortunately has LaTeX formatting issues)
- Lecture notes/scribbles from an MIT course: 1 2 3