
Deciding your future editor, in the past

Introduction

There’s an incredibly cute technique for sampling from the stationary distribution of a Markov chain, when it exists. This quick article gives a Haskell-flavoured overview. I discovered this trick in a post by the inimitable Dan Piponi.

Some Haskell throat-clearing.

{-# LANGUAGE LambdaCase #-}
module Main where

import System.Random.MWC.Probability

-- between 0 and 1
udbl :: Prob IO Double
udbl = uniform

replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n fa = sequenceA (replicate n fa)

Jumping between editors

Let’s start with a problem: we want to model one aspect of a developer’s life, namely how we tend to cycle through various editors randomly, only sometimes settling down on one forever.

data Editor = Emacs | Vim | VsCode deriving (Eq, Show, Ord, Enum, Bounded)

This is going to be our state space: S = \{ \text{Emacs}, \text{Vim}, \text{VsCode} \}.

A Markov chain over such a state space looks something like this:

[Diagram: the three states Emacs, Vim and VsCode, with arrows labelled by transition probabilities: Emacs stays on Emacs with probability 1; Vim stays on Vim with probability 0.7, jumps to VsCode with probability 0.1 and to Emacs with probability 0.2; VsCode stays on VsCode with probability 0.3, jumps to Vim with probability 0.2 and to Emacs with probability 0.5.]

Putting the unrealistic probabilities aside for a second: the states are the circles, and the “probabilistic jumps” are the arrows, with the labels expressing the probability of following that jump if the current state is the source of the arrow. For our process to be a Markov chain, it needs to have the so-called Markov property, meaning that the jumps can only depend on the current state, not any of the states we visited previously during our “jumping-around”.

Note: the numbers must be such that the probabilities of all the possible jumps one can do from any given state sum up to 1.

For each state s \in S, we effectively have a probability distribution over S of where we can go next. Since we only have 3 states, we can for example throw the 3 probabilities for jumping out of Emacs into a vector: \mathbf{x}_{\text{emacs}} = \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix}.

We can even “concatenate” the vectors for Emacs, Vim and VsCode so as to get a 3x3 matrix:

P = \begin{pmatrix} 1 & 0.2 & 0.5\\ 0 & 0.7 & 0.2\\ 0 & 0.1 & 0.3 \end{pmatrix}

The first column shows the probabilities of staying with Emacs or going to Vim and VsCode respectively, if we are currently using Emacs. The same logic applies to the second and third columns, with respectively Vim and VsCode as the current editor.

Similarly, the first row will always indicate the probability of reaching Emacs, second row Vim, third row VsCode.

Finally, each column adds up to 1. Such matrices are called stochastic matrices and have quite a number of interesting properties and applications - more on WP.

If we have a probability distribution over our editors, as a vector, e.g

\mathbf{x} = \begin{pmatrix}p_{\text{emacs}} \\ p_{\text{vim}} \\ p_{\text{vscode}}\end{pmatrix}

saying that at some point in time we have a probability p_{\text{emacs}} of being in the Emacs state etc, then you can multiply P by the vector and get an “updated distribution”, reflecting the probabilities of using each editor after one step of our Markov chain, but starting with the prior/assumptions on the current state given by \mathbf{x}.

For example, let’s say a developer friend of yours has never quite told you which editor they use, perhaps fearing mockery or trolling. But based on what you know, you can come up with some guesstimates of the likelihood that they currently use Emacs, Vim or VsCode: say, a 70% chance they use VsCode, 20% Vim, 10% Emacs.

\mathbf{x_{\text{friend}}} = \begin{pmatrix}0.1 \\ 0.2 \\ 0.7\end{pmatrix}

Then what happens when we put our friend through the Markov chain?

P \cdot \mathbf{x_{\text{friend}}} = \begin{pmatrix} 1 & 0.2 & 0.5\\ 0 & 0.7 & 0.2\\ 0 & 0.1 & 0.3 \end{pmatrix} \begin{pmatrix}0.1 \\ 0.2 \\ 0.7\end{pmatrix} = \begin{pmatrix}0.49 \\ 0.28 \\ 0.23\end{pmatrix}

We can already see “the Emacs effect” kicking in.
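
To make this update concrete in code, here is a minimal sketch (not from the original post; Dist, transProb and step are purely illustrative names) that writes out P \cdot \mathbf{x} by hand, with the transition probabilities read off the editorChange code defined in the next section:

type Dist = [(Editor, Double)]

-- probability of jumping from the first editor to the second;
-- these are exactly the entries of P
transProb :: Editor -> Editor -> Double
transProb Emacs  Emacs  = 1
transProb Emacs  _      = 0
transProb Vim    Vim    = 0.7
transProb Vim    VsCode = 0.1
transProb Vim    Emacs  = 0.2
transProb VsCode VsCode = 0.3
transProb VsCode Vim    = 0.2
transProb VsCode Emacs  = 0.5

-- one step of the chain applied to a distribution: exactly P * x
step :: Dist -> Dist
step x = [ (to, sum [ p * transProb from to | (from, p) <- x ])
         | to <- [minBound .. maxBound] ]

-- step [(Emacs, 0.1), (Vim, 0.2), (VsCode, 0.7)]
--   ==> [(Emacs, 0.49), (Vim, 0.28), (VsCode, 0.23)]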

In Haskell

A natural way to encode this in Haskell, using e.g. Jared Tobin’s mwc-probability package (I highly recommend his blog to anyone interested in functional programming and probabilistic programming), could be:

editorChange :: Editor -> Prob IO Editor
editorChange e = do
    x <- udbl
    return $ case e of
        Emacs -> Emacs
        Vim -> if x < 0.7 then Vim else if x < 0.8 then VsCode else Emacs
        VsCode -> if x < 0.3 then VsCode else if x < 0.5 then Vim else Emacs

For the sake of this article, we will use a slightly different representation, where we will sample a state transition function f : S \rightarrow S instead of sampling a new state given the current one. Both formulations are equivalent, but the function sampling view will prove useful.

editorChange :: Prob IO (Editor -> Editor)
editorChange = do
    x <- udbl
    return $ \case
        Emacs -> Emacs
        Vim -> if x < 0.7 then Vim else if x < 0.8 then VsCode else Emacs
        VsCode -> if x < 0.3 then VsCode else if x < 0.5 then Vim else Emacs
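
As a quick, hypothetical usage example of this function-sampling view (demoOneStep is not part of the original code), we can draw one transition function and apply it to every editor at once:

demoOneStep :: IO [Editor]
demoOneStep = do
    gen <- create                   -- PRNG state, from the mwc-probability import
    f   <- sample editorChange gen  -- draw one transition function
    return (map f [Emacs, Vim, VsCode])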

Where to?

We’re usually very interested in the behaviour of such Markov chains as we perform jumps one after another. Does the process tend towards some sort of limit, a so-called stationary distribution?

Building atop our “sampling a state transition function” view from earlier, we can observe that performing n steps amounts to sampling n state transition functions f_1, ..., f_n : S \rightarrow S and composing them: f_n \cdot ... \cdot f_1 : S \rightarrow S, where the transition function for the latest step f_n gets applied last.
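
Spelled out: step k computes s_k = f_k(s_{k-1}), so after n steps

s_n = f_n(f_{n-1}(\cdots f_1(s_0) \cdots)) = (f_n \cdot \ldots \cdot f_1)(s_0)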

In our example, we might start on VsCode, go back and forth between Vim and VsCode for a while, but eventually jump to Emacs for the first time and stay there, obviously.

VsCode → Vim → VsCode → Vim → VsCode → Emacs

Many other “paths” are possible, like staying on Vim for 10000 steps only to switch to VsCode for the rest of time on the next one, or just hopping between Vim and VsCode forever, or even staying forever on one of those. But can we somehow characterize the likelihood of using a certain editor after n steps as n \rightarrow \infty ? And more importantly, can we sample from that distribution?

The spectral approach

One way to solve this is to simply solve the following equation for \mathbf{x}:

P\mathbf{x} = \mathbf{x}

where \mathbf{x} = \begin{pmatrix}x_1 \\ x_2 \\ x_3\end{pmatrix} \in \mathbf{R}^3.

Such an \mathbf{x} would be a probability distribution over the state space S that is “left untouched by P”. That is, we’re looking for a distribution over our editors such that running one more step of our editor-jumping process doesn’t change the likelihood of being in Emacs, Vim or VsCode: we’d get the same distribution back.

One way to find such a vector is to use your favorite linear algebra tools to find an eigenvector for the eigenvalue of 1: the distribution we’re after is a multiple of such an eigenvector, with all the entries summing up to 1. (There may be more than one stationary distribution though.)

For our example, this would help us find out that the vector \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix} fits the bill.
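
We can check this directly: multiplying P by this vector simply picks out the first column of P, which is again \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix}:

P \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix} = \begin{pmatrix}1 \cdot 1 + 0.2 \cdot 0 + 0.5 \cdot 0 \\ 0 \cdot 1 + 0.7 \cdot 0 + 0.2 \cdot 0 \\ 0 \cdot 1 + 0.1 \cdot 0 + 0.3 \cdot 0\end{pmatrix} = \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix}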

The interpretation of this vector is that if, before doing our editor-jumping step, we know that the current editor is Emacs, then our Markov chain isn’t going to send us anywhere else. (By design, mwahahaha.)

While we can’t leave Emacs if we reach it, there’s a nonzero chance to leave the other editors (including for Emacs), so eventually we’re bound to go to Emacs and stay there. Therefore, the “limit” of our editor jumping process is a distribution that just gives you Emacs all the time…

Note however that this story is only nice when the state space S is finite, and ideally not too large. If you have, say, 1 million different states, we’re talking 10^6 \times 10^6 matrices of Doubles, which if I’m not mistaken is about 8TB of data if we store all the coefficients. Even if we assume the matrices are sparse, let’s say 10% non-zero values, that’s still going to be 800GB and an insane amount of work.
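
As a quick sanity check on that figure:

10^6 \times 10^6 \times 8\ \text{bytes} = 8 \times 10^{12}\ \text{bytes} = 8\ \text{TB}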

Forward simulation

In those cases where the exact solving approach above isn’t practical, a plan B is badly needed. Could we instead just run the editor-jumping process enough times and try to observe that it stabilizes around some values after a (possibly large) number of iterations? This sort of iterative approach to solving problems is quite common after all.

Yes we could. Let’s see how.

Alright, so let’s code up a function that runs n steps of the chain by sampling f_1, …, f_n and composing them as described earlier, f_n \cdot ... \cdot f_1:

forwardN :: Monad m => Int -> Prob m (a -> a) -> Prob m (a -> a)
forwardN n sampleF = go 0 id
    where go k fk | k < n = do f <- sampleF
                               -- the newly sampled f is the latest step, so it is
                               -- applied last, i.e. composed on the left
                               go (k+1) (f . fk)
                  | otherwise = return fk

Perfect. So what n should we give, for our editor example, to get close enough to or exactly the stationary distribution?

Well… no idea. The best we can do is to pick “a large n” and hope that’s good enough. Yuck. What can we do?
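
For instance, here is one hypothetical way (tryForward is not from the original post) of doing just that, picking n = 100 and hoping, with the definitions above:

tryForward :: IO Editor
tryForward = do
    gen <- create
    f   <- sample (forwardN 100 editorChange) gen  -- compose 100 sampled transitions
    return (f Vim)                                 -- ... and run them from Vim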

Backward simulation?

Here’s the forward simulation we just described, as a diagram:

[Diagram: initial state → f_1 → f_2 → … → f_n → state after n steps; each f_i is freshly sampled.]

What if we flip the arrows between the f_is and therefore compose them the other way around? What happens?

[Diagram: the same boxes with the arrows flipped: the initial state now feeds into f_n first, then f_{n-1}, …, and finally f_1 produces the state after n steps; each f_i is still freshly sampled.]

The first thing we can say is that this approach is in some sense equivalent to the forward simulation of n steps: we are drawing n independent functions from the same distribution and composing them, which means the resulting composition has the same distribution in both cases.

The difference lies in the order, and how that affects the entire computation. Previously, if we had started with initial state s_0 and performed n steps, landing on state s_n : S, then performing one more step would just amount to drawing an f_{n+1} and computing s_{n+1} = f_{n+1}(s_n). We’re simply running our Markov chain normally, one step further.

But what’s happening here? When we perform one more step in this backward scheme, we add an f_{n+1} to the right of the chain, i.e at the beginning! This new f_{n+1} gets applied to the initial state, and the resulting state will then go into the composition of functions we got for the previous step. We’re effectively restarting or changing the course of the Markov chain at every step! As opposed to extending an ongoing run. Just by flipping the direction of our composition!

[Diagram: the initial state now feeds into the newly sampled f_{n+1} first, then into f_n, …, and finally f_1 produces the resulting state; the new sample is applied before everything else.]

backwardN :: Monad m => Int -> Prob m (a -> a) -> Prob m (a -> a)
backwardN n sampleF = go 0 id
    where go k fk | k < n = do f <- sampleF
                               go (k+1) (fk . f)
                               -- in forwardN earlier, we were doing:
                               -- go (k+1) (f . fk)
                               -- so we really do just flip the composition!
                  | otherwise = return fk

Another way to phrase this is: in this new approach, when increasing n by 1 we generate a “new initial state”, f_{n+1}(s_0) : S, with a new random seed, for the n-steps simulation function f_1 \cdot ... \cdot f_n : S \rightarrow S we already have around.

OK, this is fun, but how is it useful? We still have no idea for how long we should go regardless of the direction, and how fast we can approach or reach the stationary distribution. How does “restarting” help at all with any of this?

And this is where my mind was blown…

A look at our random functions S \rightarrow S

We have been talking about those sampled f_i : S \rightarrow S functions a lot, but what do they look like? Here’s the code that generates them, as a reminder.

editorChange :: Prob IO (Editor -> Editor)
editorChange = do
    x <- udbl
    return $ \case
        Emacs -> Emacs
        Vim -> if x < 0.7 then Vim else if x < 0.8 then VsCode else Emacs
        VsCode -> if x < 0.3 then VsCode else if x < 0.5 then Vim else Emacs

Depending on the value of x, the uniform Double between 0 and 1 that we sample, we get functions that behave very differently.

We can picture all the possible functions editorChange can generate, depending on the interval where x falls, along with the probability of being sampled.

[Diagram: the five possible functions, one per interval of x, with their probabilities:
f1 (0 <= x < 0.3, 30%): the identity, every editor maps to itself;
f2 (0.3 <= x < 0.5, 20%): VsCode maps to Vim, everything else stays put;
f3 (0.5 <= x < 0.7, 20%): VsCode maps to Emacs, everything else stays put;
f4 (0.7 <= x < 0.8, 10%): Vim maps to VsCode, VsCode maps to Emacs, Emacs stays put;
f5 (0.8 <= x <= 1, 20%): every editor maps to Emacs.]

The first function maps each editor to itself: this is the identity function. The last function maps everything to Emacs. What a lovely function. As per the code in editorChange, we can see that no function even attempts mapping Emacs to anything else.
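
If you want to play with these five functions directly, here is a small hypothetical refactor (editorChangeWith and fiveFunctions are not part of the original code): it makes the dependence on x explicit, so the five functions can be enumerated by picking one representative x per interval.

editorChangeWith :: Double -> Editor -> Editor
editorChangeWith x = \case
    Emacs -> Emacs
    Vim -> if x < 0.7 then Vim else if x < 0.8 then VsCode else Emacs
    VsCode -> if x < 0.3 then VsCode else if x < 0.5 then Vim else Emacs

-- one representative value of x per interval gives f1, ..., f5
fiveFunctions :: [Editor -> Editor]
fiveFunctions = map editorChangeWith [0.1, 0.4, 0.6, 0.75, 0.9]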

Alright, so those are our players. What’s up with them?

Coalescence

Let’s picture a sample run with a few steps, and what the composition looks like.

Imagine we start by sampling f1, and then f2. With our backwards simulation scheme, this means we have to apply f2 first. Here is what applying f2 then f1 looks like:

[Diagram: the composition keeps Emacs on Emacs, keeps Vim on Vim, and sends VsCode to Vim; VsCode is never an output.]

In short:

- if we start on Emacs, we stay there
- if we start on Vim, we stay on Vim
- if we start on VsCode, we end up on Vim

(Notice that some states, here VsCode, cannot be reached no matter which initial state we pick.)

Now let’s say we sample f5 (which maps everything to Emacs), and “prepend” it to our existing chain of functions, as per our backwards simulation scheme:

[Diagram: with f5 prepended, every editor is first sent to Emacs by f5, and f2 then f1 keep it there, so the overall composition maps everything to Emacs.]

This is it. We reached a point where the outcome does not depend on the initial state, meaning we sampled from the limiting distribution. Indeed, whatever function gets prepended in front of f5, its result will get mapped to Emacs.

In such a case, we say that there is coalescence of all the trajectories.

Coalescence doesn’t necessarily require that we sample as brutal a function as f5. We could have gotten it in our example by first sampling f3 and then f4 (therefore yielding the function that first applies f4 and then f3), as shown below:

[Diagram: f4 first sends Vim to VsCode and VsCode to Emacs; f3 then sends VsCode to Emacs, so every starting editor ends up on Emacs.]

We are again in a situation where the composition is a constant function (i.e doesn’t depend on its argument), which means again that this gives us a sample from the limiting distribution (which, as a reminder, consists in giving probability 1 to Emacs and 0 to the others).

In a situation where the limiting distribution is more balanced, we could get other editors coming out as results of this process, merely by combining the right functions to “put the other states out of reach”. But in this article, we can’t escape Emacs (editorial choice). Once we land there, we stay there. We’re guaranteed to be in Emacs with probability 1 after a sufficient (but finite!) number of steps, however long it takes to sample the right functions to coalesce the trajectories. And once the trajectories have coalesced, prepending more functions doesn’t change the result.

Leveraging the trick

Let’s put in code what we discussed in the past couple of sections.

allEqual :: Eq a => [a] -> Bool
allEqual [] = True
allEqual [_] = True
allEqual (a : as) = all (== a) as

isConstant :: (Bounded a, Enum a, Eq a) => (a -> a) -> Bool
isConstant f = allEqual $ map f (enumFromTo minBound maxBound)

coupleFromPastEB :: (Eq s, Enum s, Bounded s, Monad m) => Prob m (s -> s) -> Prob m (s -> s)
coupleFromPastEB trans = go id (0 :: Int)
    where go f k | isConstant f = return f
                 | otherwise    = do new_f <- trans
                                     go (f . new_f) (k+1)

editorLimit :: Prob IO (Editor -> Editor)
editorLimit = coupleFromPastEB editorChange

That’s it! editorLimit samples from the limiting distribution of our editor-jumping process (wrapped up as a constant function). And sure enough, we get plenty of Emacs.

main :: IO ()
main = do
    gen <- create
    let n = 100000
    xs <- fmap (map ($ Vim)) . flip sample gen $ replicateA n editorLimit
    putStrLn "is Emacs the best editor?"
    print $ all (==Emacs) xs
ghci> main
is Emacs the best editor?
True

Maths don’t lie.

Why doesn’t this work in the forward simulation?

The coalescence behaviour does not emerge when doing the forward simulation. Indeed, we append the newly sampled functions at the end of the chain. So even if at some point the composition we have is a constant function, appending another function at the end moves that constant value around, and the result therefore has, in the general case, no reason to stop moving and settle on a final value.
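
In symbols: if the composition after n forward steps happened to be a constant function with value c, that is F_n = f_n \cdot \ldots \cdot f_1 with F_n(s) = c for every s, then appending f_{n+1} gives

F_{n+1} = f_{n+1} \cdot F_n \qquad \text{so} \qquad F_{n+1}(s) = f_{n+1}(c) \text{ for every } s

and the constant value itself just keeps taking ordinary steps of the chain, never settling anywhere.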

When are we guaranteed to see the histories coalesce?

When the Markov chain is ergodic. Roughly, this means it cannot be periodic, and that all states “communicate” (that is, any state can reach any other state after a sufficient number of jumps through other states). More on this on WP.

For a proof of why ergodicity is sufficient, see the original paper.

Handling large state spaces and more

Wait, we said earlier that the linear algebra computations don’t cut it for big state spaces, but at every step of our backwards simulation process, we have to check whether the composition we have so far is a constant function. That’s not exactly ideal either when S has 10 million elements, is it?

Indeed it’s not, but we have one more trick up our sleeve from the original paper, which requires one extra assumption on S but in exchange lets us cut down significantly on the cost of the coalescence check, and even handle uncountably infinite spaces (like the unit interval [0, 1]).

The additional assumption consists in requiring that S can be equipped with a partial order (S, \leq), along with minimal and maximal elements s_{\text{min}} and s_{\text{max}} for that order, and that our random functions are monotone with respect to that order.

If I oversimplify this to total order + explicit minimal and maximal elements, we could get the following code:

-- the 'trans'ition distribution must hand us monotone functions for this to work
coupleFromPastMonotone :: (Ord s, Eq s, Monad m) => s -> s -> Prob m (s -> s) -> Prob m (s -> s)
coupleFromPastMonotone emin emax trans = go id
    where go f | f emin == f emax = return f
               | otherwise        = do new_f <- trans
                                       go (f . new_f)

There is even a way to improve performance further, by doubling the size of the jumps we make into the past at each iteration, see the original paper.
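
To give a flavour of that idea, here is a rough sketch of mine (coupleFromPastDoubling is not from the post or the paper, and glosses over details; see the paper for the real scheme): keep the functions sampled so far, and whenever the composition is not yet constant, sample as many new functions as we already have, to be applied before the existing ones (i.e. further in the past), before checking again.

coupleFromPastDoubling :: (Eq s, Enum s, Bounded s, Monad m)
                       => Prob m (s -> s) -> Prob m (s -> s)
coupleFromPastDoubling trans = go []
    where
      go fs
        | isConstant composed = return composed
        | otherwise = do
            -- double the horizon: sample as many new functions as we already have
            new_fs <- replicateA (max 1 (length fs)) trans
            -- later elements of the list are applied first, i.e. further in the past
            go (fs ++ new_fs)
        where composed = foldr (.) id fs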

Finally, this work inspired a lot of extensions to those ideas, which can work in quite general settings. Some are about being able to extend the state to be a vector instead of just one piece, and transforming it with random matrices and what not, while others are about supporting extremely general forms of state spaces and functions.

The best survey of all those methods that I found is Iterated Random Functions by Persi Diaconis and David Freedman.

Conclusion

This cute little technique escalated into a whole range of algorithms to sample from the limiting distribution of a wide array of Markov chains. To me, it stands in stark contrast with the simplicity of what I imagine to be the conversation that birthed it all.

Reading list
