Main point: Programs can be interpreted as probability distributions using the probability monad. Among other things, this provides a foundation for statistical models that are easier to understand.
Updated February 2021 for clarity.
A probability distribution is, in fact, a program. Thinking about it in this way makes distributions considerably more understandable and easier to work with. A probabilistic model then becomes simply a program that makes all its assumptions explicit: no more hidden assumptions that the data adheres to certain distributions.
To lift programs into probability distributions we use the probability monad. It has long been known that probability distributions form monads. This article builds on Ramsey and Pfeffer's work on Stochastic Lambda Calculus and Monads of Probability Distributions.
The source for this article is available as a gist on GitHub.
Probabilities
I tend to think of a probability distribution as a black box: a mysterious thing we will never be able to completely open. However, we do have different lenses that let us see aspects of the distribution and understand it. These lenses, from here on called queries, are the support query, the expectation query, and the sample query.
The support of a probability distribution is simply the set of elements in the distribution that are probable, i.e. the elements with a non-zero probability of occurring.
The expectation query is implemented as a function that takes a probability distribution and a function representing an event, and returns the expected value of that event. For the famous dice example, where we want the expectation of rolling a six, the event function returns 1 for Six and 0 for everything else. Taking the mean yields 1/6, which is the expectation of rolling a six.
The sample query yields one of the elements from the support, drawn with its corresponding probability. Individual samples are usually not that interesting in small quantities; we need to be able to draw thousands. But when we do, we are actually able to say interesting things about distributions that are recursive or have infinite support, which happens easily when we lift entire programs into a probability distribution.
When constructing probability distributions we need an intuitive way to tap into randomness, and we need an operation for building distributions. In this work we start from a single function called choose, a constructor that makes a binary choice between two distributions with a given probability. It is easy to understand: a coin toss, for instance, is implemented as choose 0.5 (return Tail) (return Head). From here it is merely a matter of convincing oneself that we can approximate all distributions with arbitrary precision using that single operator.
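As a sketch of how this could look in code, the choose primitive can be declared as a type class, following the paper (the ProbabilityMonad class also shows up in the instances later in this article); the Coin type and the coin helper below are illustrative only.

type Probability = Double

-- The single primitive for building distributions: 'choose p d1 d2'
-- draws from d1 with probability p and from d2 with probability 1 - p.
class Monad m => ProbabilityMonad m where
  choose :: Probability -> m a -> m a -> m a

data Coin = Head | Tail deriving (Eq, Show)

-- A fair coin toss; the two outcomes are lifted into point
-- distributions with 'return' before choosing between them.
coin :: ProbabilityMonad m => m Coin
coin = choose 0.5 (return Tail) (return Head)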
A Monad
A monad is a mathematical structure with its roots in category theory. Monads can be understood as an abstraction over a computational context; in our case that context is the concrete probabilities, i.e. the possible outcomes and their explicit probabilities.
The monad needs to implement two operations: the bind operation for composition and the return operation to lift values into the monad.
In the Haskell syntax these functions are declared as follows
(>>=)  :: m a -> (a -> m b) -> m b   -- pronounced bind
return :: a -> m a
In terms of probability distributions these should be read, respectively, as draw and as make this single value a distribution. Admittedly, the distribution return 4 is not particularly interesting, as we will always draw the number 4 with 100% probability.
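To illustrate the draw reading of bind, here is a small generic example; sumOfTwo is a made-up helper, not something from the gist.

-- Read 'd >>= k' as: draw a value x from d, then continue with the
-- distribution k x. Composing two draws gives the distribution of
-- the sum of two independent rolls.
sumOfTwo :: Monad m => m Int -> m Int
sumOfTwo roll =
  roll >>= \x ->
  roll >>= \y ->
  return (x + y)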
On top of the monad operations, the probability distribution furthermore implements the choose function introduced above; this is the function we use to actually construct distributions.
Class Hierarchy
To get an overview of the implementation of the probability monad it makes sense to look at the class hierarchy we are considering.
The following figure shows the monads we are implementing.
The lowermost three boxes are the most interesting. Those are the queries, the operations we can use on a distribution to make sense of it.
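As a textual supplement to the figure, the three query classes can be sketched as follows. The SupportMonad and ExpMonad names match the instances used later in this article; the SamplingMonad name and the generator-threading signature follow the paper and assume the random package, so the gist may differ slightly.

import System.Random (RandomGen)

-- The support query: every value with a non-zero probability.
class Monad m => SupportMonad m where
  support :: m a -> [a]

-- The expectation query: the expected value of an event,
-- given as a measure function.
class Monad m => ExpMonad m where
  expectation :: (a -> Double) -> m a -> Double

-- The sample query: draw a single value, threading a random generator.
class Monad m => SamplingMonad m where
  sample :: RandomGen g => m a -> g -> (a, g)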
The Expectation Monad
This section elaborates on a measure-theoretic approach to the probability monad. We represent the distribution as a continuation that takes a measure function and returns an expectation.
First, we will take a look at our monad type.
-- Probability Monad
newtype PExp a = PExp ((a -> Double) -> Double)
It is worth looking into what happens here. We have the constructor PExp, which wraps a function that takes a measure function (of type a -> Double) and returns the expectation.
The monadic structure of probability distributions is in this setting implemented as follows.
instance Monad PExp where
  return x = PExp (\h -> h x)
  (PExp d) >>= k = PExp (\h ->
    let apply (PExp f) arg = f arg
        g x = apply (k x) h
    in d g)
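One practical remark: with GHC 7.10 and later a Monad instance also needs Functor and Applicative instances. Standard boilerplate along these lines (not part of the original text) makes the code compile; the same applies to the GADT-based P type introduced further down.

import Control.Monad (ap, liftM)

instance Functor PExp where
  fmap = liftM

instance Applicative PExp where
  pure x = PExp (\h -> h x)
  (<*>)  = ap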
From here the expectation monad is easily implemented as the whole monad type is built around it.
instance ExpMonad PExp where
  expectation h (PExp d) = d h
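As a small usage sketch (fairCoin and headsProb are illustrative names, not from the gist), a fair coin can be encoded directly as a continuation, and the expectation query then evaluates the indicator of the coin coming up heads.

-- A fair coin written directly in the continuation representation.
fairCoin :: PExp Bool
fairCoin = PExp (\h -> 0.5 * h True + 0.5 * h False)

-- Expectation of the "is heads" indicator; evaluates to 0.5.
headsProb :: Double
headsProb = expectation (\b -> if b then 1 else 0) fairCoin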
We could now start to do experiments based on this, but we would much rather like to play with support and sample as well. These, however, are quite difficult to implement on top of the PExp type, so we are going to attack the problem from another angle.
Generalizing
The above implementation of the monad type is not well suited for anything other than the expectation query. To make something more general, we will try to stay true to the paper and implement a type that simply records how the distribution was defined, as data:
data P a where
  R :: a -> P a
  B :: P a -> (a -> P b) -> P b
  C :: Probability -> P a -> P a -> P a
The quick reader will see that we now use GADTs instead of ordinary ADTs. This is due to the bind constructor B: its result type, P b, differs from the type of its first argument, P a, which cannot be expressed with an ordinary ADT.
In the instances for the monad and the probability monad we simply defer everything to the constructors, which are then used directly by the queries.
-- P is a monad
instance Monad P where
  return x = R x
  d >>= k  = B d k

-- P is a probability monad
instance ProbabilityMonad P where
  choose p d1 d2 = C p d1 d2
This allows the definitions of the queries to be copied directly from the paper. In addition to the expectation query we had before, we can now also do support and sampling.
instance SupportMonad P where
  support (R x)       = [x]
  support (B d k)     = concat [support (k x) | x <- support d]
  support (C p d1 d2) = support d1 ++ support d2
...
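For completeness, the remaining query instances can be sketched along the same lines as in the paper; the sampling instance assumes the SamplingMonad class sketched earlier and the random package, so the exact code in the gist may differ.

instance ExpMonad P where
  expectation h (R x)       = h x
  expectation h (B d k)     = expectation (\x -> expectation h (k x)) d
  expectation h (C p d1 d2) = p * expectation h d1 + (1 - p) * expectation h d2

-- 'randomR' comes from System.Random; the generator is threaded
-- through the structure of the distribution.
instance SamplingMonad P where
  sample (R x)       g = (x, g)
  sample (B d k)     g = let (x, g') = sample d g in sample (k x) g'
  sample (C p d1 d2) g = let (r, g') = randomR (0, 1) g
                         in sample (if r < p then d1 else d2) g'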
For the full source I once again refer to the gist on GitHub.
Making Distributions
In this section we will take a look at how to construct some interesting distributions and at how the queries work on them.
As is often the case when working with probabilities, we will look at dice. To make it easy we define a data type deriving Enum for easy enumeration.
data Dice = One | Two | Three | Four | Five | Six deriving (Enum, Eq, Show, Read, Ord)
From here we can make a uniform distribution over the sides of a die as follows.
dist :: P Dice
dist = uniform [One .. Six]
The uniform function is a simple helper that builds a uniform distribution over a list, using the choose function and the length of the list.
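While the exact definition in the gist may differ, one way uniform could be written in terms of choose is the following (it is partial on the empty list).

-- Choose the head with probability 1/n and recurse on the tail, which
-- is itself a uniform distribution over the remaining n - 1 elements.
uniform :: ProbabilityMonad m => [a] -> m a
uniform [x]      = return x
uniform (x : xs) = choose (1 / fromIntegral (length xs + 1)) (return x) (uniform xs)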
A support query can be performed as follows.
example01a =
  let dist :: P Dice
      dist = uniform [One .. Six]
  in support dist
This yields the result [One,Two,Three,Four,Five,Six]. To create a posterior distribution from a prior, we use bind. This translates into elegant and easily readable dependent distributions.
example02a =
  let dist :: P Dice
      dist = do
        d <- uniform [One .. Six]
        return (if d == Six then One else d)
  in support dist
This yields the result [One,Two,Three,Four,Five,One]. Note how One appears twice. This makes immediate sense from the definition of the distribution, even though we might have expected duplicates to be removed. Removing them is, however, quite easily done in Haskell.
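For instance, using nub from Data.List (example02b is a made-up name, not from the gist):

import Data.List (nub)

-- Deduplicate the support computed in example02a; nub keeps the
-- first occurrence of each element.
example02b :: [Dice]
example02b = nub example02a   -- [One,Two,Three,Four,Five]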
The last example in the gist is one of the more interesting ones. The distribution it defines is potentially infinite: intuitively, at each step it either stops with probability one half or counts one up and continues.
walk x = do
  bit <- uniform [True, False]
  if bit
    then return x
    else walk (x + 1)
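The examples referenced below are not spelled out in the article; example03a presumably looks roughly like this sketch, with walk used at the concrete type P.

example03a :: [Int]
example03a = support (walk 0 :: P Int)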
So what happens when we throw the various queries at it?

The support query should return a list of all potential outcomes of the distribution, and so it does. Running example03a returns

[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,^CInterrupted.

If not stopped, it would continue counting up forever.
An important remark here concerns the order of True and False in the inner distribution. If we switch it so that the inner distribution reads uniform [False, True], the query never returns anything. Intuitively, the recursive branch is then enumerated first, and since it never terminates, the terminating branch is never reached. With the current ordering, the walk function produces a list that is infinite but can still be consumed lazily, one element at a time.
Secondly, we have the expectation query. Issuing this yields an infinite computation without any result, because we attempt to unroll the full structure to compute the expectation of an event.
Lastly we have the sample query. This query works completely as expected, yielding a list of samples. Issuing example03c returns [0,1,3,0,0,2,0,1,0,1], which intuitively makes sense.
Though the expectation query is not feasible with the current implementation, we can still get an idea about the expectations using the Monte Carlo technique. Sampling 10,000 elements of the walk distribution and counting the frequencies gives us the following.
> mc
[(0,4937), (1,2483), (2,1294), (3,630), (4,326), (5,159), (6,87), (7,34), (8,27), (9,10), (10,5), (11,5), (12,2), (15,1)]
Which is what we would expect.
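The frequency count itself is straightforward; a small, self-contained helper along these lines would do (the name frequency is made up, and mc in the gist may be implemented differently).

import Data.List (group, sort)

-- Count how often each outcome occurs in a list of samples.
frequency :: Ord a => [a] -> [(a, Int)]
frequency = map (\xs -> (head xs, length xs)) . group . sort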
Concluding Remarks
In this article we have discussed Ramsey and Pfeffer's paper. It provides a programming-language-based foundation for understanding distributions and probabilities. It is interesting because it makes languages like Hakaru much easier to understand, and it greatly simplifies the whole world of probabilities.