Week 07
Markov Chain Monte Carlo Methods | Metropolis-Hastings algorithm
To make predictions, we compute the predictive posterior distribution based on p(w|t), the posterior distribution:
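A sketch of the standard form, assuming t̂ denotes a new target value (conditioning on inputs is left implicit; this notation is not in the original):

$$p(\hat{t} \mid \mathbf{t}) = \int p(\hat{t} \mid \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w}$$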
BUT the integral is analytically intractable → approximate it!
Goal: in general terms, estimate the expectation
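In general notation, with a target distribution p(z) and a function f of interest:

$$\mathbb{E}_{p(z)}[f(z)] = \int f(z)\, p(z)\, dz$$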
Indeed, if we rewrite the predictive posterior distribution:
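we recognize exactly such an expectation, with f(w) = p(t̂ | w) and the posterior p(w | t) playing the role of p(z) (notation as assumed above):

$$p(\hat{t} \mid \mathbf{t}) = \mathbb{E}_{p(\mathbf{w} \mid \mathbf{t})}\!\left[\, p(\hat{t} \mid \mathbf{w}) \,\right]$$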
[Definition] Monte-Carlo estimator, a random variable
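A standard way to write it, given S i.i.d. samples z^(1), …, z^(S) from p(z):

$$\hat{f}_S = \frac{1}{S} \sum_{s=1}^{S} f\!\left(z^{(s)}\right), \qquad z^{(s)} \overset{\text{i.i.d.}}{\sim} p(z)$$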
What is the relation of the MC estimator with our target expectation?
→ the Monte Carlo estimator is unbiased
→ the variance decreases as 1/S
Provided the expectation exists, the MC estimator converges to the desired expectation by the law of large numbers, i.e.
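In formulas (standard results for i.i.d. samples; the last is the law-of-large-numbers statement):

$$\mathbb{E}\big[\hat{f}_S\big] = \mathbb{E}_{p(z)}[f(z)], \qquad \operatorname{Var}\big[\hat{f}_S\big] = \frac{1}{S}\operatorname{Var}_{p(z)}[f(z)], \qquad \hat{f}_S \xrightarrow{\,S \to \infty\,} \mathbb{E}_{p(z)}[f(z)]$$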
HOW TO OBTAIN I.I.D. SAMPLES FROM ANY GIVEN DISTRIBUTION?
Existing methods: rejection sampling, importance sampling, ancestral sampling.
Let's forget about i.i.d. samples: use random walks with specific properties instead.
[Definition] 1st order Markov chain
"If you know about the present, forget about the past"
[Definition] Transition kernel T
A transition kernel is homogeneous if it is the same for all m.
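In the usual notation, the kernel is the conditional distribution between successive states:

$$T_m\!\left(z^{(m)}, z^{(m+1)}\right) \equiv p\!\left(z^{(m+1)} \mid z^{(m)}\right)$$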
[Definition] Invariance with respect to Markov chain
A distribution p(z) is invariant wrt. a Markov chain if no step of the chain changes the distribution.
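For a homogeneous chain with kernel T, this reads (the integral becomes a sum for discrete states):

$$p(z) = \int T(z', z)\, p(z')\, dz'$$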
Idea: create a specific random walk that explores the space of the target distribution, where high-density regions are visited more frequently
Markov Chain Monte Carlo [MCMC] methods
refer to the design of a Markov chain such that
the target distribution is invariant wrt. the chain.
BUT HOW TO DESIGN THIS MARKOV CHAIN, IN PRACTICE?
Goal: sample from the target distribution p(θ)
INITIALIZATION
Define a symmetric proposal distribution q (= transition kernel)
For this example, we will use a Gaussian proposal
with parameter τ, the variance of the normal distribution (see the formula after this block)
Define an initial value for θ
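The Gaussian proposal mentioned above, written out (centering the candidate at the current value θ is an assumption, consistent with the step-size interpretation of τ below):

$$q(\theta^* \mid \theta) = \mathcal{N}(\theta^*;\; \theta,\; \tau)$$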
REPETITIVE PROCESS
For k=1 to K:
1. Generate candidate sample using proposal distribution, based on previous θ
2. Compute acceptance probability
Note: by using a symmetric proposal, we can simplify the ratio of posteriors.
3. Sample from a uniform distribution on the unit interval, then define the new θ
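The three steps in formulas, with θ_k the current state and θ* the candidate; since q is symmetric, it cancels in the acceptance ratio:

$$\theta^* \sim q(\theta^* \mid \theta_k), \qquad \alpha = \min\!\left(1,\; \frac{p(\theta^*)}{p(\theta_k)}\right), \qquad u \sim \mathcal{U}(0,1), \quad \theta_{k+1} = \begin{cases} \theta^* & \text{if } u \le \alpha \\ \theta_k & \text{otherwise} \end{cases}$$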
WHAT IS THE WARM-UP PHASE?
To get rid of the influence of the initialization and focus on the high-density region of θ, we discard the first part of the chain.
Here: we discard the first 50% of iterations
The parameter τ controls the noise we add to the previous sample to generate a new candidate sample → step-size
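Putting the whole procedure together, a minimal runnable sketch in Python (NumPy assumed; the standard-normal target used here is a toy choice for illustration, not from the original):

```python
import numpy as np

def metropolis(log_p, theta0, tau, K, warmup_frac=0.5, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal.

    log_p       -- log of the (possibly unnormalized) target density p(theta)
    theta0      -- initial value of theta
    tau         -- variance of the Gaussian proposal (the step size)
    K           -- number of iterations
    warmup_frac -- fraction of initial samples discarded as warm-up
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = np.empty(K)
    for k in range(K):
        # 1. Candidate: Gaussian noise with variance tau around the current theta
        candidate = theta + rng.normal(0.0, np.sqrt(tau))
        # 2. Acceptance probability, on the log scale for numerical stability;
        #    the symmetric proposal cancels, leaving the ratio of target densities
        log_alpha = min(0.0, log_p(candidate) - log_p(theta))
        # 3. Draw u ~ Uniform(0, 1) and accept the candidate if u <= alpha
        if np.log(rng.uniform()) < log_alpha:
            theta = candidate
        samples[k] = theta
    # Warm-up phase: discard the first portion of the chain (50% by default)
    return samples[int(warmup_frac * K):]

# Toy usage: target is a standard normal, known only up to a constant
draws = metropolis(lambda t: -0.5 * t**2, theta0=5.0, tau=1.0, K=10_000)
print(draws.mean(), draws.std())  # ≈ 0 and ≈ 1
```

Note the step-size trade-off: a small τ gives high acceptance but slow exploration, while a large τ proposes bold moves that are often rejected.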
PROS
Strong mathematical guarantee: the chain converges to the exact target distribution (asymptotically)
Easy implementation
Easy to evaluate different models
CONS
Convergence may be very slow for some difficult distributions
Proposal distribution may require tuning
Slow for large datasets (each iteration evaluates the posterior on the full dataset)