Week 09
Variational inference (KL divergence and ELBO) | Bayesian formulation of the Gaussian mixture model
Inference refers to using a model to make predictions and to derive estimates from observed data.
Examples: Maximum Likelihood, exact Bayesian inference, Laplace approximation, Markov Chain Monte Carlo (MCMC)
Variational inference can be much faster than MCMC, but without MCMC's asymptotic exactness guarantee...
Applicable to both continuous and discrete distributions, whereas the Laplace approximation only works for continuous ones
Offers an accuracy vs. speed trade-off
Goal: approximate a posterior distribution, our target
"Variational" = optimize functions
Idea: use a collection of "simple" distributions to get as close as possible to our target distribution by minimizing the distance
BUT HOW, IN PRACTICE?
Step 1: Define a variational family Q, a collection of "simple" approximating probability distributions q
Accuracy vs. speed trade-off
Q is a compromise
Larger Q → smaller approximation error, but a harder (slower) optimization problem
Some common variational families (e.g. fully factorized "mean-field" distributions, used later for the GMM)
Step 2: Define a measure of distance between distributions D[q, p]
The Kullback-Leibler divergence
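For reference, a sketch of the standard definition (q is the approximation, p the target, z the variable they are defined over):

$$ \mathrm{KL}\big[\,q \,\|\, p\,\big] \;=\; \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p(z)}\right] \;=\; \int q(z)\,\log\frac{q(z)}{p(z)}\,dz $$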
Properties: KL[q ‖ p] ≥ 0, with equality iff q = p; it is not symmetric (KL[q ‖ p] ≠ KL[p ‖ q]), so it is not a true distance metric
Step 3: Search for the closest distribution q* to the target distribution p by minimizing the distance
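In symbols (same notation as above, with D the observed data):

$$ q^{*} \;=\; \arg\min_{q \in \mathcal{Q}} \; \mathrm{KL}\big[\,q(z)\,\big\|\,p(z \mid \mathcal{D})\,\big] $$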
Let's use the KL divergence: how do we minimize it in practice?
Evidence Lower Bound (ELBO)
Thus, we define the ELBO as:
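A sketch of the usual definition, together with the decomposition of the log marginal likelihood it comes from (D the data, z the latent variables):

$$ \mathrm{ELBO}(q) \;=\; \mathbb{E}_{q(z)}\big[\log p(\mathcal{D}, z)\big] \;-\; \mathbb{E}_{q(z)}\big[\log q(z)\big] $$

$$ \log p(\mathcal{D}) \;=\; \mathrm{ELBO}(q) \;+\; \mathrm{KL}\big[\,q(z)\,\big\|\,p(z \mid \mathcal{D})\,\big] $$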
Since the marginal likelihood p(D) is constant and the KL divergence is non-negative: maximizing the ELBO is equivalent to minimizing the KL divergence, and ELBO(q) ≤ log p(D).
Why is the ELBO a useful lower bound? It does not depend on the intractable posterior, only on the joint distribution of D and z, which we can evaluate.
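To make the recipe concrete, here is a minimal sketch on a toy conjugate model (the model, data, and variational family below are assumptions chosen for illustration, not taken from the lecture); the exact posterior is available in this case, so the approximation can be checked:

```python
# Minimal sketch of variational inference on a toy model (illustration only):
#   Prior:       theta ~ N(0, 1)
#   Likelihood:  x_i | theta ~ N(theta, 1)
#   Family Q:    q(theta) = N(m, s^2), parameterized by (m, log s)
# We maximize a Monte Carlo estimate of the ELBO (reparameterization trick,
# with fixed noise draws so the objective is deterministic).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)   # observed data
eps = rng.normal(size=500)                    # fixed standard-normal draws

def neg_elbo(params):
    m, log_s = params
    s = np.exp(log_s)
    theta = m + s * eps                       # samples theta ~ q(theta)
    log_prior = norm.logpdf(theta, loc=0.0, scale=1.0)
    log_lik = norm.logpdf(x[:, None], loc=theta, scale=1.0).sum(axis=0)
    log_q = norm.logpdf(theta, loc=m, scale=s)
    return -(log_prior + log_lik - log_q).mean()   # minus the ELBO estimate

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]))
m_hat, s_hat = res.x[0], float(np.exp(res.x[1]))

# Exact conjugate posterior, for comparison: N(sum(x) / (n + 1), 1 / (n + 1))
n = len(x)
print(f"VI approximation: mean = {m_hat:.3f}, std = {s_hat:.3f}")
print(f"Exact posterior:  mean = {x.sum() / (n + 1):.3f}, std = {np.sqrt(1 / (n + 1)):.3f}")
```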
Let's see an application of Variational Inference through the Gaussian Mixture Model!
Unsupervised machine learning, like clustering and density estimation
Can we divide the dataset into K groups?
[Definition] Gaussian Mixture Model (GMM)
Probabilistic model that assumes all the data points are generated from a weighted mixture of a finite number of Gaussian distributions:
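In formula (the standard form, with mixing weights π_k):

$$ p(x) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\big(x \mid m_k, \Sigma_k\big), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 $$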
Classical approach: fitting Gaussian Mixtures with Expectation-Maximization algorithm
Iterative process based on Maximum Likelihood Estimation
Alternates an expectation (E) step and a maximization (M) step (see the sketch below)
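As an illustration, a sketch using scikit-learn's GaussianMixture, which fits a GMM by EM (the synthetic data and K = 3 are assumptions made for the example):

```python
# Sketch: fitting a GMM with the classical EM algorithm via scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn from three Gaussian clusters (illustration only).
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.7, size=(100, 2)),
    rng.normal(loc=[0, 4], scale=0.6, size=(100, 2)),
])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)          # E and M steps run inside fit()

print("mixing weights:", gmm.weights_.round(3))
print("means:\n", gmm.means_.round(2))
```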
Limitations of EM algorithm
Overfitting in maximum likelihood (a component can collapse onto a single data point, driving the likelihood to infinity)
Selection of K
Sensitive to initialization
Bayesian Approach
No more overfitting: the priors regularize the component parameters
Use the predictive density for model selection to find the optimal K
Still sensitive to initialization...
Bayesian formulation of the GMM, parameterized with the precision matrix Λ (the inverse covariance):
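That is, the same mixture density written with precisions (a sketch; m_k and Λ_k are the mean and precision of component k):

$$ p(x) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\big(x \mid m_k, \Lambda_k^{-1}\big) $$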
Introducing latent variables in GMM
The distribution of the latent variables is
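In the usual one-hot encoding (z_n ∈ {0,1}^K with exactly one entry equal to 1), this is a categorical distribution governed by the mixing weights:

$$ p(z_n \mid \pi) \;=\; \prod_{k=1}^{K} \pi_k^{\,z_{nk}} $$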
Why latent? We cannot observe it; we only have access to the observations x_n
What does it represent? It indicates to which cluster a given observation belongs
About the Priors (a standard conjugate choice is sketched after this list)
For the mixing weights, π
For the binary one-hot latent variables
For the precision matrix, Λ
The Wishart distribution is the multivariate generalization of the Gamma distribution
For the mean m of each component
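A sketch of the standard conjugate choices (the usual Bayesian GMM formulation; the hyperparameters α₀, m₀, β₀, W₀, ν₀ are generic symbols, not values from the lecture):

$$ p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0), \qquad p(\Lambda_k) = \mathcal{W}(\Lambda_k \mid W_0, \nu_0), \qquad p(m_k \mid \Lambda_k) = \mathcal{N}\!\big(m_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\big) $$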
About the Joint distribution
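Under the model above, it factorizes as (a sketch; X = {x_n}, Z = {z_n}):

$$ p(X, Z, \pi, m, \Lambda) \;=\; p(X \mid Z, m, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(m \mid \Lambda)\, p(\Lambda) $$

$$ p(X \mid Z, m, \Lambda) \;=\; \prod_{n=1}^{N}\prod_{k=1}^{K} \mathcal{N}\!\big(x_n \mid m_k, \Lambda_k^{-1}\big)^{z_{nk}} $$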
About the Posterior
Computing the exact posterior requires summing over ALL possible configurations of the latent variables (K^N of them) → intractable!
So we use variational inference instead
APPLICATION OF VARIATIONAL INFERENCE
Variational family Q = factorized approximation
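i.e. the standard mean-field split between the latent assignments and the parameters:

$$ q(Z, \pi, m, \Lambda) \;=\; q(Z)\, q(\pi, m, \Lambda) $$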
We want to minimize the KL divergence, i.e. maximize the ELBO:
We have everything needed for Variational Inference.
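In practice this kind of mean-field scheme is available off the shelf, e.g. in scikit-learn's BayesianGaussianMixture; a minimal sketch (the synthetic data, the over-specified K = 10, and the prior value are assumptions made for the example):

```python
# Sketch: variational inference for a Bayesian GMM via scikit-learn.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.7, size=(100, 2)),
    rng.normal(loc=[0, 4], scale=0.6, size=(100, 2)),
])

# Deliberately over-specify K: unused components should get near-zero weights.
bgmm = BayesianGaussianMixture(
    n_components=10,
    covariance_type="full",
    weight_concentration_prior=0.01,   # small value favors few active components
    max_iter=500,
    random_state=0,
).fit(X)

print("posterior mixing weights:", bgmm.weights_.round(3))
print("effective components:", (bgmm.weights_ > 0.01).sum())
```

With a small weight concentration prior, superfluous components typically end up with near-zero posterior weights, which is what makes deliberately increasing K workable.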
Example for K = 3
Model Selection with predictive density
Optimal number of components: 3
What about increasing K on purpose?
Application in image clustering:
K=4 classes: dogs, berries, flowers and birds
PCA applied for dimensionality reduction
We can distinguish specific sub-clusters in the four main classes, with common visual patterns:
Peacocks
Red berries
Small and very cute light dogs