Week 09
Variational inference (KL divergence and ELBO) | Bayesian formulation of the Gaussian mixture model
Inference refers to using a model to make predictions and to derive estimates from observed data.
Examples: Maximum Likelihood, exact Bayesian inference, Laplace approximation, Markov Chain Monte Carlo (MCMC)
Variational inference can be much faster than MCMC, but without MCMC's asymptotic exactness guarantee...
Applicable to both continuous and discrete distributions, whereas the Laplace approximation only works for continuous ones
Offers an accuracy vs. speed trade-off
Goal: approximate a posterior distribution, our target
"Variational" = optimize functions
Idea: use a collection of "simple" distributions to get as close as possible to our target distribution by minimizing the distance
BUT HOW, IN PRACTICE?
Step 1: Define a variational family Q, a collection of "simple" approximating probability distributions q
Accuracy vs. speed trade-off
Q is a compromise
Larger Q → smaller approximation error, but a harder (slower) optimization problem
Some common variational families (e.g. fully factorized "mean-field" distributions, used later for the GMM)
Step 2: Define a measure of distance between distributions D[q, p]
The Kullback-Leibler divergence
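For reference, a sketch of the standard definition (q is the approximation, p the target, z the variable they are defined over):

$$ \mathrm{KL}\big[\,q \,\|\, p\,\big] \;=\; \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p(z)}\right] \;=\; \int q(z)\,\log\frac{q(z)}{p(z)}\,dz $$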
Properties: KL[q ‖ p] ≥ 0, with equality iff q = p; it is not symmetric (KL[q ‖ p] ≠ KL[p ‖ q]), so it is not a true distance metric
Step 3: Search for the closest distribution q* to the target distribution p by minimizing the distance
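In symbols (same notation as above, with D the observed data):

$$ q^{*} \;=\; \arg\min_{q \in \mathcal{Q}} \; \mathrm{KL}\big[\,q(z)\,\big\|\,p(z \mid \mathcal{D})\,\big] $$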
Let's use the KL divergence: how do we minimize it in practice?
Evidence Lower Bound (ELBO)
Thus, we define the ELBO as:
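A sketch of the usual definition, together with the decomposition of the log marginal likelihood it comes from (D the data, z the latent variables):

$$ \mathrm{ELBO}(q) \;=\; \mathbb{E}_{q(z)}\big[\log p(\mathcal{D}, z)\big] \;-\; \mathbb{E}_{q(z)}\big[\log q(z)\big] $$

$$ \log p(\mathcal{D}) \;=\; \mathrm{ELBO}(q) \;+\; \mathrm{KL}\big[\,q(z)\,\big\|\,p(z \mid \mathcal{D})\,\big] $$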
Since the marginal likelihood p(D) is constant and the KL divergence is non-negative: maximizing the ELBO is equivalent to minimizing the KL divergence, and ELBO(q) ≤ log p(D).
Why is the ELBO a useful lower bound? It does not depend on the intractable posterior, only on the joint distribution of D and z, which we can evaluate.
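To make the recipe concrete, here is a minimal sketch on a toy conjugate model (the model, data, and variational family below are assumptions chosen for illustration, not taken from the lecture); the exact posterior is available in this case, so the approximation can be checked:

```python
# Minimal sketch of variational inference on a toy model (illustration only):
#   Prior:       theta ~ N(0, 1)
#   Likelihood:  x_i | theta ~ N(theta, 1)
#   Family Q:    q(theta) = N(m, s^2), parameterized by (m, log s)
# We maximize a Monte Carlo estimate of the ELBO (reparameterization trick,
# with fixed noise draws so the objective is deterministic).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)   # observed data
eps = rng.normal(size=500)                    # fixed standard-normal draws

def neg_elbo(params):
    m, log_s = params
    s = np.exp(log_s)
    theta = m + s * eps                       # samples theta ~ q(theta)
    log_prior = norm.logpdf(theta, loc=0.0, scale=1.0)
    log_lik = norm.logpdf(x[:, None], loc=theta, scale=1.0).sum(axis=0)
    log_q = norm.logpdf(theta, loc=m, scale=s)
    return -(log_prior + log_lik - log_q).mean()   # minus the ELBO estimate

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]))
m_hat, s_hat = res.x[0], float(np.exp(res.x[1]))

# Exact conjugate posterior, for comparison: N(sum(x) / (n + 1), 1 / (n + 1))
n = len(x)
print(f"VI approximation: mean = {m_hat:.3f}, std = {s_hat:.3f}")
print(f"Exact posterior:  mean = {x.sum() / (n + 1):.3f}, std = {np.sqrt(1 / (n + 1)):.3f}")
```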
Let's see an application of Variational Inference through the Gaussian Mixture Model!
Unsupervised machine learning, like clustering and density estimation
Can we divide the dataset into K groups?
[Definition] Gaussian Mixture Model (GMM)
Probabilistic model that assumes all the data points are generated from a weighted mixture of a finite number of Gaussian distributions:
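In formula (the standard form, with mixing weights π_k):

$$ p(x) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\big(x \mid m_k, \Sigma_k\big), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 $$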
Classical approach: fitting Gaussian Mixtures with Expectation-Maximization algorithm
Iterative process based on Maximum Likelihood Estimation
Alternates an expectation (E) step and a maximization (M) step (see the sketch below)
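As an illustration, a sketch using scikit-learn's GaussianMixture, which fits a GMM by EM (the synthetic data and K = 3 are assumptions made for the example):

```python
# Sketch: fitting a GMM with the classical EM algorithm via scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn from three Gaussian clusters (illustration only).
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.7, size=(100, 2)),
    rng.normal(loc=[0, 4], scale=0.6, size=(100, 2)),
])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)          # E and M steps run inside fit()

print("mixing weights:", gmm.weights_.round(3))
print("means:\n", gmm.means_.round(2))
```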
Limitations of EM algorithm
Overfitting in maximum likelihood (a component can collapse onto a single data point, driving the likelihood to infinity)
Selection of K
Sensitive to initialization
Bayesian Approach
No more overfitting: the priors regularize the component parameters
Use the predictive density for model selection to find the optimal K
Still sensitive to initialization...
Bayesian formulation of the GMM, parameterized with the precision matrix Λ (the inverse covariance):
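That is, the same mixture density written with precisions (a sketch; m_k and Λ_k are the mean and precision of component k):

$$ p(x) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\big(x \mid m_k, \Lambda_k^{-1}\big) $$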
Introducing latent variables in GMM
The distribution of the latent variables is
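In the usual one-hot encoding (z_n ∈ {0,1}^K with exactly one entry equal to 1), this is a categorical distribution governed by the mixing weights:

$$ p(z_n \mid \pi) \;=\; \prod_{k=1}^{K} \pi_k^{\,z_{nk}} $$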
Why latent? We cannot observe it; we only have access to the observations x_n
What does it represent? It indicates to which cluster a given observation belongs
About the Priors (a standard conjugate choice is sketched after this list)
For the mixing weights, π
For the binary one-hot latent variables
For the precision matrix, Λ
The Wishart distribution is the multivariate generalization of the Gamma distribution
For the mean m of each component
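A sketch of the standard conjugate choices (the usual Bayesian GMM formulation; the hyperparameters α₀, m₀, β₀, W₀, ν₀ are generic symbols, not values from the lecture):

$$ p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0), \qquad p(\Lambda_k) = \mathcal{W}(\Lambda_k \mid W_0, \nu_0), \qquad p(m_k \mid \Lambda_k) = \mathcal{N}\!\big(m_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\big) $$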
About the Joint distribution
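Under the model above, it factorizes as (a sketch; X = {x_n}, Z = {z_n}):

$$ p(X, Z, \pi, m, \Lambda) \;=\; p(X \mid Z, m, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(m \mid \Lambda)\, p(\Lambda) $$

$$ p(X \mid Z, m, \Lambda) \;=\; \prod_{n=1}^{N}\prod_{k=1}^{K} \mathcal{N}\!\big(x_n \mid m_k, \Lambda_k^{-1}\big)^{z_{nk}} $$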
About the Posterior
Computing the exact posterior requires summing over ALL possible configurations of the latent variables (K^N of them) → intractable!
So we use variational inference instead
APPLICATION OF VARIATIONAL INFERENCE
Variational family Q = factorized approximation
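i.e. the standard mean-field split between the latent assignments and the parameters:

$$ q(Z, \pi, m, \Lambda) \;=\; q(Z)\, q(\pi, m, \Lambda) $$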
We want to minimize the KL divergence, i.e. maximize the ELBO:
We have everything needed for Variational Inference.
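In practice this kind of mean-field scheme is available off the shelf, e.g. in scikit-learn's BayesianGaussianMixture; a minimal sketch (the synthetic data, the over-specified K = 10, and the prior value are assumptions made for the example):

```python
# Sketch: variational inference for a Bayesian GMM via scikit-learn.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.7, size=(100, 2)),
    rng.normal(loc=[0, 4], scale=0.6, size=(100, 2)),
])

# Deliberately over-specify K: unused components should get near-zero weights.
bgmm = BayesianGaussianMixture(
    n_components=10,
    covariance_type="full",
    weight_concentration_prior=0.01,   # small value favors few active components
    max_iter=500,
    random_state=0,
).fit(X)

print("posterior mixing weights:", bgmm.weights_.round(3))
print("effective components:", (bgmm.weights_ > 0.01).sum())
```

With a small weight concentration prior, superfluous components typically end up with near-zero posterior weights, which is what makes deliberately increasing K workable.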
Example for K = 3
Model Selection with predictive density
Optimal number of components: 3
What about increasing K on purpose?
Application in image clustering:
K=4 classes: dogs, berries, flowers and birds
PCA applied for dimensionality reduction
We can distinguish specific sub-clusters in the four main classes, with common visual patterns:
Peacocks
Red berries
Small and very cute light dogs