Week 11
Black-box variational inference | Stochastic optimization
Recall - Variational Inference method
Goal: approximate a target distribution (posterior)
Idea: use a collection of "simple" distributions to get as close as possible to our target distribution by minimizing the distance
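In symbols (a standard formulation, using the t/w notation from later slides): pick the member of the variational family Q closest in KL divergence to the posterior, which is equivalent to maximizing the ELBO.

```latex
q^{*} = \arg\min_{q_{\lambda} \in \mathcal{Q}} \mathrm{KL}\big(q_{\lambda}(w)\,\|\,p(w \mid t)\big)
      = \arg\max_{q_{\lambda} \in \mathcal{Q}} \underbrace{\mathbb{E}_{q_{\lambda}}\big[\log p(t, w)\big] + \mathbb{H}\big[q_{\lambda}\big]}_{\mathrm{ELBO}(\lambda)}
```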
About the variational family
Free-form variational inference
Optimal function form given assumptions
BUT some issues:
Require model-specific derivations
Integrals may be intractable
Optimal forms may not be "well-known" distributions
Now, the goal is: maximize the ELBO over a fixed-form variational family (e.g. Gaussian), using only evaluations of the log-joint log p(t, w)
About entropy calculation
The entropy and its gradient have simple closed forms for Gaussian distributions:
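As a sketch of why this term is cheap: for a diagonal Gaussian the entropy depends only on the (log) standard deviations, and its gradient is constant (function names here are illustrative).

```python
import numpy as np

def gaussian_entropy(log_sigma):
    """Entropy of a diagonal Gaussian N(mu, diag(exp(log_sigma)**2)).

    H = d/2 * (1 + log(2*pi)) + sum_i log(sigma_i);
    note it does not depend on the mean mu at all.
    """
    d = log_sigma.shape[0]
    return 0.5 * d * (1.0 + np.log(2.0 * np.pi)) + np.sum(log_sigma)

def gaussian_entropy_grad(log_sigma):
    """Gradient of the entropy w.r.t. log_sigma: exactly 1 per coordinate
    (and 0 w.r.t. mu), so this part of the ELBO gradient is free."""
    return np.ones_like(log_sigma)
```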
Now, let's focus on the remaining term of ELBO
About the expectation of the joint distribution of t and w
Monte Carlo sampling
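The expectation E_q[log p(t, w)] rarely has a closed form, but since we can sample from q we can estimate it by Monte Carlo. A minimal sketch, assuming a hypothetical toy log-joint (standard normal prior on w, Gaussian likelihood for t):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(w, t=1.0):
    # Hypothetical toy log p(t, w): w ~ N(0, 1) prior, t ~ N(w, 1) likelihood
    # (constants dropped, as they do not affect the optimization).
    return -0.5 * w**2 - 0.5 * (t - w)**2

def mc_expectation(mu, sigma, n_samples=10_000):
    """Monte Carlo estimate of E_{q(w)}[log p(t, w)] with q = N(mu, sigma^2)."""
    w = rng.normal(mu, sigma, size=n_samples)
    return np.mean(log_joint(w))
```

For this toy model the expectation is available analytically, which makes it easy to check the estimator.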
What about the gradient?
We use the score function gradient estimator
We cannot use Monte Carlo sampling directly, since the gradient is taken with respect to the variational parameters that define the sampling distribution itself: the gradient cannot simply be moved inside the expectation
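The score function (REINFORCE) identity works around this: grad_lambda E_q[f(w)] = E_q[f(w) * grad_lambda log q(w; lambda)], and the right-hand side can be estimated by sampling. A sketch for a 1-D Gaussian family, differentiating w.r.t. the mean only (function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def score_grad_mu(f, mu, sigma, n_samples=100_000):
    """Score function estimate of d/dmu E_{N(mu, sigma^2)}[f(w)].

    Uses grad_mu log q(w) = (w - mu) / sigma^2; only evaluations of f
    are needed, so f can be a black box.
    """
    w = rng.normal(mu, sigma, size=n_samples)
    score = (w - mu) / sigma**2
    return np.mean(f(w) * score)
```

With f(w) = w**2 the true gradient is d/dmu (mu^2 + sigma^2) = 2*mu, which the estimator should recover up to sampling noise.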
Application: approximation of the posterior distribution of a linear Gaussian model
Last steps of BBVI algorithm:
"Estimate gradient"
"Update variational parameters using gradient estimate"...
BUT with a constant step-size, stochastic gradient ascent doesn't converge: the gradient noise keeps the iterates oscillating around the optimum
Now, the step-size decreases at each iteration t
Many recent methods for stochastic optimization, with Adam as the most common one
With an unbiased gradient estimator, the Robbins-Monro conditions (step-sizes satisfying sum alpha_t = infinity and sum alpha_t^2 < infinity) guarantee convergence
High gradient variance forces a small step-size, so lower variance means faster optimization
Many variational families can be re-parametrized.
Example of re-parametrization with a Gaussian distribution
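For a Gaussian, the re-parametrization writes w = mu + sigma * eps with eps ~ N(0, 1), so the variational parameters move out of the sampling distribution and the gradient can pass inside the expectation. A sketch of the resulting pathwise estimator for the mean parameter (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def reparam_grad_mu(f, df, mu, sigma, n_samples=1000):
    """Re-parametrized (pathwise) estimate of d/dmu E_{N(mu, sigma^2)}[f(w)].

    With w = mu + sigma * eps, eps ~ N(0, 1), the chain rule gives
    d/dmu E[f(w)] = E[f'(mu + sigma * eps)] -- unlike the score function
    estimator, this needs the derivative df of f.
    """
    eps = rng.normal(size=n_samples)
    return np.mean(df(mu + sigma * eps))
```

Note it already gets close to the true gradient with far fewer samples than the score function estimator needed above.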
Visual comparison of convergence for both gradient estimators after re-parametrization
Score function gradient estimator:
more general (only needs evaluations of the log-joint), but has high variance, so optimization is slow
Re-parametrized gradient estimator:
lower variance, but requires a differentiable model and is only applicable to continuous latent variables
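The variance gap in this summary can be checked empirically on a toy problem: both estimators below target d/dmu E_{N(mu,1)}[w^2] = 2*mu, but the per-sample variance of the score function estimates is much larger than that of the pathwise ones (the setup and constants are illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma, n = 1.5, 1.0, 100_000
eps = rng.normal(size=n)
w = mu + sigma * eps

# Per-sample score function estimates of d/dmu E[w^2] (true value 2*mu = 3).
score_samples = w**2 * (w - mu) / sigma**2
# Per-sample re-parametrized estimates: d/dw w^2 = 2w evaluated at w = mu + sigma*eps.
reparam_samples = 2.0 * w

print("score variance:  ", np.var(score_samples))
print("reparam variance:", np.var(reparam_samples))
```

Both sample means agree with the true gradient; only their spread differs, which is exactly what drives the convergence-speed difference.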