Week 13

Regression modelling with heteroscedastic noise | Deep ensembles | Last-layer Laplace approximations

First and foremost: the neural network, a deep learning model

How do we choose the parameters w?

CLASSICAL APPROACH
Maximum A Posteriori
→ pick one optimal set of parameters

BAYESIAN APPROACH
Integrate over w
→ average the predictions over all parameter values, weighted by their posterior probability
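
In symbols, the two approaches can be contrasted as below. The notation is an assumption made here for illustration: D denotes the training data, x* a new input and t* its target.

```latex
% Classical (MAP): plug in a single parameter vector
w_{\mathrm{MAP}} = \arg\max_{w} \, p(w \mid \mathcal{D}),
\qquad
p(t^* \mid x^*, \mathcal{D}) \approx p(t^* \mid x^*, w_{\mathrm{MAP}})

% Bayesian: average the predictions over w, weighted by the posterior
p(t^* \mid x^*, \mathcal{D}) = \int p(t^* \mid x^*, w) \, p(w \mid \mathcal{D}) \, \mathrm{d}w
```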

Motivation for the Bayesian approach to neural networks

The Bayesian approach can also quantify epistemic uncertainty

Epistemic uncertainty = lack of knowledge, not enough data (reducible)
Aleatoric uncertainty = measurement noise (irreducible)

The classical approach is usually overconfident in regions where data is missing

Example below: a classification problem (from Week 5)

Let's see an application of Bayesian Neural Networks!

Regression modelling with heteroscedastic noise

Context

We study regression models with a Gaussian likelihood whose mean and variance both depend on the input.

Heteroscedastic = the measurement noise (its variance) varies with the input
(Opposite) Homoscedastic = constant noise variance, independent of the input

Which Bayesian Neural Network?

→ Fully connected neural network, with two outputs y1, y2 and two hidden layers

We associate the Gaussian mean to the first output (y1) and the log variance to the second one (y2).
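
A minimal sketch of the training loss this output convention implies, assuming the network has already produced the pair (y1, y2) for each input. The function names `gaussian_nll` and `heteroscedastic_loss` are hypothetical, chosen here for illustration.

```python
import math

def gaussian_nll(target, mean, log_var):
    """Negative log-density of target under N(mean, exp(log_var))."""
    var = math.exp(log_var)  # always positive, whatever the raw output y2 is
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (target - mean) ** 2 / var)

def heteroscedastic_loss(targets, outputs):
    """Average negative log-likelihood; outputs holds one (y1, y2) pair per target."""
    total = sum(gaussian_nll(t, y1, y2) for t, (y1, y2) in zip(targets, outputs))
    return total / len(targets)
```

Predicting the log variance rather than the variance itself keeps the variance positive without constraining the second output.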

Which model parameters are we inferring?

Key Equations

About the Prior

About the Likelihood

As mentioned previously:
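
Written out under the assumption of N independent observations (x_n, t_n), and with y_1(x; w) and y_2(x; w) denoting the two network outputs (the symbols t_n and N are notation assumed here), the likelihood would read:

```latex
p(\mathcal{D} \mid w) = \prod_{n=1}^{N}
  \mathcal{N}\!\left(t_n \,\middle|\, y_1(x_n; w),\ \exp\big(y_2(x_n; w)\big)\right)
```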

About the Posterior

We cannot compute it directly, for three reasons:

the high dimension of w, the complex geometry of the posterior, and very large datasets
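
For reference, Bayes' rule gives the posterior as below, with D denoting the training data (notation assumed here). The integral in the denominator runs over every parameter of the network, which is what makes the exact posterior intractable in practice.

```latex
p(w \mid \mathcal{D}) =
  \frac{p(\mathcal{D} \mid w) \, p(w)}{\int p(\mathcal{D} \mid w') \, p(w') \, \mathrm{d}w'}
```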

Let's approximate the posterior distribution!

Alternative 1: deep ensembles

Method

Train our model S times with different random seeds, e.g. different initial parameter values
Keep the S resulting parameter vectors w and average their predictions (not the parameter vectors themselves)

Note: if S = 1, this reduces to MAP estimation

Dirac's delta function
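
One common way to write this, assuming the S trained parameter vectors are denoted w_1, ..., w_S (notation assumed here), is as an equally weighted mixture of Dirac deltas:

```latex
p(w \mid \mathcal{D}) \approx q(w) = \frac{1}{S} \sum_{s=1}^{S} \delta(w - w_s)
```

With S = 1 the mixture collapses to a single point mass at the one trained parameter vector.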

Deduce the Predictive Posterior distribution
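
As a sketch of what the resulting predictive looks like for a single input, assume each of the S trained networks returns a (mean, log variance) pair. The predictive is then an equally weighted mixture of S Gaussians, and its mean and variance follow from the law of total variance. The function name `ensemble_predictive` is hypothetical.

```python
import math

def ensemble_predictive(member_outputs):
    """Combine S (mean, log_var) pairs for one input into a predictive mean and variance."""
    S = len(member_outputs)
    means = [m for m, _ in member_outputs]
    variances = [math.exp(lv) for _, lv in member_outputs]
    mean = sum(means) / S
    # Average of the predicted noise variances (often read as the aleatoric part).
    aleatoric = sum(variances) / S
    # Spread of the member means around the ensemble mean (often read as the epistemic part).
    epistemic = sum((m - mean) ** 2 for m in means) / S
    return mean, aleatoric + epistemic, aleatoric, epistemic

# Two members that disagree on the mean but agree on unit noise variance.
mean, total_var, aleatoric, epistemic = ensemble_predictive([(1.0, 0.0), (3.0, 0.0)])
```

Where the members disagree, the second term grows, which is how the ensemble signals a lack of data.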

Application & Visualization

Alternative 2: last-layer Laplace approximation

Motivation

Recall of Laplace approximation
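
As a reminder, in the usual formulation the Laplace approximation replaces the posterior by a Gaussian centred at its mode. The notation is assumed here: w_MAP is the posterior mode and H is the Hessian of the negative log posterior evaluated at that mode.

```latex
p(w \mid \mathcal{D}) \approx \mathcal{N}\!\left(w \,\middle|\, w_{\mathrm{MAP}},\ H^{-1}\right),
\qquad
H = -\nabla_w^2 \log p(w \mid \mathcal{D}) \,\Big|_{w = w_{\mathrm{MAP}}}
```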

But the Hessian has one row and one column per parameter, so its size grows quadratically with the number of parameters: too large for a full network!
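
To give a feel for the numbers, here is a small count for a network of the shape described earlier (two hidden layers, two outputs). The layer widths are hypothetical, since the notes do not state them, and `mlp_parameter_count` is a helper named here for illustration.

```python
def mlp_parameter_count(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical widths: 1 input, two hidden layers of 50 units, 2 outputs.
layer_sizes = [1, 50, 50, 2]

full = mlp_parameter_count(layer_sizes)       # 2752 parameters in the whole network
last = mlp_parameter_count(layer_sizes[-2:])  # 102 parameters in the last layer only

full_hessian_entries = full ** 2  # 7573504 entries
last_hessian_entries = last ** 2  # 10404 entries
```

Even for this very small network, restricting the Hessian to the last layer shrinks it by a factor of several hundred.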

Idea: apply the Laplace approximation to the last layer of our NN only

Often almost as good as a full Laplace approximation
Much cheaper to compute
Can easily be applied after the model has been trained

Method

Compute the Maximum A Posteriori estimate of the parameter vector w for the full network
Apply a Laplace approximation to the last layer only

Estimate the Hessian matrix with respect to the last-layer parameters only: the weights W2 and the bias b2

Construct the Gaussian approximation over W2 and b2

Deduce the Predictive Posterior distribution
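
One way to obtain the predictive is Monte Carlo: draw last-layer parameters from the Gaussian approximation, push the fixed hidden features through each draw, and average the resulting Gaussians. The sketch below makes a simplifying assumption that is not part of the method as stated: the covariance is taken to be diagonal, so each parameter is sampled independently from its own standard deviation, whereas the inverse Hessian is in general a full matrix. All names (`sample_last_layer_outputs`, `W_map`, `W_std`, and so on) are hypothetical.

```python
import random

def sample_last_layer_outputs(features, W_map, b_map, W_std, b_std, num_samples, seed=0):
    """Sample network outputs under a diagonal Gaussian over the last layer.

    features: activations of the last hidden layer for one input (kept fixed).
    W_map, b_map: MAP weights (one row per output) and biases of the last layer.
    W_std, b_std: per-parameter standard deviations (diagonal covariance assumption).
    Returns a list of num_samples tuples, one value per network output.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        outputs = []
        for k in range(len(b_map)):
            # Draw one row of weights and one bias around their MAP values.
            w_row = [rng.gauss(m, s) for m, s in zip(W_map[k], W_std[k])]
            b = rng.gauss(b_map[k], b_std[k])
            outputs.append(sum(w * f for w, f in zip(w_row, features)) + b)
        samples.append(tuple(outputs))
    return samples
```

With two outputs, each returned tuple is a (mean, log variance) pair, so the predictive becomes an equally weighted mixture of the sampled Gaussians.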

Application & Visualization

Evaluation: the log predictive density
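
A minimal sketch of this metric, assuming the predictive for each test point is an equally weighted mixture of S Gaussians, each given as a (mean, log variance) pair (for instance one per ensemble member or per posterior sample). The function names are hypothetical. The log-sum-exp trick is used so that very small densities do not underflow.

```python
import math

def log_gaussian_pdf(target, mean, log_var):
    """Log-density of target under N(mean, exp(log_var))."""
    return -0.5 * (math.log(2.0 * math.pi) + log_var + (target - mean) ** 2 / math.exp(log_var))

def log_predictive_density(target, member_outputs):
    """log[(1/S) * sum_s N(target | mean_s, exp(log_var_s))], computed stably."""
    log_terms = [log_gaussian_pdf(target, m, lv) for m, lv in member_outputs]
    top = max(log_terms)
    log_sum = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return log_sum - math.log(len(log_terms))

def mean_log_predictive_density(targets, outputs_per_point):
    """Average log predictive density over a test set."""
    values = [log_predictive_density(t, o) for t, o in zip(targets, outputs_per_point)]
    return sum(values) / len(values)
```

Higher values are better: the score rewards predictions that are both accurate and appropriately uncertain.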
