Week 02
Bayesian linear regression | Model selection using the marginal likelihood
Supervised machine learning technique
Classical approach: compute the Maximum Likelihood Estimate of the unknown regression weights by minimizing the sum-of-squares error
Example with polynomial regression:
M defines the model complexity (here: polynomial order of the model).
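A minimal sketch of this classical fit, assuming a 1-D input array x, targets t, and a polynomial design matrix (function names are illustrative):

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Design matrix with columns 1, x, x^2, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def fit_mle(x, t, M):
    """Maximum-likelihood (least-squares) weights for a polynomial of order M."""
    Phi = polynomial_design_matrix(x, M)
    # Minimizes the sum-of-squares error ||t - Phi w||^2,
    # which is equivalent to maximizing the Gaussian likelihood.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_ml
```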
Model Selection: Underfitting vs. Overfitting
→ Regularization of the error with a penalty parameter λ
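The regularized error then takes the usual ridge form (the notation below is an assumption, following standard conventions):

```latex
\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^{\mathsf{T}}\boldsymbol{\phi}(x_n)\right)^{2} + \frac{\lambda}{2}\,\mathbf{w}^{\mathsf{T}}\mathbf{w}
```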
How to handle overfitting? How to choose the optimal λ?
Bayesian approach: less prone to overfitting
Can adapt model complexity automatically
Assumption: the Gaussian noise is independent and identically distributed (i.i.d.)
About the Prior
About the Likelihood
About the Posterior
Conjugate model: prior and posterior are both multivariate normal distributions
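Assuming a zero-mean isotropic Gaussian prior with precision α and i.i.d. Gaussian noise with precision β (the zero prior mean is an assumption, but a common choice), the closed-form update is:

```latex
p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}), \qquad
p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid \mathbf{w}^{\mathsf{T}}\boldsymbol{\phi}(x_n), \beta^{-1}\right)

p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N), \qquad
\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^{\mathsf{T}}\boldsymbol{\Phi}, \qquad
\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^{\mathsf{T}}\mathbf{t}
```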
Example: linear model, where we estimate the intercept and the slope
Let's add one data point at a time:
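A minimal sketch of that sequential update for the intercept-and-slope model, assuming known α and β (names are illustrative):

```python
import numpy as np

def sequential_posterior(x, t, alpha, beta):
    """Update the Gaussian posterior over (intercept, slope) one observation at a time."""
    S_inv = alpha * np.eye(2)      # prior precision: p(w) = N(0, alpha^-1 I)
    m = np.zeros(2)                # prior mean
    for xn, tn in zip(x, t):
        phi = np.array([1.0, xn])  # basis functions: intercept and slope
        S_inv_new = S_inv + beta * np.outer(phi, phi)
        S_new = np.linalg.inv(S_inv_new)
        m = S_new @ (S_inv @ m + beta * tn * phi)
        S_inv = S_inv_new
        yield m, S_new             # posterior mean and covariance after this point
```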
About the Predictive Posterior
We average over all possible parameter values, weighted by the posterior.
Two terms in the predictive posterior variance:
first one = posterior uncertainty projected onto data space (epistemic/reducible)
second one = measurement noise (aleatoric/irreducible)
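In the same notation, the predictive distribution at a new input x_* and its variance (epistemic term first, noise term second) are:

```latex
p(t_* \mid x_*, \mathbf{t}) = \mathcal{N}\left(t_* \mid \mathbf{m}_N^{\mathsf{T}}\boldsymbol{\phi}(x_*),\; \sigma_N^{2}(x_*)\right), \qquad
\sigma_N^{2}(x_*) = \boldsymbol{\phi}(x_*)^{\mathsf{T}}\mathbf{S}_N\,\boldsymbol{\phi}(x_*) + \frac{1}{\beta}
```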
α: prior precision of the weights
If α increases, the prior has more influence on the posterior in the "prior-likelihood compromise", but the predictive distribution does not change much
If α tends to 0, then the posterior mean converges to the maximum-likelihood estimate of the weights:
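In the notation above, with Φ the design matrix:

```latex
\alpha \to 0: \qquad \mathbf{m}_N \to \left(\boldsymbol{\Phi}^{\mathsf{T}}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{\mathsf{T}}\mathbf{t} = \mathbf{w}_{\mathrm{ML}}
```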
β: precision of the measurement noise
If β increases, the likelihood has more influence on the posterior
If β tends to 0, then the posterior distribution tends to the prior distribution:
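Formally, in the same notation:

```latex
\beta \to 0: \qquad \mathbf{S}_N^{-1} \to \alpha\mathbf{I}, \quad \mathbf{m}_N \to \mathbf{0}, \quad \text{so that } p(\mathbf{w} \mid \mathbf{t}) \to p(\mathbf{w})
```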
QUESTION: HOW TO CHOOSE THESE HYPERPARAMETERS?
A fully Bayesian treatment would place priors over α and β and integrate them out, but this cannot be done analytically...
Solution: maximize the marginal likelihood (evidence) of the model
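For this Gaussian model the log marginal likelihood is available in closed form (here D denotes the number of weight parameters, i.e. M + 1 for a polynomial of order M):

```latex
\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{D}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln\left|\mathbf{S}_N^{-1}\right| - \frac{N}{2}\ln(2\pi),
\qquad
E(\mathbf{m}_N) = \frac{\beta}{2}\left\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N\right\|^{2} + \frac{\alpha}{2}\,\mathbf{m}_N^{\mathsf{T}}\mathbf{m}_N
```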
With arbitrary hyperparameters (α, β) = (10, 1)
vs.
After marginal likelihood maximization (α, β) = (0.176, 138.63)
But maximizing the marginal likelihood can be done for another purpose...
Maximize the marginal likelihood to determine the model complexity M
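A minimal sketch of this model-selection procedure, evaluating the closed-form evidence for several polynomial orders (function names and the fixed α, β defaults are illustrative; in practice α and β would be re-optimized for each M):

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Closed-form log marginal likelihood of the Bayesian linear model."""
    N, D = Phi.shape
    S_inv = alpha * np.eye(D) + beta * Phi.T @ Phi
    m = beta * np.linalg.solve(S_inv, Phi.T @ t)   # posterior mean
    E_m = 0.5 * beta * np.sum((t - Phi @ m) ** 2) + 0.5 * alpha * m @ m
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(beta) - E_m
            - 0.5 * np.linalg.slogdet(S_inv)[1] - 0.5 * N * np.log(2 * np.pi))

def select_order(x, t, orders, alpha=1e-2, beta=10.0):
    """Return the polynomial order with the highest evidence, plus all scores."""
    scores = {M: log_evidence(np.vander(x, M + 1, increasing=True), t, alpha, beta)
              for M in orders}
    return max(scores, key=scores.get), scores
```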