Week 06
Multi-class classification | Decision theory | Calibration
Classification with K classes
About the Likelihood
Categorical distributions
About the Posterior Predictive
It is another categorical distribution
→ As in the binary classification case, we use the Laplace approximation to compute the posterior, then estimate the posterior predictive probabilities by Monte Carlo sampling or the probit approximation
We predict using the most likely class, i.e. the class k that maximizes p(t* = k | t, x*)
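The Monte Carlo route above can be sketched as follows. This is a minimal illustration, assuming a Laplace-approximated Gaussian posterior over flattened softmax-regression weights; `w_map`, `Sigma`, and `x_star` are hypothetical placeholders, not values from the course.

```python
# Sketch: Monte Carlo estimate of the categorical posterior predictive.
# Assumes a Laplace posterior N(w_map, Sigma) over flattened K x D weights.
import numpy as np

rng = np.random.default_rng(0)

K, D = 3, 4                      # number of classes, input dimension
w_map = rng.normal(size=K * D)   # hypothetical MAP weights (flattened)
Sigma = 0.1 * np.eye(K * D)      # hypothetical Laplace covariance
x_star = rng.normal(size=D)      # new input

def softmax(a):
    a = a - a.max()              # numerical stability
    e = np.exp(a)
    return e / e.sum()

# Draw S weight samples from the Gaussian posterior and average the
# resulting softmax probabilities -> categorical posterior predictive.
S = 2000
samples = rng.multivariate_normal(w_map, Sigma, size=S)
probs = np.mean([softmax(w.reshape(K, D) @ x_star) for w in samples], axis=0)

prediction = int(np.argmax(probs))   # most likely class
```

Averaging the per-sample softmax outputs, rather than the logits, is what makes this an estimate of the posterior predictive distribution.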
Purpose: find the optimal decision, and know when it's better not to choose
Need to measure uncertainty
CONFIDENCE
The probability of the most likely class under the posterior predictive distribution
ENTROPY
A measure of uncertainty: zero for a point mass, maximal for a uniform distribution
REJECT OPTION
A condition on confidence to know when it is better not to choose any option
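The three notions above can be sketched in a few lines. This is an illustrative implementation, not the course's code; the 0.7 confidence threshold is an arbitrary choice.

```python
# Sketch: confidence, entropy, and a reject option for a categorical
# posterior predictive distribution p.
import numpy as np

def entropy(p):
    """Shannon entropy in nats; 0 for a one-hot distribution."""
    p = np.asarray(p, dtype=float)
    # where=p > 0 avoids log(0); those terms contribute 0 to the sum
    return -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))

def decide(p, threshold=0.7):
    """Return the most likely class, or None (reject) if confidence is low."""
    confidence = np.max(p)           # probability of the most likely class
    return int(np.argmax(p)) if confidence >= threshold else None

decide([0.9, 0.05, 0.05])   # confident -> class 0
decide([0.4, 0.35, 0.25])   # low confidence -> None (reject)
```

High entropy and low confidence both signal an input on which it may be better not to choose any option.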
Let's make it Bayesian by introducing a utility function!
Example with the 0/1-utility function, represented as a matrix indexed by true and predicted classes
How do we use the utility function in practice to predict the targets?
Compute the posterior predictive distribution p(t*|t,x*), which contains all our knowledge about the new input given the observations
Choose the target that maximizes the posterior expected utility:
Utility matrix = identity matrix → all choices have the same importance
Assign a negative utility to predicting green (1) when the true target is red (0)
→ the decision region for predicting green shrinks, so predicting red is favoured
Assign a positive utility to predicting blue (2) when the true target is yellow (3), and vice versa
→ blue and yellow decision regions merge
Set the utility of a correct green prediction to zero
→ predicting green is now worthless
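The decision rule and the effect of the utility matrix can be sketched as below. The probabilities and the penalty value are hypothetical; rows index the true class and columns the predicted class, as in the 0/1-utility matrix example.

```python
# Sketch: decision by maximizing posterior expected utility.
import numpy as np

def decide(p, U):
    """Pick the prediction k maximizing sum_j p(j) * U[j, k]."""
    expected_utility = np.asarray(p) @ np.asarray(U)
    return int(np.argmax(expected_utility))

p = np.array([0.45, 0.55])        # posterior predictive, favours class 1

U_01 = np.eye(2)                  # identity utility -> most likely class
decide(p, U_01)                   # -> 1 (green)

# Penalize predicting green (1) when the truth is red (0): the green
# decision region shrinks and the decision flips to red.
U_pen = np.array([[1.0, -2.0],
                  [0.0,  1.0]])
decide(p, U_pen)                  # -> 0 (red)
```

With the identity utility the rule reduces to picking the most likely class; changing one entry of the matrix is enough to move the decision boundary.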
How confident is our model about its predictions?
Example: a multi-class image classification
4 classes: dogs, berries, birds and flowers
Model calibration: trained model + post-processing operation = improved probability estimates
Interpretation:
Among the samples for which the model predicts "dog" with probability 0.8, I expect about 80% to actually be labelled "dog"
How to compute the calibration curve of a given class?
Compute the predictive probabilities on a validation set
Divide the [0, 1] interval into N segments of equal size
For each segment: compute the fraction of positive samples and the mean predicted probability
Plot f(predicted probability) = fraction of positive samples
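The binning steps above can be sketched as follows; the function is an illustrative implementation (scikit-learn provides a similar `calibration_curve`), applied to hypothetical validation probabilities and binary labels.

```python
# Sketch: calibration curve for one class from validation predictions.
import numpy as np

def calibration_curve(probs, labels, n_bins=10):
    """Return (mean predicted probability, fraction of positives) per bin."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each probability to one of the N equal-size segments of [0, 1]
    bins = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():                      # skip empty segments
            mean_pred.append(probs[mask].mean())
            frac_pos.append(labels[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)
```

A perfectly calibrated model gives a curve on the diagonal: the fraction of positives in each bin matches the mean predicted probability.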
Idea (Platt scaling): fit a logistic regression model σ(A·z + B) to the binary targets, using the predicted class probabilities z as inputs
Estimate optimal A and B scalar parameters using maximum likelihood
Re-calibrate predictive probabilities
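The three re-calibration steps can be sketched as below. This is a minimal maximum-likelihood fit of the two scalars by plain gradient descent (the course does not specify the optimizer); the over-confident validation scores are synthetic.

```python
# Sketch: Platt scaling — fit sigma(A*z + B) by maximum likelihood.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_platt(z, y, lr=0.1, n_steps=5000):
    """Estimate scalars A, B minimizing the negative log-likelihood."""
    A, B = 1.0, 0.0
    for _ in range(n_steps):
        p = sigmoid(A * z + B)
        grad = p - y                 # gradient of the NLL w.r.t. the logit
        A -= lr * np.mean(grad * z)
        B -= lr * np.mean(grad)
    return A, B

# Hypothetical over-confident scores: the true positive frequency
# grows more slowly than the predicted probability z.
rng = np.random.default_rng(0)
z = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < 0.5 * z + 0.25).astype(float)

A, B = fit_platt(z, y)
calibrated = sigmoid(A * z + B)      # re-calibrated probabilities
```

At the maximum-likelihood solution the mean re-calibrated probability matches the empirical positive rate, which is exactly what the calibration curve checks bin by bin.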