Using MC Dropout to get probability intervals for neural network predictions.

Recently, I have been reading Probabilistic Deep Learning which introduces Bayesian methods for fitting Neural Networks using Tensorflow and Keras. One case study consists of a non-linear regression problem for a sinusoidal curve. Additionally, the curve is considered to have heteroskedastic variance, meaning the variance changes along the domain of the function. In this post I will consider approximating a non-linear function with constant (homoskedastic) variance and quantifying the uncertainty. I will be using PyTorch - because why not.

Uncertainty is an important concept in machine learning, if we have confidence in the predictions of model we are able to make more informed decisions. However, neural networks trained using traditional back-propagation typically return a single point estimate by training to minimise a loss function. There have been many attempts to incorporate uncertainty into neural networks, the most principled way is to fit a Bayesian Neural Network (BNNs) by placing prior distributions on the weights and calculating the posterior distribution using Bayes’ theorem:

\[p(\theta| y) = \frac{p(\theta)p(y|\theta)}{\int_\theta p(y|\theta)p(\theta)}\]

Where \(\theta\) represents the parameters in the neural network, ie. the weights and biases. \(p(\theta)\) is the prior distribution, and \(p(y|\theta)\) is the likelihood. These can all be specified. However, calculating the evidence, the denominator in the equation, \(\int_\theta p(y|\theta)p(\theta)\) is intractable analytically in most real-world problems, including BNNs. The gold standard for computing these integrals for real-world problems is MCMC. In the case of modern deep learning models, the parameter space is too high dimensional (ie. there are too many parameters!) for these methods to be computationally feasible (although (Izmailov et al. 2021) did use HMC for fitting NNs to the CIFAR-10 image dataset and IMDB review dataset using 512 TPUv3 devices). Hence, we look to approximate methods, such as variational inference, which can place additional assumptions on the form on the prior and posterior distributions. There are also non-Bayesian methods (such as deep ensembles) which can be evaluated to have desirable properties such as calibration. From the Wikipedia for statistical calibration “As Philip Dawid puts it,”a forecaster is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent“.”

In this post we will investigate a single easy to implement method for uncertainty estimation, MC dropout (Gal and Ghahramani 2016), applied to a regression problem.

Let’s start by creating simulating data from a simple model:

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu_i, \sigma^2), \\ \mu_i &= a \sin(x_i + b). \end{aligned} \]

Where \(a=2\), \(b=-3\) and \(\sigma=1\).

```
import numpy as np
import matplotlib.pyplot as plt
def generate_data(xs):
= 5
a = 3
b = lambda x: a * np.sin(x + b)
mean = [(x, np.random.normal(loc=mean(x), scale=1, size=1))
data for x in xs]
= zip(*data)
x, y return np.array(x), np.array(y)
= np.random.uniform(0, 10, size=100)
xs = generate_data(xs)
x, y plt.scatter(x, y)
```

Create a neural net which can learn the non-linear function. We’ll use the class `nn.ModuleList`

to allow us to experiment with different numbers of hidden layers. According to the MC Dropout paper, we must apply dropout to each layer in order for the procedure to be equivalent to variational inference - under a set of assumptions.

```
import torch.nn as nn
import torch
import torch.nn.functional as F
class NonLinearRegression(nn.Module):
def __init__(self, dropout, hidden_size, hidden_layers):
super(NonLinearRegression, self).__init__()
self.input = nn.Linear(1, hidden_size)
self.dropout = nn.Dropout(dropout)
self.linears = nn.ModuleList(
for i in range(hidden_layers)])
[nn.Linear(hidden_size, hidden_size) self.dropouts = nn.ModuleList(
for i in range(hidden_layers)])
[nn.Dropout(dropout) self.output = nn.Linear(hidden_size, 1)
def forward(self, x):
= self.input(x)
x = F.relu(x)
x = self.dropout(x)
x for l, d in zip(self.linears, self.dropouts):
= l(x)
x = F.relu(x)
x = d(x)
x = self.output(x)
x return x
```

Split the data randomly into training and testing.

```
from sklearn.model_selection import train_test_split
= train_test_split(np.array(list(x)), np.array(list(y))) x_train, x_test, y_train, y_test
```

Test the untrained model input and output shapes by making a prediction using the training data.

```
= NonLinearRegression(0.5, 32, 1)
model = torch.tensor(x_train, dtype=torch.float).unsqueeze(-1)
x_input model(x_input).shape
```

`torch.Size([75, 1])`

Use Skorch to avoid writing the training boilerplate. Use the `.fit`

method provided by Skorch.

```
from skorch.regressor import NeuralNetRegressor
= 500
epochs
= NeuralNetRegressor(
net =NonLinearRegression,
module=0.1,
module__dropout=64,
module__hidden_size=3,
module__hidden_layers=torch.optim.AdamW,
optimizer=True,
iterator_train__shuffle=500,
max_epochs=0
verbose
)
= torch.tensor(x_train, dtype=torch.float).unsqueeze(-1)
x_train_input =torch.float)) net.fit(x_train_input, torch.tensor(y_train, dtype
```

```
<class 'skorch.regressor.NeuralNetRegressor'>[initialized](
module_=NonLinearRegression(
(input): Linear(in_features=1, out_features=64, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linears): ModuleList(
(0): Linear(in_features=64, out_features=64, bias=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): Linear(in_features=64, out_features=64, bias=True)
)
(dropouts): ModuleList(
(0): Dropout(p=0.1, inplace=False)
(1): Dropout(p=0.1, inplace=False)
(2): Dropout(p=0.1, inplace=False)
)
(output): Linear(in_features=64, out_features=1, bias=True)
),
)
```

We can visualise the learning curve for this model below.

We can see in the figure below that the predictions are in the right region, but using only a point prediction we miss out on capturing the uncertainty.

```
= torch.tensor(x_test, dtype=torch.float).unsqueeze(-1)
x_test_input = net.predict(x_test_input) preds
```

We can represent the aleatoric uncertainty, ie. the uncertainty inherent in the data using a Normal distribution. We know the observation distribution of the data generating process is Normal with a standard deviation (called scale in PyTorch) of 1.0. We can learn this additional parameter using PyTorch, to do this we can introduce a new loss function and alter the forward function of the Neural Network module to include the standard deviation parameter. First we’ll consider how to write the new loss function, we need to calculate the log-likelihood of the observations then we wish to maximise this. Neural networks in PyTorch have optimisers with minimise the loss function, hence we will minimise the negative log-likelihood. Let’s first consider the probability density function of a Normal distribution with mean \(\mu \in \mathbb{R}\) and standard deviation \(\sigma \in \mathbb{R}^+\):

\[p(y | \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y-\mu)^2\right)\]

We have multiple observations which are independent and normally distributed, hence for each batch we will need to calculate the product of the likelihood:

\[p(\mathbb{y}|\mu, \sigma) = \prod_{i=1}^B \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right),\]

Where \(B\) is the batch size, this involves multiplying small numbers together which can result in arithmetic underflow, hence we work on the log-scale which changes the calculation to addition:

\[\log p(\mathbb{y}|\mu, \sigma) = - \frac{B}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^B (y_i-\mu)^2 \]

In the case of the Neural net, \(\mu\) is the output and \(\sigma\) is an additional parameter. We can code the log-likelihood by converting this function into PyTorch or using the build in distributions.

```
from torch.distributions import Normal
import math
= NonLinearRegression(0.1, 64, 3)
normal_model = model(x_input)
inputs = torch.tensor(y_train, dtype=torch.float)
labels
= inputs
mu = torch.tensor(0.0, dtype=torch.float)
log_sigma
# Calculate manually
= - 0.5 * math.log(2.0 * math.pi) - \
manual_ll - 0.5 * ((labels - mu) / log_sigma.exp()) ** 2
log_sigma
# Use the built in log_prob function
= Normal(mu, log_sigma.exp()).log_prob(labels)
log_likelihood
sum(), manual_ll.sum() log_likelihood.
```

`(tensor(-597.9736, grad_fn=<SumBackward0>), tensor(-597.9736, grad_fn=<SumBackward0>))`

We can modify the `nn.Module`

to return an additional parameter representing the log of the standard deviation of the Normal distribution. Additionally, the forward model will return the predicted mean and the global standard deviation - this is necessary when using Skorch, since the standard form of the loss function is `def loss(y_pred, y_true)`

.

```
import torch.nn as nn
import torch
import torch.nn.functional as F
class NonLinearRegressionNormal(nn.Module):
def __init__(self, dropout, hidden_size, hidden_layers):
super(NonLinearRegressionNormal, self).__init__()
self.input = nn.Linear(1, hidden_size)
self.dropout = nn.Dropout(dropout)
self.linears = nn.ModuleList(
for i in range(hidden_layers)])
[nn.Linear(hidden_size, hidden_size) self.dropouts = nn.ModuleList(
for i in range(hidden_layers)])
[nn.Dropout(dropout) self.output = nn.Linear(hidden_size, 1)
self.log_sigma = nn.Parameter(torch.tensor(1.0))
def forward(self, x):
= self.input(x)
x = F.relu(x)
x = self.dropout(x)
x for l, d in zip(self.linears, self.dropouts):
= l(x)
x = F.relu(x)
x = d(x)
x = self.output(x)
x return x, self.log_sigma
```

The standard deviation, `sigma`

, is constant, so in a custom training loop (not using skorch) we could simply pass the model definition to the loss function directly and extract the `sigma`

parameter as follows. This would mean the `forward`

function of the `nn.Module`

can remain the same as the first model. The loss function could look like the following:

```
def normal_pdf(model, X, y_true):
= model(X)
preds return Normal(preds, model.log_sigma.exp()).log_prop(y_true).sum
```

Then create a loss function as an `nn.Module`

:

```
from skorch.regressor import NeuralNetRegressor
class NormalLoss(nn.Module):
def __init__(self):
super(NormalLoss, self).__init__()
def forward(self, inputs, labels):
= inputs
mu, log_sigma = Normal(mu, log_sigma.exp()).log_prob(labels)
log_likelihood
return - log_likelihood.sum()
```

Train the model as usual:

```
= 500
epochs
= NeuralNetRegressor(
net_normal =NonLinearRegressionNormal,
module=0.1,
module__dropout=64,
module__hidden_size=3,
module__hidden_layers=torch.optim.AdamW,
optimizer=True,
iterator_train__shuffle=NormalLoss,
criterion=epochs,
max_epochs=0
verbose
)
= torch.tensor(x_train, dtype=torch.float).unsqueeze(-1)
x_train_input =torch.float)) net_normal.fit(x_train_input, torch.tensor(y_train, dtype
```

```
<class 'skorch.regressor.NeuralNetRegressor'>[initialized](
module_=NonLinearRegressionNormal(
(input): Linear(in_features=1, out_features=64, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linears): ModuleList(
(0): Linear(in_features=64, out_features=64, bias=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): Linear(in_features=64, out_features=64, bias=True)
)
(dropouts): ModuleList(
(0): Dropout(p=0.1, inplace=False)
(1): Dropout(p=0.1, inplace=False)
(2): Dropout(p=0.1, inplace=False)
)
(output): Linear(in_features=64, out_features=1, bias=True)
),
)
```

Then calculate some predictions, the actual function is plotted using blue, with the predictions and prediction interval plotted in red. This is an MLE solution with a 95% confidence interval.

We have looked at how to incorporate aleatoric uncertainty, to understand the uncertainty in the parameters (epistemic uncertainty) we can use MC Dropout. Dropout is a method of avoiding overfitting at training time by removing “connections” in a neural network. However, if we leave dropout on when making predictions, then we create an ensemble of models which output slightly different predictions. It turns out that this is equivalent Bayesian variational inference with some assumptions. We can then calculate the mean and and uncertainty intervals we wish.

We can easily implement MC dropout and visualise the uncertainty provided with this method. First, we extract the module from the Skorch `NeuralNet`

class and put it in train mode, this means we make predictions with dropout ON.

` net_normal.module_.train()`

```
NonLinearRegressionNormal(
(input): Linear(in_features=1, out_features=64, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linears): ModuleList(
(0): Linear(in_features=64, out_features=64, bias=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): Linear(in_features=64, out_features=64, bias=True)
)
(dropouts): ModuleList(
(0): Dropout(p=0.1, inplace=False)
(1): Dropout(p=0.1, inplace=False)
(2): Dropout(p=0.1, inplace=False)
)
(output): Linear(in_features=64, out_features=1, bias=True)
)
```

Let’s write a convenience function for making multiple predictions and combining them into a single numpy array.

```
def get_predictions(model, x, n):
# Make multiple predictions
= zip(*[model(x) for _ in range(n)])
mus, sigmas
# Sample from the observation distribution using each mean and the global sigma
return [Normal(mu, log_sigma.exp()).sample().detach().numpy() for mu in list(mus)]
```

We then have a collection of predictions which we can use to calculate summaries using monte carlo. We can calculate the expectation by calculating the mean of all the predictions, and probability intervals by ordering the predictions and selecting the appropriate values, using `np.quantile`

.

```
def get_probability_interval(preds, interval):
= (1. - interval) / 2
lower_int = interval + lower_int
upper_int
# Calculate percentiles and mean
= np.quantile(preds, lower_int, axis=0)
lower = np.mean(preds, axis=0)
mean = np.quantile(preds, upper_int, axis=0)
upper
return lower, mean, upper
```

First, we will predict a single point, calculate the mean and the 89% probability interval.

```
= 5.0
test_point = get_predictions(net_normal.module_,
preds =torch.float).unsqueeze(-1),
torch.tensor(test_point, dtype1000)
= get_probability_interval(preds, 0.89) lower, mean, upper
```

Next, we can use the same method to predict several points and give the impression of a function. If we extend the \(\sin\) function we can see that the uncertainty on the inputs which are out of the domain of the training examples is quite large. We can evaluate the calibration of these intervals in domain and out of domain by creating a new training split which omits data in a certain interval.

```
= np.linspace(-5, 15, 50)
x_in
= get_predictions(
preds
net_normal.module_, =torch.float).unsqueeze(-1),
torch.tensor(x_in, dtype1000)
```

We can calculate the coverage of the probability interval using an experiment. We first train a neural network generating data from the same noisy function. We have \(\mu = 5\sin(x + 3)\) and the observations corrupted by Gaussian noise, \(y_i \sim \mathcal{N}(\mu, 1^2)\). Then we sample 100 random uniform test points between 0 and 10 and make probabilistic predictions by calculating 1,000 predictions using MC Dropout and calculating 95% probability intervals. This should account for epistemic uncertainty in the parameters and aleatoric uncertainty inherent in the observed data. We then calculate the proportion of predictions which fall into the interval, this should be close to 95%.

```
def coverage_experiment():
# Generate training data
= np.linspace(0, 10, 50)
x_train = generate_data(x_train)
_, y_train
# Fit model to training data
= torch.tensor(x_train, dtype=torch.float).unsqueeze(-1)
x_train_input =torch.float))
net_normal.fit(x_train_input, torch.tensor(y_train, dtype
# Generate testing data from the same generative model
= 100
n_test = np.random.uniform(0, 10, n_test)
x_test = generate_data(x_test)
_, y_test
net_normal.module_.train()
# Calculate predictions on test data
= get_predictions(
preds
net_normal.module_,=torch.float).unsqueeze(-1),
torch.tensor(x_test, dtype1000)
= get_probability_interval(preds, 0.95)
lower, mean, upper
# Calculate proportion of predictions which fall into interval
= sum([l <= y <= u for y, l, u in zip(y_test, lower, upper)])
in_interval return in_interval / n_test
```

Let’s repeat this experiment 100 times and plot a histogram of the resulting coverage.

Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In *International Conference on Machine Learning*, 1050–59. PMLR.

Izmailov, Pavel, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Wilson. 2021. “What Are Bayesian Neural Network Posteriors Really Like?” *arXiv Preprint arXiv:2104.14421*.

For attribution, please cite this work as

Law (2021, Aug. 16). Bayesian Statistics and Functional Programming: Uncertainty in Neural Networks. Retrieved from https://jonnylaw.rocks/posts/2021-08-16-mc-dropout-uncertainty/

BibTeX citation

@misc{law2021uncertainty, author = {Law, Jonny}, title = {Bayesian Statistics and Functional Programming: Uncertainty in Neural Networks}, url = {https://jonnylaw.rocks/posts/2021-08-16-mc-dropout-uncertainty/}, year = {2021} }