This post explores how to create a simple neural network to learn a linear function and a non-linear function using both standard R and the Torch library for R.
In this post we will see how to train a neural network model using R and the torch R library, which provides the functionality of PyTorch without any dependency on Python. Torch provides tensors (n-dimensional arrays), automatic differentiation of tensors, optimisation routines, and additional helpers for common deep learning tasks such as computer vision and audio processing.
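As a quick illustration of tensors and automatic differentiation, here is a minimal sketch (this snippet is not from the original post; it only assumes the torch package is installed):
library(torch)
# A tensor that tracks gradients
x1 <- torch_tensor(c(1, 2, 3), requires_grad = TRUE)
y1 <- x1$pow(2)$sum()  # y1 = sum(x1^2)
y1$backward()          # reverse-mode automatic differentiation
x1$grad                # dy1/dx1 = 2 * x1, i.e. 2, 4, 6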
A neural network is built of layers. A layer consists of a set of weights, which are the parameters of the layer, and an activation function (which can have additional parameters too). A single layer looks like the linear predictor in a generalised linear model,
\[ \eta = g(x^Tw), \]
where \(\eta\) is the linear predictor, \(g\) is the link function and \(w\) represents the coefficients of the covariates \(x\). In machine learning, \(x\) is referred to simply as the input, \(w\) are the weights and \(g\) is the activation function. There is often an additional parameter, termed the intercept in generalised linear models and the bias in machine learning. This can be rolled into the weight vector by appending a one to the input / covariates. We can encapsulate this logic in a function
layer <- function(input, weight, activation)
activation(input %*% weight)
We also require a loss function to understand how well our model fits the data. For a regression problem we can use the squared error loss.
loss <- function(pred, y) {
sum((y - pred)^2)
}
We will define a simple linear regression problem: the observations are noisy measurements of a straight line,
\[y \sim \mathcal{N}(x^Tw, 2^2),\]
where the true weights are \(w = (3, 2)^T\), i.e. an intercept of 3 and a slope of 2.
x <- seq(-10, 10, length.out=100)
y <- rnorm(100, mean = 2 * x + 3, sd = 2.0)
qplot(x, y, geom = "point")
We wish to learn the relationship between the inputs \(x\) and the outputs \(y\) using the model. To understand how well the model fits the observed data, we make a prediction by passing the observed inputs to the model (which is defined as a single layer), then calculate how far the prediction is from the observed outputs using the squared loss function. The activation function is linear, i.e. the identity function (function(x) x).
w <- matrix(c(1, 1))
x_intercept <- cbind(1, x)
prediction <- layer(x_intercept, w, function(x) x)
loss(prediction, y)
[1] 4476.152
To find the best fit for this model to the observed data, we need to manipulate the weights to reduce the loss function. We can think of the weights as parameterising a family of related models, some of which may fit the data well.
To find the extrema of a function in calculus, we calculate the derivative and determine the points at which the slope is equal to zero. This finds both maxima and minima, so we can additionally calculate the second derivative: if it is positive we have a minimum. This works when the derivative can be solved analytically, as in linear regression; for general non-linear optimisation we have to be more creative. We can use gradient descent, taking steps in the opposite direction of the gradient, to find a minimum of a non-linear function
gradient_step <- function(learning_rate, params, grad_params)
params - learning_rate * grad_params
We must calculate the derivative of the loss with respect to the weight parameters. For a single-layer network with a linear (identity) activation and squared loss, writing the padded inputs as the rows of a matrix \(X\), the derivative is
\[\frac{d\,\text{loss}}{dw} = \frac{d}{dw} (y - Xw)^T(y - Xw) = -2X^T(y-Xw).\]
We can encapsulate this derivative as a function.
network_gradient <- function(x, y, w) {
-2 * t(x) %*% (y - x %*% w)
}
We can check the analytically calculated gradient against Torch's automatic differentiation. First, we evaluate our hand-coded gradient at some random test values.
x_test <- matrix(rnorm(2))
w <- matrix(rnorm(2))
y_test <- matrix(rnorm(1))
network_gradient(t(x_test), y_test, w)
[,1]
[1,] 0.31105026
[2,] -0.07820897
Then we write the forward function of the model to compute the prediction pred and calculate the loss using the methods available on Torch tensors. We can then call backward(), which performs reverse-mode automatic differentiation, and access the grad attribute of any tensor created with requires_grad = TRUE.
network_gradient_torch <- function(x, y, w) {
  # Forward pass: prediction and squared-error loss built from tensor operations
  pred <- w$t()$mm(x)
  loss <- (y - pred)$pow(2)$sum()
  # Reverse-mode automatic differentiation populates w$grad
  loss$backward()
  w$grad
}
network_gradient_torch(x = torch_tensor(as.matrix(x_test)),
y = torch_tensor(as.matrix(y_test)),
w = torch_tensor(as.matrix(w), requires_grad = TRUE))
torch_tensor
0.3111
-0.0782
[ CPUFloatType{2,1} ]
We can fit this simple model by writing a training loop which updates the parameters using gradient descent. We keep track of the loss function at each iteration of gradient descent and plot it.
observed <- tibble(x, y) %>%
sample_n(50)
training_loop <- function(x, y, init_w, epochs, learning_rate) {
  # Record all changes to the parameters and the training loss
  ws <- matrix(NA_real_, nrow = epochs + 1, ncol = length(init_w))
  ws[1, ] <- init_w
  losses <- numeric(epochs)
  # Pad the input with 1 for the intercept/bias
  input <- cbind(rep(1, times = length(x)), x)
  for (i in seq_len(epochs)) {
    # Make a prediction using the model
    pred <- layer(input, ws[i, ], function(z) z)
    # Calculate and record the loss
    losses[i] <- loss(pred, y)
    # Calculate the gradient at the current weights
    gradient <- network_gradient(x = input, y = y, w = ws[i, ])
    # Update the weights using gradient descent
    ws[i + 1, ] <- gradient_step(learning_rate, ws[i, ], gradient)
  }
  list(weights = ws, losses = losses)
}
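The call to the training loop is not shown in the post; a minimal sketch might look like the following, where the initial weights, epoch count and learning rate are assumptions (200 epochs matches the out$weights[201, ] indexing used below):
# Hypothetical call: initial weights, epochs and learning rate are not given in the post
out <- training_loop(
  x = observed$x,
  y = observed$y,
  init_w = c(0, 0),
  epochs = 200,
  learning_rate = 1e-4
)
# Plot the training loss at each epoch
qplot(seq_along(out$losses), out$losses, geom = "line") +
  labs(x = "Epoch", y = "Squared loss")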
Since this model is so small, consisting of only two weights, we can plot the actual straight line learned by the model using geom_abline.
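The plotting code isn't shown; assuming the hypothetical out object from the training call above, a sketch would be:
# Scatter of the observations with the learned line (final intercept and slope)
qplot(observed$x, observed$y, geom = "point") +
  geom_abline(intercept = out$weights[201, 1],
              slope = out$weights[201, 2])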
We can try to use this model on a simple non-linear regression problem; of course, we probably won't do very well here! We define the regression problem as
\[y \sim \mathcal{N}(4\sin(x), 1^2).\]
We plot the true function and the observed values in red below.
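The code constructing the non_linear data set isn't shown above; a minimal sketch, assuming the column names x, y (the true function) and y_obs (the noisy observations) used in the plotting code below, might be:
# Hypothetical construction of the non_linear data used below
non_linear <- tibble(
  x = seq(-10, 10, length.out = 100),
  y = 4 * sin(x),
  y_obs = rnorm(100, mean = 4 * sin(x), sd = 1)
)
# Plot the true function and the noisy observations
qplot(non_linear$x, non_linear$y, geom = "line", colour = "truth") +
  geom_point(data = non_linear, aes(x, y_obs, colour = "observed")) +
  theme(legend.position = "none")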
We can then plot the learned function against the observed values and the true function. We can see that a straight line is not a good fit for this data; we need more flexibility in the network.
qplot(non_linear$x, non_linear$y, geom = "line", colour = "truth") +
  geom_point(data = non_linear, aes(x, y_obs, colour = "observed")) +
  geom_abline(intercept = out$weights[201, 1], slope = out$weights[201, 2]) +
  theme(legend.position = "none")
If we want to approximate a non-linear function, we are best off using non-linear activation functions. Rather than deriving gradients by hand, we can calculate the derivative of each layer using automatic differentiation with the R torch library. We initialise a torch tensor with the same values as x and pass it through the layers, re-writing the layer and loss functions assuming the input is a torch_tensor. First we re-write the linear example using Torch
model <- nn_linear(1, 1)
loss <- function(y, pred) {
(y - pred)$pow(2)$sum()
}
train_torch <- function(x, y, model, loss, epochs) {
  alpha <- 1e-4  # learning rate
  x_in <- torch_tensor(as.matrix(x))
  y_in <- torch_tensor(as.matrix(y))
  for (i in seq_len(epochs)) {
    # Forward pass and loss
    pred <- model(x_in)
    losses <- loss(y_in, pred)
    # Zero the accumulated gradients, then back-propagate
    model$zero_grad()
    losses$backward()
    if (i %% 10 == 0)
      cat("Epoch: ", i, " Loss: ", losses$item(), "\n")
    # Gradient descent step on each parameter, without tracking these operations
    with_no_grad({
      model$parameters %>%
        purrr::walk(function(param) param$sub_(alpha * param$grad))
    })
  }
  model
}
trained_model <- train_torch(scale(observed$x), scale(observed$y), model, loss, 200)
Epoch: 10 Loss: 94.43447
Epoch: 20 Loss: 77.64619
Epoch: 30 Loss: 63.87258
Epoch: 40 Loss: 52.57226
Epoch: 50 Loss: 43.30109
Epoch: 60 Loss: 35.69468
Epoch: 70 Loss: 29.45408
Epoch: 80 Loss: 24.33403
Epoch: 90 Loss: 20.13331
Epoch: 100 Loss: 16.68685
Epoch: 110 Loss: 13.8592
Epoch: 120 Loss: 11.53925
Epoch: 130 Loss: 9.635837
Epoch: 140 Loss: 8.074168
Epoch: 150 Loss: 6.792887
Epoch: 160 Loss: 5.741646
Epoch: 170 Loss: 4.879141
Epoch: 180 Loss: 4.171486
Epoch: 190 Loss: 3.590877
Epoch: 200 Loss: 3.114503
Now we try the non-linear example with multiple layers and a non-linear activation function after the first layer (the layer that receives the input). We'll also use a different optimiser, Adam.
train_torch <- function(x, y, model, loss, epochs, learning_rate) {
  x_in <- x
  # The Adam optimiser handles the parameter updates for us
  optimiser <- optim_adam(model$parameters, lr = learning_rate)
  for (i in seq_len(epochs)) {
    # Forward pass and loss
    pred <- model(x_in)
    losses <- loss(y, pred)
    # Zero the accumulated gradients, back-propagate, then step the optimiser
    model$zero_grad()
    losses$backward()
    if (i %% 10 == 0)
      cat("Epoch: ", i, " Loss: ", losses$item(), "\n")
    optimiser$step()
  }
  model
}
model <- nn_sequential(
nn_linear(1, 64),
nn_relu(),
nn_linear(64, 1)
)
trained_model <-
train_torch(torch_tensor(as.matrix(non_linear$x)),
torch_tensor(as.matrix(non_linear$y_obs)),
model,
nnf_mse_loss,
100,
0.1)
Epoch: 10 Loss: 6.005287
Epoch: 20 Loss: 2.650542
Epoch: 30 Loss: 0.5133222
Epoch: 40 Loss: 0.4129792
Epoch: 50 Loss: 0.2928792
Epoch: 60 Loss: 0.22907
Epoch: 70 Loss: 0.2131358
Epoch: 80 Loss: 0.1997202
Epoch: 90 Loss: 0.1865921
Epoch: 100 Loss: 0.1764488
We can plot the predictions alongside the observed values and the true function.
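The plotting code isn't included in the post; a minimal sketch, assuming the hypothetical non_linear data frame constructed earlier, might be:
# Predictions from the trained network (no gradients needed)
with_no_grad({
  preds <- as.numeric(as.array(trained_model(torch_tensor(as.matrix(non_linear$x)))))
})
# Truth, observations and the network's predictions on one plot
ggplot(non_linear, aes(x)) +
  geom_line(aes(y = y, colour = "truth")) +
  geom_point(aes(y = y_obs, colour = "observed")) +
  geom_line(aes(y = preds, colour = "predicted")) +
  theme(legend.position = "none")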
For attribution, please cite this work as
Law (2021, Feb. 2). Bayesian Statistics and Functional Programming: Neural Networks in R. Retrieved from https://jonnylaw.rocks/posts/2021-02-02-neural-networks-in-r/
BibTeX citation
@misc{law2021neural,
  author = {Law, Jonny},
  title = {Bayesian Statistics and Functional Programming: Neural Networks in R},
  url = {https://jonnylaw.rocks/posts/2021-02-02-neural-networks-in-r/},
  year = {2021}
}