Backpropagation by Example
I’ve been spending a lot of time on Lesson 3/Chapter 4 of the fast.ai course trying to really understand the fundamentals of neural networks. Conceptually, they’re pretty simple, but I wanted to take some time to work through the details.
In particular, I spent a lot of time on backpropagation. There are plenty of sites explaining the formulas and proofs for backpropagation, but surprisingly few that work through any examples with real numbers. The best was Matt Mazur’s post, but it left out the handling of biases, so I figured I’d try writing something up.
I won’t get into proofs, since there are better sources for that, like Michael Nielsen’s online book. I’ll also assume some existing understanding of neural networks. Basically, I’m hoping this will be helpful for other learners that are around where I’m at.
You’ll want to go through these examples in order, since they build on each other. And for best effect, try to do them by hand with pen, paper, and a calculator. I’ll use PyTorch to check for correctness, but won’t do anything more complex than basic tensor operations. You can find the supporting Jupyter notebook here or on Kaggle.
Loss and Activation Functions
To keep things simple, all the examples will use the same loss and activation functions. We’ll use mean squared error as our loss function since it’s easy to calculate by hand. Typically, it’s averaged over the squared differences of all the output values, but since all our examples will only have one output value, the formula we’ll use is just:
$$ \text{mse} = (\text{prediction} - \text{actual})^2 $$
For our activation function, we’ll use ReLU since it’s also easy to work with by hand:
$$ \operatorname{ReLU}(x) = \max(0, x) $$
Of note, there’s some debate about the correct definition of ReLU’s derivative since it isn’t defined at \(x = 0\). We won’t run into this situation with any of our examples, but in general I’ve been going with:
$$ \operatorname{ReLU^{\prime}}(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{if } x > 0 \end{cases} $$
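If it helps to see these as code, here is a minimal plain-Python sketch of the three functions above. The names `relu`, `relu_prime`, and `mse` are just illustrative choices of mine, and `relu_prime` assumes the convention above of returning 0 at \(x = 0\).

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def relu_prime(x):
    """Derivative of ReLU, assuming the convention that it is 0 at x = 0."""
    return 1.0 if x > 0 else 0.0


def mse(prediction, actual):
    """Squared error for a single output value."""
    return (prediction - actual) ** 2


print(relu(0.55), relu(-2.0))             # 0.55 0.0
print(relu_prime(0.55), relu_prime(0.0))  # 1.0 0.0
```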
Starting Small
To kick things off, we’ll start with a rudimentary network with just one input, one output, and no hidden layers.
In the diagram above, “i” is the input node, “o” is the output node, “b” is the bias, and “w” is the weight.
Normally, you’d use random values for the weight and bias, but we’ll use fixed values so that we can work through the examples together. We’ll use 0.5 for the input (\(i\)), 0.3 for the weight (\(w\)), and 0.4 for the bias (\(b\)). This will give us an output prediction (\(o\)) of 0.55. Let’s work through how we got that.
Forward Pass
In order to get the output prediction (\(o_{out}\)), we’ll do a forward pass through the network using the following formula:
$$ \begin{align*} o_{net} &= w * i + b \\ o_{out} &= \operatorname{ReLU}(o_{net}) \end{align*} $$
Plugging in our values, we have:
$$ \begin{align*} o_{net} &= 0.3 * 0.5 + 0.4 \\ &= 0.55 \\ \\ o_{out} &= \operatorname{ReLU}(0.55) \\ &= 0.55 \end{align*} $$
Next, we’ll calculate our loss (\(E\)) assuming the actual (a.k.a. target) value was 0.95:
$$ \begin{align*} E &= (\text{prediction} - \text{actual})^2 \\ &= (0.55 - 0.95)^2 \\ &= 0.16 \end{align*} $$
Let’s quickly check our work with PyTorch:
import torch

def activation_function(input):
    # ReLU, applied element-wise
    relu = torch.nn.ReLU()
    return relu(input)

def loss_function(prediction, actual):
    # Mean squared error (with one output, just the squared difference)
    mse_loss = torch.nn.MSELoss()
    return mse_loss(prediction, actual)

inputs = torch.tensor([0.5])
actual = torch.tensor([0.95])
output_layer_weights = torch.tensor([0.3]).requires_grad_()
output_layer_bias = torch.tensor([0.4]).requires_grad_()
output_layer_out = activation_function(inputs@output_layer_weights + output_layer_bias)
print("Prediction: ", output_layer_out)
# Prints "Prediction: tensor([0.5500], grad_fn=<ReluBackward0>)"
loss = loss_function(output_layer_out, actual)
print("Loss: ", loss)
# Prints "Loss: tensor(0.1600, grad_fn=<MseLossBackward0>)"
Backpropagation
Now that we’ve done the forward pass and calculated our loss (\(E\)), we can move on to the main event: backpropagation. To do that, we’ll need to calculate our gradients. For the weight, it’s defined by the chain rule as:
$$ \frac{\partial E}{\partial w} = \frac{\partial E}{\partial o_{out}} * \frac{\partial o_{out}}{\partial o_{net}} * \frac{\partial o_{net}}{\partial w} $$
We’ll start by calculating \(\partial E / \partial o_{out}\), which is the partial derivative of the mean squared error with respect to \(o_{out}\). Using the power rule, this becomes:
$$ \begin{align*} E &= (o_{out} - 0.95)^2 \\ \frac{\partial E}{\partial o_{out}} &= 2(o_{out} - 0.95)^{2 - 1} \\ &= 2(0.55 - 0.95) \\ &= -0.8 \end{align*} $$
Next, we’ll calculate \(\partial o_{out} / \partial o_{net}\), which is the partial derivative of \(o_{out}\) with respect to \(o_{net}\). This is the derivative of ReLU evaluated at 0.55, which is 1 (\(\operatorname{ReLU^\prime}(0.55) = 1\)). This is a big part of why I chose ReLU: every neuron in our examples has a positive input to ReLU, so this derivative will be 1 throughout.
$$ \begin{align*} \frac{\partial o_{out}}{\partial o_{net}} &= \operatorname{ReLU^\prime}(0.55) \\ &= 1 \end{align*} $$
Lastly, we’ll calculate \(\partial o_{net} / \partial w\). Neither \(i\) nor \(b\) depends on \(w\), so the \(w * i\) term differentiates to \(i\) and the \(b\) term to 0:
$$ \begin{align*} \frac{\partial o_{net}}{\partial w} &= \frac{\partial (w * i + b)}{\partial w} \\ &= i + 0 \\ &= i \\ &= 0.5 \end{align*} $$
With our numbers plugged in, we get:
$$ \begin{align*} \frac{\partial E}{\partial w} &= -0.8 * 1 * 0.5 \\ &= -0.4 \\ \end{align*} $$
For the bias, we have a similar formula:
$$ \frac{\partial E}{\partial b} = \frac{\partial E}{\partial o_{out}} * \frac{\partial o_{out}}{\partial o_{net}} * \frac{\partial o_{net}}{\partial b} $$
We have the first two values already, and \(\partial o_{net} / \partial b\) will always be 1, so we have:
$$ \begin{align*} \frac{\partial E}{\partial b} &= 2(0.55 - 0.95) * 1 * 1 \\ &= -0.8 \\ \end{align*} $$
We’ll once again check our work against PyTorch:
loss.backward()
print("Output layer weight gradient: ", output_layer_weights.grad)
# Prints "Output layer weight gradient: tensor([-0.4000])"
print("Output layer bias gradient: ", output_layer_bias.grad)
# Prints "Output layer bias gradient: tensor([-0.8000])"
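Another way to double-check these gradients, without relying on autograd at all, is to nudge each parameter slightly and see how the loss changes. The sketch below uses a central-difference estimate in plain Python; the helper names `loss` and `numerical_gradient` and the step size `eps` are my own choices for illustration.

```python
def loss(w, b, i=0.5, actual=0.95):
    """Forward pass of the one-weight network followed by squared error."""
    o_net = w * i + b
    o_out = max(0.0, o_net)  # ReLU
    return (o_out - actual) ** 2


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


w, b = 0.3, 0.4
grad_w = numerical_gradient(lambda w_: loss(w_, b), w)
grad_b = numerical_gradient(lambda b_: loss(w, b_), b)

print(round(grad_w, 4))  # -0.4
print(round(grad_b, 4))  # -0.8
```

The estimates should agree with the hand-calculated -0.4 and -0.8 to within floating-point error.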
To take things a little further, we can use these gradients to take one gradient descent step: each parameter is moved against its gradient, scaled by a learning rate. After that, we can do another forward pass to see if the loss is reduced.
learning_rate = 0.5
updated_output_layer_bias = output_layer_bias - learning_rate * output_layer_bias.grad.data
updated_output_layer_weights = output_layer_weights - learning_rate * output_layer_weights.grad.data
print("Updated weight: ", updated_output_layer_weights)
# Prints "Updated weight: tensor([0.5000], grad_fn=<SubBackward0>)"
print("Updated bias: ", updated_output_layer_bias)
# Prints "Updated bias: tensor([0.8000], grad_fn=<SubBackward0>)"
updated_output_layer_out = activation_function(inputs@updated_output_layer_weights + updated_output_layer_bias)
print("Updated prediction: ", updated_output_layer_out)
# Prints "Updated prediction: tensor([1.0500], grad_fn=<ReluBackward0>)"
loss = loss_function(updated_output_layer_out, actual)
print("Updated loss: ", loss)
# Prints "Updated loss: tensor(0.0100, grad_fn=<MseLossBackward0>)"
Cool, our loss has gone down from 0.16 to 0.01! We could repeat the forward and backward passes to further train our network, but we’ll stop here and move on.
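For the curious, here is a rough sketch of what repeating the forward and backward passes could look like in plain Python, using the same starting values and learning rate as above. The loop length of 5 is an arbitrary choice of mine.

```python
i, actual = 0.5, 0.95
w, b = 0.3, 0.4
learning_rate = 0.5

losses = []
for step in range(5):
    # Forward pass
    o_net = w * i + b
    o_out = max(0.0, o_net)  # ReLU
    losses.append((o_out - actual) ** 2)

    # Backward pass: the same chain-rule products we worked out by hand
    d_out = 2 * (o_out - actual)                 # dE/do_out
    d_net = d_out * (1.0 if o_net > 0 else 0.0)  # times ReLU'(o_net)
    grad_w = d_net * i                           # dE/dw
    grad_b = d_net                               # dE/db

    # Gradient descent update
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

for step, value in enumerate(losses):
    print(step, value)
```

The first two recorded losses should match the 0.16 and 0.01 from above (up to floating-point rounding), and the loss keeps shrinking on each later step.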
One Hidden Layer
Let’s do another example. This time, we’ll add a hidden layer (\(h\)), but still with one neuron in each layer.
Briefly, the forward pass would be:
$$ \begin{align*} h &= \operatorname{ReLU}(w_{h} * i + b_{h}) \\ &= \operatorname{ReLU}(0.3 * 0.5 + 0.4) \\ &= 0.55 \\ \\ o &= \operatorname{ReLU}(w_{o} * h + b_{o}) \\ &= \operatorname{ReLU}(0.2 * 0.55 + 0.1) \\ &= 0.21 \\ \\ E &= (0.21 - 0.95)^2 = 0.5476 \end{align*} $$
(Since the output of ReLU on a number will always be the same as the input number for our examples, we’ll drop the \(o_{net}\) and \(o_{out}\) notations and just use \(o\) in our formulas going forward.)
layer1_weights = torch.tensor([0.3]).requires_grad_()
layer1_bias = torch.tensor([0.4]).requires_grad_()
output_layer_weights = torch.tensor([0.2]).requires_grad_()
output_layer_bias = torch.tensor([0.1]).requires_grad_()
layer1_out = activation_function(inputs@layer1_weights + layer1_bias)
print("Hidden layer: ", layer1_out)
# Prints "Hidden layer: tensor([0.5500], grad_fn=<ReluBackward0>)"
output_layer_out = activation_function(layer1_out@output_layer_weights + output_layer_bias)
print("Prediction: ", output_layer_out)
# Prints "Prediction: tensor([0.2100], grad_fn=<ReluBackward0>)"
loss = loss_function(output_layer_out, actual)
print("Loss: ", loss)
# Prints "Loss: tensor(0.5476, grad_fn=<MseLossBackward0>)"
For backpropagation, we’ll take it one layer at a time, starting from the output layer. The gradient for the output layer’s weight (we’ll call it \(w_{o}\)) would be:
$$ \begin{align*} \frac{\partial E}{\partial w_{o}} &= \operatorname{mse^\prime}(0.21, 0.95) * \operatorname{ReLU^\prime}(0.21) * h \\ &= 2(0.21 - 0.95) * 1 * 0.55 \\ &= -0.814 \\ \end{align*} $$
This is the same formula as before, but using the (post-ReLU) value from the hidden layer’s node, 0.55, instead of the input layer’s value, 0.5. And the gradient for the bias (we’ll call it \(b_{o}\)) would be:
$$ \begin{align*} \frac{\partial E}{\partial b_{o}} &= \operatorname{mse^\prime}(0.21, 0.95) * \operatorname{ReLU^\prime}(0.21) * 1 \\ &= 2(0.21 - 0.95) * 1 * 1 \\ &= -1.48 \\ \end{align*} $$
To help keep things compact, we’ll start referring to \(\operatorname{mse^\prime}(0.21, 0.95)\) as just \(\operatorname{mse^\prime}\). For the hidden layer’s weight gradient, \(w_{h}\):
$$ \begin{align*} \frac{\partial E}{\partial w_{h}} &= (\operatorname{mse^\prime} * w_{o}) * \operatorname{ReLU^\prime}(0.55) * i \\ &= (2(0.21 - 0.95) * 0.2) * 1 * 0.5 \\ &= -0.148 \\ \end{align*} $$
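Here, we use almost the same formula as our simple network example, but instead of just the derivative of the mean squared error, we use the derivative of the mean squared error times the output layer’s weight. That product carries the error back through the output node to the hidden node. (Strictly, it also includes a factor of \(\operatorname{ReLU^\prime}(0.21)\) for the output node, but that’s 1, so it’s omitted.) Similarly, for the bias, \(b_{h}\):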
$$ \begin{align*} \frac{\partial E}{\partial b_{h}} &= (\operatorname{mse^\prime} * w_{o}) * \operatorname{ReLU^\prime}(0.55) * 1 \\ &= (2(0.21 - 0.95) * 0.2) * 1 * 1 \\ &= -0.296 \\ \end{align*} $$
loss.backward()
print('Gradient for output layer weights: ', output_layer_weights.grad)
# Prints "Gradient for output layer weights: tensor([-0.8140])"
print('Gradient for output layer bias: ', output_layer_bias.grad)
# Prints "Gradient for output layer bias: tensor([-1.4800])"
print('Gradient for hidden layer weights: ', layer1_weights.grad)
# Prints "Gradient for hidden layer weights: tensor([-0.1480])"
print('Gradient for hidden layer bias: ', layer1_bias.grad)
# Prints "Gradient for hidden layer bias: tensor([-0.2960])"
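The layer-by-layer pattern above can also be written out in plain Python, which makes it easier to see which value gets reused where. This is only a sketch: the `delta_o` and `delta_h` names are mine, and each one holds the gradient of the loss with respect to that neuron’s pre-ReLU value.

```python
def relu(x):
    return max(0.0, x)


def relu_prime(x):
    return 1.0 if x > 0 else 0.0


i, actual = 0.5, 0.95
w_h, b_h = 0.3, 0.4
w_o, b_o = 0.2, 0.1

# Forward pass, keeping the pre-ReLU values for the backward pass
h_net = w_h * i + b_h
h = relu(h_net)
o_net = w_o * h + b_o
o = relu(o_net)

# Output layer: mse' * ReLU'(o_net), then times the value feeding the weight
delta_o = 2 * (o - actual) * relu_prime(o_net)
grad_w_o = delta_o * h
grad_b_o = delta_o

# Hidden layer: pass delta_o back through w_o, then through the hidden ReLU
delta_h = delta_o * w_o * relu_prime(h_net)
grad_w_h = delta_h * i
grad_b_h = delta_h

print(round(grad_w_o, 4), round(grad_b_o, 4))  # -0.814 -1.48
print(round(grad_w_h, 4), round(grad_b_h, 4))  # -0.148 -0.296
```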
Two Neurons
Let’s keep building. This time, we’ll add an extra neuron to the hidden layer.
We’ll name the top neuron in the hidden layer \(h_{1}\) and the bottom neuron \(h_{2}\). For the forward pass, the formulas are:
$$ \begin{align*} h_{1} &= \operatorname{ReLU}(w_{h1} * i + b_{h1}) \\ &= \operatorname{ReLU}(0.3 * 0.5 + 0.4) \\ &= 0.55 \\ h_{2} &= \operatorname{ReLU}(w_{h2} * i + b_{h2}) \\ &= \operatorname{ReLU}(0.2 * 0.5 + 0.1) \\ &= 0.2 \\ \\ o &= \operatorname{ReLU}((w_{o1} * h_{1}) + (w_{o2} * h_{2}) + b_{o}) \\ &= \operatorname{ReLU}((0.1 * 0.55) + (0.7 * 0.2) + 0.3) \\ &= 0.495 \\ \\ E &= (0.495 - 0.95)^2 = 0.207025 \end{align*} $$
inputs = torch.tensor([0.5])
actual = torch.tensor([0.95])
layer1_weights = torch.tensor([[0.3, 0.2]]).requires_grad_()
layer1_bias = torch.tensor([0.4, 0.1]).requires_grad_()
output_layer_weights = torch.tensor([[0.1], [0.7]]).requires_grad_()
output_layer_bias = torch.tensor([0.3]).requires_grad_()
layer1_out = activation_function(inputs@layer1_weights + layer1_bias)
print("Hidden layer: ", layer1_out)
# Prints "Hidden layer: tensor([0.5500, 0.2000], grad_fn=<ReluBackward0>)"
output_layer_out = activation_function(layer1_out@output_layer_weights + output_layer_bias)
print("Prediction: ", output_layer_out)
# Prints "Prediction: tensor([0.4950], grad_fn=<ReluBackward0>)"
loss = loss_function(output_layer_out, actual)
print("Loss: ", loss)
# Prints "Loss: tensor(0.2070, grad_fn=<MseLossBackward0>)"
For backpropagation, the formulas are:
$$ \begin{align*} \frac{\partial E}{\partial w_{o1}} &= 2(0.495 - 0.95) * 1 * .55 \\ &= -0.5005 \\ \frac{\partial E}{\partial w_{o2}} &= 2(0.495 - 0.95) * 1 * .2 \\ &= -0.182 \\ \frac{\partial E}{\partial b_{o}} &= 2(0.495 - 0.95) * 1 * 1 \\ &= -0.91 \\ \\ \frac{\partial E}{\partial w_{h1}} &= (2(0.495 - 0.95) * 0.1) * 1 * 0.5 \\ &= -0.0455 \\ \frac{\partial E}{\partial w_{h2}} &= (2(0.495 - 0.95) * 0.7) * 1 * 0.5 \\ &= -0.3185 \\ \frac{\partial E}{\partial b_{h1}} &= (2(0.495 - 0.95) * 0.1) * 1 * 1 \\ &= -0.091 \\ \frac{\partial E}{\partial b_{h2}} &= (2(0.495 - 0.95) * 0.7) * 1 * 1 \\ &= -0.637 \end{align*} $$
loss.backward()
print('Gradient for output layer weights: ', output_layer_weights.grad)
# Prints "Gradient for output layer weights: tensor([[-0.5005], [-0.1820]])"
print('Gradient for output layer bias: ', output_layer_bias.grad)
# Prints "Gradient for output layer bias: tensor([-0.9100])"
print('Gradient for hidden layer weights: ', layer1_weights.grad)
# Prints "Gradient for hidden layer weights: tensor([[-0.0455, -0.3185]])"
print('Gradient for hidden layer bias: ', layer1_bias.grad)
# Prints "Gradient for hidden layer bias: tensor([-0.0910, -0.6370])"
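With two hidden neurons, each one only receives the error through its own output weight. A plain-Python sketch using lists (index 0 for \(h_{1}\), index 1 for \(h_{2}\)) shows the pattern; the variable names here are my own.

```python
def relu(x):
    return max(0.0, x)


def relu_prime(x):
    return 1.0 if x > 0 else 0.0


i, actual = 0.5, 0.95
w_h, b_h = [0.3, 0.2], [0.4, 0.1]  # weights and biases into h1, h2
w_o, b_o = [0.1, 0.7], 0.3         # weights from h1, h2 into o, and o's bias

# Forward pass
h_net = [w * i + b for w, b in zip(w_h, b_h)]
h = [relu(n) for n in h_net]
o_net = sum(w * x for w, x in zip(w_o, h)) + b_o
o = relu(o_net)

# Output layer gradients
delta_o = 2 * (o - actual) * relu_prime(o_net)
grad_w_o = [delta_o * x for x in h]
grad_b_o = delta_o

# Hidden layer gradients: each neuron uses only its own output weight
delta_h = [delta_o * w * relu_prime(n) for w, n in zip(w_o, h_net)]
grad_w_h = [d * i for d in delta_h]
grad_b_h = delta_h

print([round(g, 4) for g in grad_w_o], round(grad_b_o, 4))
print([round(g, 4) for g in grad_w_h], [round(g, 4) for g in grad_b_h])
```

The printed values should line up with the hand-calculated gradients above.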
Two Hidden Layers
Let’s conclude with one last example. This time, we’ll have two hidden layers. We’ll address the first hidden layer’s neurons with \(h_{1}\) and \(h_{2}\), and the second hidden layer’s neurons with \(j_{1}\) and \(j_{2}\).
$$ \begin{align*} h_{1} &= \operatorname{ReLU}(0.3 * 0.5 + 0.4) = 0.55 \\ h_{2} &= \operatorname{ReLU}(0.2 * 0.5 + 0.1) = 0.2 \\ \\ j_{1} &= \operatorname{ReLU}((0.1 * 0.55) + (0.6 * 0.2) + 0.3) \\ &= 0.475 \\ j_{2} &= \operatorname{ReLU}((0.8 * 0.55) + (0.7 * 0.2) + 0.2) \\ &= 0.78 \\ \\ o &= \operatorname{ReLU}((0.4 * 0.475) + (0.6 * 0.78) + 0.3) \\ &= 0.958 \\ \\ E &= (0.958 - 0.95)^2 = 0.000064 \end{align*} $$
inputs = torch.tensor([0.5])
actual = torch.tensor([0.95])
layer1_weights = torch.tensor([[0.3, 0.2]]).requires_grad_()
layer1_bias = torch.tensor([0.4, 0.1]).requires_grad_()
layer2_weights = torch.tensor([[0.1, 0.8], [0.6, 0.7]]).requires_grad_()
layer2_bias = torch.tensor([0.3, 0.2]).requires_grad_()
output_layer_weights = torch.tensor([[0.4], [0.6]]).requires_grad_()
output_layer_bias = torch.tensor([0.3]).requires_grad_()
layer1_out = activation_function(inputs@layer1_weights + layer1_bias)
print("Layer 1 output: ", layer1_out)
# Prints "Layer 1 output: tensor([0.5500, 0.2000], grad_fn=<ReluBackward0>)"
layer2_out = activation_function(layer1_out@layer2_weights + layer2_bias)
print("Layer 2 output: ", layer2_out)
# Prints "Layer 2 output: tensor([0.4750, 0.7800], grad_fn=<ReluBackward0>)"
output_layer_out = activation_function(layer2_out@output_layer_weights + output_layer_bias)
print("Prediction: ", output_layer_out)
# Prints "Prediction: tensor([0.9580], grad_fn=<ReluBackward0>)"
loss = loss_function(output_layer_out, actual)
print("Loss: ", loss)
# Prints "Loss: tensor(6.4001e-05, grad_fn=<MseLossBackward0>)"
For backpropagation, the formulas for the output layer’s gradients are:
$$ \begin{align*} \frac{\partial E}{\partial w_{o1}} &= 2(0.958 - 0.95) * 1 * .475 \\ &= 0.0076 \\ \frac{\partial E}{\partial w_{o2}} &= 2(0.958 - 0.95) * 1 * .78 \\ &= 0.01248 \\ \frac{\partial E}{\partial b_{o}} &= 2(0.958 - 0.95) * 1 * 1 \\ &= 0.016 \end{align*} $$
For the second hidden layer, we’ll use \(w_{j11}\) and \(w_{j12}\) to refer to the two weights feeding into the first neuron (\(j_{1}\)) and \(w_{j21}\) and \(w_{j22}\) to refer to the two weights feeding into the second neuron, \(j_{2}\). The gradients for the set of weights and bias feeding into the first neuron, \(j_{1}\), are:
$$ \begin{align*} \frac{\partial E}{\partial w_{j11}} &= (2(0.958 - 0.95) * 0.4) * 1 * 0.55 \\ &= 0.00352 \\ \frac{\partial E}{\partial w_{j12}} &= (2(0.958 - 0.95) * 0.4) * 1 * 0.2 \\ &= 0.00128 \\ \frac{\partial E}{\partial b_{j1}} &= (2(0.958 - 0.95) * 0.4) * 1 * 1 \\ &= 0.0064 \end{align*} $$
The gradients for the set of weights and bias feeding into the second neuron, \(j_{2}\), are:
$$ \begin{align*} \frac{\partial E}{\partial w_{j21}} &= (2(0.958 - 0.95) * 0.6) * 1 * 0.55 \\ &= 0.00528 \\ \frac{\partial E}{\partial w_{j22}} &= (2(0.958 - 0.95) * 0.6) * 1 * 0.2 \\ &= 0.00192 \\ \frac{\partial E}{\partial b_{j2}} &= (2(0.958 - 0.95) * 0.6) * 1 * 1 \\ &= 0.0096 \end{align*} $$
The formulas to calculate the gradients for the first hidden layer get pretty long, so we’ll start by defining some variables to keep things readable. To start, we’ve seen \(2(0.958 - 0.95)\) used throughout. This is the derivative of the mean squared error, \(\operatorname{mse}^\prime(0.958, 0.95)\):
$$ \begin{align*} \frac{\partial E}{\partial o} &= \operatorname{mse}^\prime(0.958, 0.95) \\ &= 2(0.958 - 0.95) \\ &= 0.016 \end{align*} $$
Next, we’ll take the \(\operatorname{mse}^\prime\) value from above and walk the path back from the output node to \(h_{1}\), multiplying the weights as we go. There are two paths to get to \(h_{1}\), so we’ll start with the topmost path:
$$ \begin{align*} a &= \operatorname{mse}^\prime(0.958, 0.95) * w_{o1} * w_{j11} \\ &= 0.016 * 0.4 * 0.1 \\ &= 0.00064 \end{align*} $$
Next, we’ll do the bottom path to \(h_{1}\):
$$ \begin{align*} b &= \operatorname{mse}^\prime(0.958, 0.95) * w_{o2} * w_{j21} \\ &= 0.016 * 0.6 * 0.8 \\ &= 0.00768 \end{align*} $$
With these done, we just need to add the two values together and multiply the sum by \(\operatorname{ReLU}^\prime\) (always 1 in our examples) and the input value, 0.5, to get our answer:
$$ \begin{align*} \frac{\partial E}{\partial w_{h1}} &= (a + b) * 1 * i \\ &= (0.00064 + 0.00768) * 1 * 0.5 \\ &= 0.00416 \end{align*} $$
We can do something similar to what we’ve seen before for \(h_{1}\)’s bias:
$$ \begin{align*} \frac{\partial E}{\partial b_{h1}} &= (a + b) * 1 * 1 \\ &= (0.00064 + 0.00768) * 1 * 1 \\ &= 0.00832 \end{align*} $$
Now we just need to do \(h_{2}\)’s weight and bias. Like before, we’ll start by taking the \(\operatorname{mse}^\prime\) value and walking the topmost path to \(h_{2}\):
$$ \begin{align*} c &= \operatorname{mse}^\prime(0.958, 0.95) * w_{o1} * w_{j12} \\ &= 0.016 * 0.4 * 0.6 \\ &= 0.00384 \end{align*} $$
And then the bottom-most path:
$$ \begin{align*} d &= \operatorname{mse}^\prime(0.958, 0.95) * w_{o2} * w_{j22} \\ &= 0.016 * 0.6 * 0.7 \\ &= 0.00672 \end{align*} $$
And then, like before:
$$ \begin{align*} \frac{\partial E}{\partial w_{h2}} &= (c + d) * 1 * 0.5 \\ &= (0.00384 + 0.00672) * 1 * 0.5 \\ &= 0.00528 \\ \\ \frac{\partial E}{\partial b_{h2}} &= (c + d) * 1 * 1 \\ &= (0.00384 + 0.00672) * 1 * 1 \\ &= 0.01056 \end{align*} $$
loss.backward()
print('Gradient for output layer weights: ', output_layer_weights.grad)
# Prints "Gradient for output layer weights: tensor([[0.0076], [0.0125]])"
print('Gradient for output layer bias: ', output_layer_bias.grad)
# Prints "Gradient for output layer bias: tensor([0.0160])"
print('Gradient for hidden layer 2 weights: ', layer2_weights.grad)
# Prints "Gradient for hidden layer 2 weights: tensor([[0.0035, 0.0053], [0.0013, 0.0019]])"
print('Gradient for hidden layer 2 bias: ', layer2_bias.grad)
# Prints "Gradient for hidden layer 2 bias: tensor([0.0064, 0.0096])"
print('Gradient for hidden layer 1 weights: ', layer1_weights.grad)
# Prints "Gradient for hidden layer 1 weights: tensor([[0.0042, 0.0053]])"
print('Gradient for hidden layer 1 bias: ', layer1_bias.grad)
# Prints "Gradient for hidden layer 1 bias: tensor([0.0083, 0.0106])"
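Walking every path by hand gets tedious quickly, so here is a plain-Python sketch that does the same thing one layer at a time. Summing over the second hidden layer’s neurons is what adds up the separate paths (\(a + b\) and \(c + d\) above). The list layout is my own choice: `w_j[k][m]` holds the weight from \(h_{m+1}\) into \(j_{k+1}\), which is the transpose of the PyTorch `layer2_weights` tensor.

```python
def relu(x):
    return max(0.0, x)


def relu_prime(x):
    return 1.0 if x > 0 else 0.0


i, actual = 0.5, 0.95
w_h, b_h = [0.3, 0.2], [0.4, 0.1]
w_j = [[0.1, 0.6],   # w_j11, w_j12: weights into j1
       [0.8, 0.7]]   # w_j21, w_j22: weights into j2
b_j = [0.3, 0.2]
w_o, b_o = [0.4, 0.6], 0.3

# Forward pass
h_net = [w * i + b for w, b in zip(w_h, b_h)]
h = [relu(n) for n in h_net]
j_net = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(w_j, b_j)]
j = [relu(n) for n in j_net]
o_net = sum(w * x for w, x in zip(w_o, j)) + b_o
o = relu(o_net)

# Backward pass, one layer at a time
delta_o = 2 * (o - actual) * relu_prime(o_net)
delta_j = [delta_o * w_o[k] * relu_prime(j_net[k]) for k in range(2)]
# Summing over k adds up every path from the output back to each h neuron
delta_h = [
    sum(delta_j[k] * w_j[k][m] for k in range(2)) * relu_prime(h_net[m])
    for m in range(2)
]
grad_w_h = [d * i for d in delta_h]
grad_b_h = delta_h

print([round(g, 5) for g in grad_w_h])
print([round(g, 5) for g in grad_b_h])
```

The results should match the hand-calculated first hidden layer gradients above.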
Wrapping Up
Phew, so much for going another year without using calculus! If you’ve found this helpful or spot any errors, please let me know! Feel free to shoot me an e-mail at rahim@sonawalla.org.