Backpropagation

Forward Pass

The forward pass, sometimes also called forward propagation, is the process of calculating the output of a neural network given an input. This is done by running the input through the network layer by layer and applying the activation function to the output of each layer. The output of the last layer is the output of the network.

Say we have a simple neural network with three layers: an input layer of size 2, a hidden layer of size 2 and an output layer of size 1. We will use the sigmoid activation function for both the hidden layer and the output layer.

A simple neural network.

Each linear layer is defined by a weight matrix $\boldsymbol{W}$ and a bias vector $\boldsymbol{b}$. The vector $\boldsymbol{z}$ contains the pre-activations and the vector $\boldsymbol{a}$ the activations, i.e. the outputs of the layer.

$$
\begin{align*}
\boldsymbol{z} &= \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b} \\
\boldsymbol{a} &= \sigma(\boldsymbol{z})
\end{align*}
$$

Let's say we have the following input, weights and biases:

| Variable | Value |
| -------- | ----- |
| $x_1$    | 0.888 |
| $x_2$    | -0.49 |
| $w_1$    | 1.76  |
| $w_2$    | 0.4   |
| $w_3$    | 0.97  |
| $w_4$    | 2.24  |
| $w_5$    | 1.86  |
| $w_6$    | -0.97 |
| $b_1$    | 0     |
| $b_2$    | 0     |
| $b_3$    | 0     |

Then we can calculate the output of the network as follows. First we calculate the output of the hidden layer.

Evaluation Trace or Wengert list??? https://pub.towardsai.net/a-gentle-introduction-to-automatic-differentiation-74e7eb9a75af

$$
\begin{align*}
a_1 &= x_1 w_1 + x_2 w_2 + b_1 \\
&= 0.888 \cdot 1.76 + (-0.49) \cdot 0.4 + 0 \\
&= 1.367 \\
h_1 &= \sigma(a_1) \\
&= \frac{1}{1 + e^{-a_1}} \\
&= \frac{1}{1 + e^{-1.367}} \\
&= 0.797
\end{align*}
$$

If we do the same for the other neuron we get $a_2 = 0.888 \cdot 0.97 + (-0.49) \cdot 2.24 + 0 = -0.236$ and $h_2 = \sigma(-0.236) = 0.441$. Now that we have our hidden layer outputs, we can calculate the output of the network.

$$
\begin{align*}
a_3 &= h_1 w_5 + h_2 w_6 + b_3 \\
&= 0.797 \cdot 1.86 + 0.441 \cdot (-0.97) + 0 \\
&= 1.055 \\
y &= \sigma(a_3) \\
&= \frac{1}{1 + e^{-a_3}} \\
&= \frac{1}{1 + e^{-1.055}} \\
&= 0.742
\end{align*}
$$
The forward pass of the simple neural network.

We can also write these calculations in matrix form, which is more efficient and easier to generalise to larger networks (a sketch of this follows below).

Todo

Do matrix form
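
As a rough sketch, the forward pass above can be written in this matrix form with NumPy (the variable names `W1`, `b1`, etc. are my own; the grouping of the weights into matrices follows the worked example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input and parameters from the table; w1, w2 feed h1, w3, w4 feed h2
# and w5, w6 feed the output.
x  = np.array([0.888, -0.49])
W1 = np.array([[1.76, 0.4],
               [0.97, 2.24]])
b1 = np.zeros(2)
W2 = np.array([[1.86, -0.97]])
b2 = np.zeros(1)

# Hidden layer: z = Wx + b, a = sigmoid(z)
h = sigmoid(W1 @ x + b1)   # approximately [0.797, 0.441]
# Output layer
y = sigmoid(W2 @ h + b2)   # approximately [0.742]
print(h, y)
```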

Backpropagation

The backpropagation algorithm is the key to training neural networks. It is the process of calculating the gradients of the loss function with respect to the weights and biases of the network. These gradients are then used to update the weights and biases using gradient descent to minimise the loss function.
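
Concretely, each gradient descent step nudges every parameter a small amount against its gradient, where $\eta$ denotes the learning rate (the symbol is only introduced here for illustration):

$$
w \leftarrow w - \eta \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}
$$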

The backpropagation algorithm is based on the chain rule from calculus, so let's start with a brief reminder of the chain rule.

Chain Rule

If we have two differentiable functions $f(x)$ and $g(x)$ and the composite function $h(x) = f(g(x))$, i.e. where the function $f$ is applied to the output of $g$, then the derivative of $h$ with respect to $x$ is given by:

$$
h'(x) = f'(g(x))\,g'(x) \quad \text{or} \quad \frac{dh}{dx} = \frac{df}{dg}\frac{dg}{dx}
$$

Notice that the denominator $dg$ is the same as the following numerator; this can be thought of as "the chain". The chain rule also makes sense intuitively if we think of $dg$ cancelling out in the numerator and denominator.

It is a simple but powerful rule that allows us to calculate the derivative of a composite function. For example, if we have $h(x) = (x^2 + 1)^3$, then we can write $h(x) = f(g(x))$ where $f(x) = x^3$ and $g(x) = x^2 + 1$. The derivative of $h$ is then given by:

$$
\begin{align*}
h'(x) &= f'(g(x))g'(x) \\
&= 3(x^2 + 1)^2 \cdot 2x \\
&= 6x(x^2 + 1)^2
\end{align*}
$$

This also works for more obvious composite functions such as $h(x) = \sin(x^2 + 1)$.
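
As a quick, purely illustrative sanity check, we can compare the chain rule result against a central-difference approximation:

```python
def h(x):
    return (x**2 + 1) ** 3

def h_prime(x):
    # The chain rule result: 6x(x^2 + 1)^2
    return 6 * x * (x**2 + 1) ** 2

# Compare against a central-difference approximation at a few points.
eps = 1e-6
for x in [-1.5, 0.3, 2.0]:
    numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
    print(x, h_prime(x), numeric)  # the two derivative values should agree closely
```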

The key takeaway is that the derivative of a composite function can be calculated step by step, working through one function of the composition at a time. This is the key idea behind backpropagation, as a neural network is just one big composite function with lots of variables and lots of inner functions.
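
For the small network from the forward pass section this composite function can be written out explicitly (using the same grouping of the weights into matrices $\boldsymbol{W}_1, \boldsymbol{W}_2$ and bias vectors $\boldsymbol{b}_1, \boldsymbol{b}_2$ as in the sketch above):

$$
y = \sigma\bigl(\boldsymbol{W}_2\,\sigma(\boldsymbol{W}_1\boldsymbol{x} + \boldsymbol{b}_1) + \boldsymbol{b}_2\bigr)
$$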

TODO: Multiple variables

show the idea. The chain rule. then the full derivation.

Show nice reformulation of the loss function to make the calculation easier.
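
Here is a hand-rolled sketch of backpropagation through the toy network from the forward pass section. The note does not fix a loss function, so this assumes a squared-error loss $L = \frac{1}{2}(y - t)^2$ with a made-up target $t = 1$; all variable names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same network and values as in the forward pass section.
x  = np.array([0.888, -0.49])
W1 = np.array([[1.76, 0.4],
               [0.97, 2.24]])
b1 = np.zeros(2)
W2 = np.array([[1.86, -0.97]])
b2 = np.zeros(1)

# Forward pass, keeping the intermediate values needed for the backward pass.
z1 = W1 @ x + b1
h  = sigmoid(z1)
z2 = W2 @ h + b2
y  = sigmoid(z2)

# Assumed squared-error loss L = 0.5 * (y - t)^2 with a made-up target t.
t = 1.0

# Backward pass: apply the chain rule layer by layer, from the loss back to the parameters.
dL_dy  = y - t                  # dL/dy
dL_dz2 = dL_dy * y * (1 - y)    # sigmoid'(z2) = y * (1 - y)
dL_dW2 = np.outer(dL_dz2, h)    # gradients for the output layer weights
dL_db2 = dL_dz2
dL_dh  = W2.T @ dL_dz2          # propagate the gradient back through W2
dL_dz1 = dL_dh * h * (1 - h)
dL_dW1 = np.outer(dL_dz1, x)    # gradients for the hidden layer weights
dL_db1 = dL_dz1

print(dL_dW1)
print(dL_dW2)
```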

Vanishing and Exploding Gradients

The vanishing and exploding gradient problem occurs when training very deep neural networks. It is caused by the chain rule: the gradient of the loss function with respect to the early layers is calculated by multiplying the gradients of each layer together. If these gradients are small, multiplying them together makes them even smaller; this is the vanishing gradient problem. If they are large, multiplying them together makes them even larger; this is the exploding gradient problem.
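
As a rough back-of-the-envelope illustration, with ten layers each contributing a factor of $0.25$ (or $2$) to the product:

$$
0.25^{10} \approx 9.5 \times 10^{-7}, \qquad 2^{10} = 1024
$$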

There are many possible solutions to this problem. Some of the most common are:

  • Different activation functions such as ReLU or Leaky ReLU.
  • Batch Normalisation.
  • Residual connections, also known as skip connections.

We can see the vanishing gradient problem pretty easily by looking at the derivative of the sigmoid function.

The derivative of the sigmoid function is at most 0.25, so multiplying a factor this small into the gradient for every layer makes the overall gradient very small.
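
For reference, the derivative of the sigmoid can be written in terms of the sigmoid itself, and it attains its maximum of $0.25$ at $z = 0$:

$$
\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr), \qquad \max_z \sigma'(z) = \sigma'(0) = 0.25
$$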

Automatic Differentiation

computational graph, show how it works. Why does plus copy the gradient?

So far we have been calculating the gradients of the loss function with respect to the weights and biases by hand. This is fine for toy examples, but it is not practical for larger networks. Luckily, there is a technique called automatic differentiation that allows us to calculate the gradients automatically.

Automatic differentiation is also based on the chain rule, together with the fact that every computation in a neural network is built from elementary operations such as addition and multiplication and from elementary functions such as sine, cosine and the exponential function.

Using these elementary operations, we can build a computational graph of the neural network. A computational graph is a directed acyclic graph where the nodes represent the operations and the edges represent the flow of data. The computational graph is then used to calculate the gradients of the loss function with respect to the weights and biases.
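
As a rough sketch of how reverse-mode automatic differentiation walks such a graph (a minimal toy class, loosely in the style of micrograd; none of this code comes from the original note), notice how the addition node simply passes the incoming gradient on to both of its inputs, which is the sense in which "plus copies the gradient":

```python
class Value:
    """A scalar node in a computational graph that records how it was built."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            # d(a + b)/da = d(a + b)/db = 1, so addition passes ("copies")
            # the incoming gradient unchanged to both inputs.
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            # Multiplication scales the incoming gradient by the other operand.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def backward(self):
        # Visit the nodes in reverse topological order and apply the chain rule.
        order, seen = [], set()
        def build(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    build(parent)
                order.append(node)
        build(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# Usage: for out = x * y + y at x = 3, y = 2,
# d(out)/dx = y = 2 and d(out)/dy = x + 1 = 4.
x, y = Value(3.0), Value(2.0)
out = x * y + y
out.backward()
print(x.grad, y.grad)   # 2.0 4.0
```

Each node only needs to know its local derivative; the chain rule multiplies these local derivatives together as the gradient flows backwards through the graph.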