Let’s Learn: Neural Nets #6 — Backpropagation (or how Neural Networks learn)
The sixth instalment in my journey through neural nets, this time focusing on understanding — or trying to — how neural networks learn.
Disclaimer: if you’re here for an in-depth, mathematical explanation of backward propagation, you’re in the wrong place. In classic fashion, I’m interested in knowing just enough to be able to use the thing, so this article will be from more of a “practitioner’s” perspective.
So far, I’ve learnt about nodes¹, activation functions², weights and biases³, and layers⁴. I feel like I now have a good sense of how the constituent parts of a neural network fit together to form a single cohesive model.
Now that I know how everything is connected (see what I did there?!), I want to know how a network learns.
First up, the cost function.
The cost function
Also called a “loss function”, a cost function quite simply measures the difference between model predictions and the truth. To get a bit maths-y:
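As a generic sketch (my loose notation: N observations, predictions ŷ, truths y):

```latex
C \;=\; \frac{1}{N} \sum_{i=1}^{N} L\big(\hat{y}_i,\, y_i\big)
```

Here L is whatever per-observation measure of “wrongness” we pick; for mean squared error, for instance, L(ŷᵢ, yᵢ) = (ŷᵢ − yᵢ)².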
There are numerous cost functions which we can use, depending on the problem we’re trying to model⁵:
- Loss functions for regression include mean squared error (with root mean squared error as an extension), mean absolute error, and mean squared logarithmic error.
- Loss functions used in binary classification include binary cross-entropy, hinge loss, and squared hinge loss.
- And for multi-class classification tasks, functions like multi-class cross-entropy loss (a quick numerical sketch of two of these losses follows just below).
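To make a couple of these concrete, here’s a quick hand-rolled NumPy sketch, with toy numbers and textbook formulas rather than any particular library’s implementation:

```python
import numpy as np

# Toy regression example: truths vs predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error: the average squared difference
mse = np.mean((y_pred - y_true) ** 2)
print(f"MSE: {mse:.3f}")  # 0.375

# Toy binary classification example: labels vs predicted probabilities
labels = np.array([1.0, 0.0, 1.0, 1.0])
probs = np.array([0.9, 0.2, 0.7, 0.6])

# Binary cross-entropy: heavily penalises confident wrong answers
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print(f"BCE: {bce:.3f}")  # ~0.299
```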
Before we move on from cost functions, it’s probably worth mentioning a few things.
We know that there are many different cost functions in current use. You could probably use a function of your choice, provided that it appropriately captures the nuances of the problem being addressed. So for instance, if the problem at hand is to minimise the implied monetary cost of prediction error, then you could bake the monetary cost into the cost function itself.
However, we shouldn’t be too carefree in choosing a cost function. Since some optimisation techniques rely on gradient descent, we should probably ensure that any cost function we choose is differentiable.
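Why differentiability matters: gradient descent updates each parameter by stepping against the derivative of C. Loosely, for a single generic weight w and learning rate η:

```latex
w \;\leftarrow\; w - \eta \, \frac{\partial C}{\partial w}
```

No derivative means no gradient, and no gradient means no descent.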
Another consideration to take into account is local versus global minima. Ideally, the cost function shouldn’t contain any local minima apart from the global one, since the goal of model training is to minimise the global value of C by changing the set of model parameters. We are of course saying this assuming fixed input and target sets X and Y respectively.
Generating a set of model parameters which gives a value of C as low as possible could cause us to run into issues of over-fitting (which of course is no bueno).
Theoretically, the set of model parameters which generates the lowest possible value of C will be the “optimal” set. There are reasons, some of them practical, why we never actually arrive at this optimal set.
I can think of two: (1) using early stopping constraints to limit the likelihood of over-fitting, and (2) specifying the (manual) search space too narrowly, which risks missing the optimal parameter(s) completely.
Now, what are the parameters in a neural network? If we consider the structure of our network fixed (i.e. we will not flex the number and size of the hidden layers), then our model parameters are just the weights and biases. So to tune the network, we search for the set of weights and biases which minimises the cost function C.
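As a minimal sketch of what that means in practice, here’s a tiny fixed-structure network in NumPy (one input, two hidden nodes with sigmoid activation, one output; the numbers are entirely made up). The only things training would be allowed to change are w1, b1, w2, and b2:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fixed structure: 1 input -> 2 hidden nodes -> 1 output.
# The tunable parameters are just these weights and biases.
w1 = np.array([0.5, -0.3])  # hidden-layer weights (one per hidden node)
b1 = np.array([0.1, 0.2])   # hidden-layer biases
w2 = np.array([0.7, -0.4])  # output-layer weights
b2 = 0.05                   # output-layer bias

def predict(x):
    hidden = sigmoid(w1 * x + b1)   # each hidden node: f(w * x + b)
    return float(w2 @ hidden + b2)  # output node combines the hidden values

print(predict(1.0))  # tuning = searching over w1, b1, w2, b2 to minimise C
```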
Backward propagation of errors
… or “back propagation”, is
…a general optimization method for performing automatic differentiation of complex nested functions⁷.
But why are we talking about nested functions all of a sudden?
If we think about it, that’s exactly what a neural network really is. Remember (generally) that a node takes in inputs, weights, and a bias, and outputs a value. This value is then fed forward into another node, where the process is repeated until eventually the network generates an output (or prediction). That is, the network’s prediction really is just the output of a (likely massive) nested function.
In my weird and wonky mathematical notation:
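Or, tidied up a little: for the toy two-layer network sketched above, with hidden activation f, the prediction is literally a function evaluated inside another function:

```latex
\hat{y} \;=\; w_2 \, f\big(w_1 x + b_1\big) + b_2
```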
Without going too much further into the details and confusing things, backward propagation is essentially the derivation, via the chain rule, of the set of equations which tell us how C changes with respect to each weight and bias.
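For a flavour of what those equations look like, take the nested form above and write h = f(w₁x + b₁). The chain rule then unpacks the derivative of C with respect to an inner weight one layer at a time (scalar notation only, no vector derivatives):

```latex
\frac{\partial C}{\partial w_1}
\;=\;
\frac{\partial C}{\partial \hat{y}}
\cdot
\frac{\partial \hat{y}}{\partial h}
\cdot
\frac{\partial h}{\partial w_1}
```

Each factor is local to a single layer, and the error information flows backwards through them from the output. Hence the name.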
The actual calculation of each weight and bias is a different story, however.
Luckily (and of course) we have a number of optimisation algorithms to efficiently solve for the optimal set of weights and biases. Sanket Doshi provides a great overview in their article⁸, which I’ll summarise as follows:
Sanket ranks Adam as the best overall optimiser and recommends it for its efficiency and speed; optimisers with dynamic learning rates are suggested for sparse data; and mini-batch gradient descent is recommended if a user wants a gradient-descent-based approach.
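For a feel of what one of these optimisers actually does under the hood, here’s a hand-rolled single Adam update step. This is the textbook formulation with its usual default constants, sketched from memory rather than lifted from Sanket’s article:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of parameters w, given gradient grad (textbook form)."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias corrections (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return w, m, v

# Usage: initialise m and v at zero, then call once per training step.
w = np.array([0.5, -0.3])
m, v = np.zeros_like(w), np.zeros_like(w)
grad = np.array([0.1, -0.2])  # pretend this came out of backpropagation
w, m, v = adam_step(w, grad, m, v, t=1)
```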
Wrapping up
In this (delayed) edition, we covered:
The who, what, where, and why of cost functions. We now have a basic grasp of these important little beasties and how they play a role in a neural network.
We also saw how a neural network can be thought of as a complex nested function, and how the flexing of weights and biases actually impacts the outcome of the model. We also did a bit of (badly annotated) maths to see what backward propagation is all about. We stretched our imagination to picture how calculus’s chain rule could potentially be applied without going near any sort of vector derivative notation (#nonabla).
We took a brief look at some optimisation algorithms available for doing the heavy lifting and actually solving for the optimal set of weights and biases. We learned that we should probably start with the Adam optimiser but should be aware of other approaches which are useful in certain conditions.
Onwards and upwards!
References
1. Let’s Learn: Neural Nets #2 — Nodes | by Bradley Stephen Shaw | Medium
2. Let’s Learn: Neural Nets #3 — Activation Functions | by Bradley Stephen Shaw | Medium
3. Let’s Learn: Neural Nets #4 — Weights and Biases | by Bradley Stephen Shaw | Medium
4. Let’s Learn: Neural Nets #5 — Layers | by Bradley Stephen Shaw | Medium
5. How to Choose Loss Functions When Training Deep Learning Neural Networks (machinelearningmastery.com)
6. Backpropagation — Wikipedia
7. Backpropagation | Brilliant Math & Science Wiki
8. Various Optimization Algorithms For Training Neural Network | by Sanket Doshi | Towards Data Science