Let’s Learn: Neural Nets #5 — Layers

The fifth instalment in my journey through neural nets, this time focusing on arranging nodes into layers.

Bradley Stephen Shaw
8 min read · Mar 23, 2022

Lasagna — my favourite kind of layer!

So far, I’ve learned about nodes² and activation functions³, and also covered weights and biases⁴ — basically, the building blocks of a neural network.

It now feels like I have an understanding of the individual elements of a neural net and need to understand how they fit together to form a single model.

Which brings me to part five of my learning series: layering in neural networks.

Layers

We know the following:

  1. Nodes are the building blocks of neural nets. We can think of nodes as functions which receive inputs (weights and biases), apply a rule (activation function), and pass on the output.
  2. Nodes are organised in some — sensible — manner… but we don’t really know how, or why.

Let’s fix that.
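Before moving on, point 1 above can be sketched as a tiny function. This is a minimal illustration only — the ReLU activation and the example numbers are my own choices, not anything prescribed:

```python
def node(inputs, weights, bias):
    """A single node: weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted sum
    return max(0.0, z)  # ReLU activation — one common choice among many

# Three input features with illustrative weights and a bias (made-up numbers)
output = node(inputs=[0.5, -1.2, 3.0], weights=[0.4, 0.1, -0.2], bias=0.05)
print(output)  # 0.0 — the weighted sum (about -0.47) is negative, so ReLU clips it
```

A layer, as we’ll see, is simply a collection of these functions sitting side by side.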

What are layers?

From this great StackOverflow answer⁷:

Layer is a general term that applies to a collection of ‘nodes’ operating together at a specific depth within a neural network.

The input layer contains your raw data (you can think of each variable as a ‘node’).

The hidden layer(s) are where the black magic happens in neural networks. Each layer is trying to learn different aspects about the data by minimizing an error/cost function. The most intuitive way to understand these layers is in the context of ‘image recognition’ such as a face. The first layer may learn edge detection, the second may detect eyes, third a nose, etc. This is not exactly what is happening but the idea is to break the problem up into components that different levels of abstraction can piece together much like our own brains work (hence the name ‘neural networks’).

The output layer is the simplest, usually consisting of a single output for classification problems. Although it is a single ‘node’ it is still considered a layer in a neural network as it could contain multiple nodes.

V7 Labs⁶ builds on this:

Input Layer

The input layer takes raw input from the domain. No computation is performed at this layer. Nodes here just pass on the information (features) to the hidden layer.

Hidden Layer

As the name suggests, the nodes of this layer are not exposed. They provide an abstraction to the neural network.

The hidden layer performs all kinds of computation on the features entered through the input layer and transfers the result to the output layer.

Output Layer

It’s the final layer of the network that brings the information learned through the hidden layer and delivers the final value as a result.
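Putting the three layer types together, a forward pass can be sketched in a few lines of NumPy. This is a minimal illustration — the layer sizes, random weights, and helper names are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, params):
    """Pass an input vector through each layer in turn."""
    a = x                              # input layer: just passes the features along
    for W, b in params[:-1]:
        a = relu(W @ a + b)            # hidden layer(s): compute and transform
    W_out, b_out = params[-1]
    return W_out @ a + b_out           # output layer: delivers the final value

# A network with 4 input features, one hidden layer of 3 nodes, and 1 output
layer_shapes = [(4, 3), (3, 1)]
params = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in layer_shapes]

y = forward(np.array([1.0, 0.5, -0.3, 2.0]), params)
print(y.shape)  # (1,) — a single prediction
```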

Why do we need layers?

It sounds like each layer tries to learn a subtly different facet of the modelling task.

Since I work in an industry which tries to model quite difficult problems, does that mean that I should just build larger and deeper neural nets?

Probably not! Like most things in data science, we would likely need to balance predictive power and complexity — I imagine that the choice of model shape and size (in terms of hidden layers and nodes) affects the model in various ways.

Too few layers and nodes could result in under-fitting — that is, producing a model which is not able to adequately capture the nuance of the task at hand.

Conversely, too many layers and nodes could result in over-fitting — that is, producing a model which is predictive on the data it has been built on but produces poor predictions when exposed to previously unseen data. I would imagine that we can mitigate this by using approaches like cross-validation, early stopping and potentially some form of pruning (questionable really, as this might be a hangover from my time using tree-based models).
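Early stopping, for example, boils down to a simple loop over validation losses. This is a toy sketch with a made-up loss curve — real frameworks offer this as a built-in callback:

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch to stop at: training halts once the validation
    loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0  # new best — reset the counter
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs — stop training
    return best_epoch

# Validation loss improves, then turns upward as over-fitting sets in
losses = [1.0, 0.7, 0.5, 0.45, 0.47, 0.50, 0.55, 0.60]
print(early_stopping(losses))  # 3 — the epoch with the lowest validation loss (0.45)
```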

Too many layers and nodes could result in long training times, require enormous computational resource, and produce models which are difficult to use in a realistic environment. While more practical in nature, considerations like these shouldn’t be discarded, especially if we actually need to deploy a complex model into a live environment which is subject to strict time requirements (e.g. the need to generate an online insurance quote within a given time-frame, likely a few seconds).

Fine — we want to make our models only as complex as they need to be, bearing in mind the implications of an overly-complex model. Now, is there a “default” model shape and size?

Sizing guidelines

Articles and forum discussions recommend the following rules of thumb (bold formatting is my own).

Input layer

From a useful CrossValidated⁸ post:

Simple — every [neural network] has exactly one of them — no exceptions that I’m aware of. With respect to the number of neurons comprising this layer, this parameter is completely and uniquely determined once you know the shape of your training data. Specifically, the number of neurons comprising that layer is equal to the number of features (columns) in your data. Some [neural network] configurations add one additional node for a bias term.⁸

Output layer

From the same CrossValidated⁸ post:

Like the Input layer, every [neural network] has exactly one output layer. Determining its size (number of neurons) is simple; it is completely determined by the chosen model configuration. If the [neural network] is a regressor, then the output layer has a single node. If the [neural network] is a classifier, then it also has a single node unless softmax is used in which case the output layer has one node per class label in your model.⁸
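The output-layer rule quoted above can be written down directly. A small sketch — the function name and task labels are my own:

```python
def output_layer_size(task, n_classes=None):
    """Rule of thumb from the quoted post: one node for regression and
    plain classification; one node per class label when softmax is used."""
    if task == "regression":
        return 1
    if task == "classification":
        return 1
    if task == "softmax":
        return n_classes
    raise ValueError(f"unknown task: {task}")

print(output_layer_size("regression"))            # 1
print(output_layer_size("softmax", n_classes=5))  # 5
```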

The hidden layer(s)

There seems to be a few different ideas about how to determine the number and size of the hidden layer(s). Let’s look at a few.

From comp.ai¹⁰:

You may not need any hidden layers at all. Linear and generalized linear models are useful in a wide variety of applications (McCullagh and Nelder 1989). And even if the function you want to learn is mildly nonlinear, you may get better generalization with a simple linear model than with a complicated nonlinear model if there is too little data or too much noise to estimate the non-linearities accurately.¹⁰

In MLPs with step/threshold/Heaviside activation functions, you need two hidden layers for full generality (Sontag 1992). For further discussion, see Bishop (1995, 121–126).¹⁰

In MLPs with any of a wide variety of continuous nonlinear hidden-layer activation functions, one hidden layer with an arbitrarily large number of units suffices for the “universal approximation” property (e.g., Hornik, Stinchcombe and White 1989; Hornik 1993; for more references, see Bishop 1995, 130, and Ripley 1996, 173–180). But there is no theory yet to tell you how many hidden units are needed to approximate any given function.¹⁰

If you have only one input, there seems to be no advantage to using more than one hidden layer.¹⁰

It’s beginning to sound like using one hidden layer is usually enough, provided that it is “big enough”.

Heaton Research⁹ gives quite a succinct summary:

  • Zero hidden layers: only capable of representing linearly separable functions or decisions.
  • One hidden layer: can approximate any function that contains a continuous mapping from one finite space to another.
  • Two hidden layers: can represent an arbitrary decision boundary to arbitrary accuracy, and can approximate any smooth mapping to any accuracy.

Right — we probably only need one (or two) hidden layers in most cases. How many neurons should we include in the hidden layers?

Again, there seem to be a few “rules of thumb”…

Heaton Research⁹:

  • The number of hidden neurons should be between the size of the input layer and the size of the output layer.
  • The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
  • The number of hidden neurons should be less than twice the size of the input layer.
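These three rules of thumb can be turned into a quick helper. A sketch — the function name and return format are my own:

```python
def heaton_hidden_sizes(n_in, n_out):
    """Candidate hidden-layer sizes from the three rules of thumb above."""
    rule_1 = list(range(min(n_in, n_out), max(n_in, n_out) + 1))  # between input and output size
    rule_2 = round(2 * n_in / 3) + n_out                          # 2/3 input size, plus output size
    rule_3 = 2 * n_in - 1                                         # less than twice the input size
    return rule_1, rule_2, rule_3

between, two_thirds, upper = heaton_hidden_sizes(n_in=10, n_out=1)
print(two_thirds)  # 8 — 2/3 of 10 rounds to 7, plus 1 output
print(upper)       # 19 — anything below 20 satisfies the third rule
```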

A useful CrossValidated post¹¹:

Nh = Ns / (α · (Ni + No)),

where Ni is the number of input neurons, No the number of output neurons, Ns the number of samples in the training data, and α an arbitrary scaling factor (usually 2–10).

As explained by this excellent NN Design text, you want to limit the number of free parameters in your model (its degree, or number of nonzero weights) to a small portion of the degrees of freedom in your data. The degrees of freedom in your data is the number of samples × the degrees of freedom (dimensions) in each sample, or Ns · (Ni + No) (assuming they’re all independent). So α is a way to indicate how general you want your model to be, or how much you want to prevent overfitting.¹¹

For an automated procedure you’d start with an αα of 2 (twice as many degrees of freedom in your training data as your model) and work your way up to 10 if the error (loss) for your training dataset is significantly smaller than for your test dataset.¹¹
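That rule translates into a one-liner (a sketch using the symbols from the quote; the example numbers are made up):

```python
def hidden_neurons(n_samples, n_in, n_out, alpha=2):
    """Upper bound on hidden neurons: Nh = Ns / (alpha * (Ni + No))."""
    return n_samples / (alpha * (n_in + n_out))

# 1,000 training rows, 9 features, 1 output, alpha = 2
print(hidden_neurons(1_000, 9, 1, alpha=2))  # 50.0
```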

Or alternatively, from a different CrossValidated post¹²:

Nh = √(Ni · No)

… which is based on the geometric pyramid rule.
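And a sketch of the geometric pyramid rule (the helper name is my own; rounding to the nearest whole neuron):

```python
import math

def pyramid_rule(n_in, n_out):
    """Geometric pyramid rule: hidden size is roughly sqrt(n_in * n_out)."""
    return round(math.sqrt(n_in * n_out))

print(pyramid_rule(25, 4))  # 10 — the geometric mean of 25 and 4
```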

comp.ai¹³ goes into greater detail about how to size hidden layers — it’s well worth a read, but feels too advanced for this point in my learning. I’ll have to return to it later.

Wrapping up

Again, lots of learning and I’m quite simply staggered by the amount of material and research that’s out there.

I’m going to try and summarise:

  1. In a neural network, neurons are organised into layers.
  2. Each layer has a specific function. The input layer receives the input data, the hidden layer(s) learn the task at hand, and the output layer converts the results of the hidden layer(s) into an “answer” or prediction.
  3. There is only one input and output layer. There can be numerous hidden layers. In most cases one hidden layer is sufficient.
  4. The number of neurons in each layer varies. There are as many neurons in the input layer as there are columns in the training data. The size of the output layer is determined by the task at hand (regression, classification, softmax classification). The size of the hidden layer(s) is determined by the user. There are various “rules of thumb” available for guidance.
  5. The number and size of hidden layers need to be determined taking into account under-fitting, over-fitting, and computational cost. Too many can result in long run times, over-fitting and poor generalisation, while too few can result in under-fitting.
