Backpropagation, short for "backward propagation of errors," is an algorithm for supervised learning of artificial neural networks using gradient descent. Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network's weights. It is a generalization of the delta rule for perceptrons to multilayer feedforward neural networks.
The "backwards" part of the name stems from the fact that calculation of the gradient proceeds backwards through the network, with the gradient of the final layer of weights being calculated first and the gradient of the first layer of weights being calculated last. Partial computations of the gradient from one layer are reused in the computation of the gradient for the previous layer. This backwards flow of the error information allows for efficient computation of the gradient at each layer versus the naive approach of calculating the gradient of each layer separately.
Backpropagation's popularity has experienced a recent resurgence given the widespread adoption of deep neural networks for image recognition and speech recognition. It is considered an efficient algorithm, and modern implementations take advantage of specialized GPUs to further improve performance.
Backpropagation was invented in the 1970s as a general optimization method for performing automatic differentiation of complex nested functions. However, it wasn't until 1986, with the publishing of a paper by Rumelhart, Hinton, and Williams, titled "Learning Representations by Back-Propagating Errors," that the importance of the algorithm was appreciated by the machine learning community at large.
Researchers had long been interested in finding a way to train multilayer artificial neural networks that could automatically discover good "internal representations," i.e. features that make learning easier and more accurate. Features can be thought of as the stereotypical input to a specific node that activates that node (i.e. causes it to output a positive value near 1). Since a node's activation is dependent on its incoming weights and bias, researchers say a node has learned a feature if its weights and bias cause that node to activate when the feature is present in its input.
By the 1980s, hand-engineering features had become the de facto standard in many fields, especially in computer vision, since experts knew from experiments which features (e.g. lines, circles, edges, blobs in computer vision) made learning simpler. However, hand-engineering successful features requires a lot of knowledge and practice. More importantly, since it is not automatic, it is usually very slow.
Backpropagation was one of the first methods able to demonstrate that artificial neural networks could learn good internal representations, i.e. their hidden layers learned nontrivial features. Experts examining multilayer feedforward networks trained using backpropagation actually found that many nodes learned features similar to those designed by human experts and those found by neuroscientists investigating biological neural networks in mammalian brains (e.g. certain nodes learned to detect edges, while others computed Gabor filters). Even more importantly, because of the efficiency of the algorithm and the fact that domain experts were no longer required to discover appropriate features, backpropagation allowed artificial neural networks to be applied to a much wider field of problems that were previously off-limits due to time and cost constraints.
Backpropagation is analogous to calculating the delta rule for a multilayer feedforward network. Thus, like the delta rule, backpropagation requires three things:
1) Dataset consisting of input-output pairs , where is the input and is the desired output of the network on input . The set of input-output pairs of size is denoted .
2) A feedforward neural network, as formally defined in the article concerning feedforward neural networks, whose parameters are collectively denoted . In backpropagation, the parameters of primary interest are , the weight between node in layer and node in layer , and , the bias for node in layer . There are no connections between nodes in the same layer and layers are fully connected.
3) An error function, , which defines the error between the desired output and the calculated output of the neural network on input for a set of input-output pairs and a particular value of the parameters .
Training a neural network with gradient descent requires the calculation of the gradient of the error function with respect to the weights and biases . Then, according to the learning rate , each iteration of gradient descent updates the weights and biases collectively denoted according to
where denotes the parameters of the neural network at iteration in gradient descent.
What's the Target?
As mentioned in the previous section, one major problem in training multilayer feedforward neural networks is in deciding how to learn good internal representations, i.e. what the weights and biases for hidden layer nodes should be. Unlike the perceptron, which has the delta rule for approximating a well-defined target output, hidden layer nodes don't have a target output since they are used as intermediate steps in the computation.
Since hidden layer nodes have no target output, one can't simply define an error function that is specific to that node. Instead, any error function for that node will be dependent on the values of the parameters in the previous layers (since previous layers determine the input for that node) and following layers since the output of that node will affect the computation of the error function This coupling of parameters between layers can make the math quite messy (primarily as a result of using the product rule, discussed below), and if not implemented cleverly, can make the final gradient descent calculations slow. Backpropagation addresses both of these issues by simplifying the mathematics of gradient descent, while also facilitating its efficient calculation.
The formulation below is for a neural network with one output, but the algorithm can be applied to a network with any number of outputs by consistent application of the chain rule and power rule. Thus, for all the following examples, input-output pairs will be of the form , i.e. the target value is not a vector.
Remembering the general formulation for a feedforward neural network,
weight for node in layer for incoming node
bias for node in layer
product sum plus bias (activation) for node in layer
output for node in layer
number of nodes in layer
activation function for the hidden layer nodes
activation function for the output layer nodes
The error function in classic backpropagation is the mean squared error
where is the target value for input-output pair and is the computed output of the network on input . Again, other error functions can be used, but the mean squared error's historical association with backpropagation and its convenient mathematical properties make it a good choice for learning the method.
The derivation of the backpropagation algorithm is fairly straightforward. It follows from the use of the chain rule and product rule in differential calculus. Application of these rules is dependent on the differentiation of the activation function, one of the reasons the heaviside step function is not used (being discontinuous and thus, non-differentiable).
For the rest of this section, the derivative of a function will be denoted , so that the sigmoid function's derivative is .
To simplify the mathematics further, the bias for node in layer will be incorporated into the weights as with a fixed output of for node in layer . Thus,
To see that this is equivalent to the original formulation, note that
where the left side is the original formulation and the right side is the new formulation.
Using the notation above, backpropagation attempts to minimize the following error function with respect to the neural network's weights:
by calculating, for each weight the value of . Since the error function can be decomposed into a sum over individual error terms for each individual input-output pair, the derivative can be calculated with respect to each input-output pair individually and then combined at the end (since the derivative of a sum of functions is the sum of the derivatives of each function):
Thus, for the purposes of derivation, the backpropagation algorithm will concern itself with only one input-output pair. Once this is derived, the general form for all input-output pairs in can be generated by combining the individual gradients. Thus, the error function in question for derivation is
where the subscript in , , and is omitted for simplification.
Error Function Derivatives
The derivation of the backpropagation algorithm begins by applying the chain rule to the error function partial derivative
where is the activation (product-sum plus bias) of node in layer before it is passed to the nonlinear activation function (in this case, the sigmoid function) to generate the output. This decomposition of the partial derivative basically says that the change in the error function due to a weight is a product of the change in the error function due to the activation times the change in the activation due to the weight .
The first term is usually called the error, for reasons discussed below. It is denoted
The second term can be calculated from the equation for above:
Thus, the partial derivative of the error function with respect to a weight is
Thus, the partial derivative of a weight is a product of the error term at node in layer , and the output of node in layer . This makes intuitive sense since the weight connects the output of node in layer to the input of node in layer in the computation graph.
It is important to note that the above partial derivatives have all been calculated without any consideration of a particular error function or activation function. However, since the error term still needs to be calculated, and is dependent on the error function , at this point it is necessary to introduce specific functions for both of these. As mentioned previously, classic backpropagation uses the mean squared error function (which is the squared error function for the single input-output pair case) and the sigmoid activation function.
The calculation of the error will be shown to be dependent on the values of error terms in the next layer. Thus, computation of the error terms will proceed backwards from the output layer down to the input layer. This is where backpropagation, or backwards propagation of errors, gets its name.
The Output Layer
Starting from the final layer, backpropagation attempts to define the value , where is the final layer the subscript is and not because this derivation concerns a one-output neural network, so there is only one output node For example, a four-layer neural network will have for the final layer, for the second to last layer, and so on. Expressing the error function in terms of the value since is a partial derivative with respect to gives
where is the activation function for the output layer.
Thus, applying the partial derivative and using the chain rule gives
Putting it all together, the partial derivative of the error function with respect to a weight in the final layer is
The Hidden Layers
Now the question arises of how to calculate the partial derivatives of layers other than the output layer. Luckily, the chain rule for multivariate functions comes to the rescue again. Observe the following equation for the error term in layer
where ranges from to (the number of nodes in the next layer). Note that, because the bias input corresponding to is fixed, its value is not dependent on the outputs of previous layers, and thus does not take on the value .
Plugging in the error term gives the following equation:
Remembering the definition of
where is the activation function for the hidden layers,
Plugging this into the above equation yields a final equation for the error term in the hidden layers, called the backpropagation formula:
Putting it all together, the partial derivative of the error function with respect to a weight in the hidden layers for is
Backpropagation as Backwards Computation
This equation is where backpropagation gets its name. Namely, the error at layer is dependent on the errors at the next layer . Thus, errors flow backward, from the last layer to the first layer. All that is needed is to compute the first error terms based on the computed output and target output . Then, the error terms for the previous layer are computed by performing a product sum weighted by of the error terms for the next layer and scaling it by , repeated until the input layer is reached.
This backwards propagation of errors is very similar to the forward computation that calculates the neural network's output. Thus, calculating the output is often called the forward phase while calculating the error terms and derivatives is often called the backward phase. While going in the forward direction, the inputs are repeatedly recombined from the first layer to the last by product sums dependent on the weights and transformed by nonlinear activation functions and . In the backward direction, the "inputs" are the final layer's error terms, which are repeatedly recombined from the last layer to the first by product sums dependent on the weights and transformed by nonlinear scaling factors and .
Furthermore, because the computations for backwards phase are dependent on the activations and outputs of the nodes in the previous (the non-error term for all layers) and next layer (the error term for hidden layers), all of these values must be computed before the backwards phase can commence. Thus, the forward phase precedes the backward phase for every iteration of gradient descent. In the forward phase, activations and outputs will be remembered for use in the backwards phase. Once the backwards phase is completed and the partial derivatives are known, the weights and associated biases can be updated by gradient descent. This process is repeated until a local minimum is found or convergence criterion is met.
Using the terms defined in the section titled Formal Definition and the equations derived in the section titled Deriving the Gradients, the backpropagation algorithm is dependent on the following five equations:
For the partial derivatives,
For the final layer's error term,
For the hidden layers' error terms,
For combining the partial derivatives for each input-output pair,
For updating the weights,
The General Algorithm
The backpropagation algorithm proceeds in the following steps, assuming a suitable learning rate and random initialization of the parameters
1) Calculate the forward phase for each input-output pair and store the results , , and for each node in layer by proceeding from layer , the input layer, to layer , the output layer.
2) Calculate the backward phase for each input-output pair and store the results for each weight connecting node in layer to node in layer by proceeding from layer , the output layer, to layer , the input layer.
a) Evaluate the error term for the final layer by using the second equation.
b) Backpropagate the error terms for the hidden layers , working backwards from the final hidden layer , by repeatedly using the third equation.
c) Evaluate the partial derivatives of the individual error with respect to by using the first equation.
3) Combine the individual gradients for each input-output pair to get the total gradient for the entire set of input-output pairs by using the fourth equation (a simple average of the individual gradients).
4) Update the weights according to the learning rate and total gradient by using the fifth equation (moving in the direction of the negative gradient).
Backpropagation In Sigmoidal Neural Networks
The classic backpropagation algorithm was designed for regression problems with sigmoidal activation units. While backpropagation can be applied to classification problems as well as networks with non-sigmoidal activation functions, the sigmoid function has convenient mathematical properties which, when combined with an appropriate output activation function, greatly simplify the algorithm's understanding. Thus, in the classic formulation, the activation function for hidden nodes is sigmoidal and the output activation function is the identity function (the network output is just a weighted sum of its hidden layer, i.e. the activation).
Backpropagation is actually a major motivating factor in the historical use of sigmoid activation functions due to its convenient derivative:
Thus, calculating the derivative of the sigmoid function requires nothing more than remembering the output and plugging it into the equation above.
Furthermore, the derivative of the output activation function is also very simple:
Thus, using these two activation functions removes the need to remember the activation values and in addition to the output values and , greatly reducing the memory footprint of the algorithm. This is because the derivative for the sigmoid activation function in the backwards phase only needs to recall the output of that function in the forward phase, and is not dependent on the actual activation value, which is the case in the more general formulation of backpropagation where must be calculated. Similarly, the derivative for the identity activation function doesn't depend on anything since it is a constant.
Thus, for a feedforward neural network with sigmoidal hidden units and an identity output unit, the error term equations are as follows:
For the final layer's error term,
For the hidden layers' error terms,
The following code example is for a sigmoidal neural network as described in the previous subsection. It has one hidden layer and one output node in the output layer. The code is written in Python3 and makes heavy use of the NumPy library for performing matrix math. Because the calculations of the gradient for individual input-output pairs can be done in parallel, and many calculations are based on taking the dot product of two vectors, matrices are a natural way to represent the input data, output data, and layer weights. NumPy's efficient computation of matrix products and the ability to use modern GPUs (which are optimized for matrix operations) can give significant speedups in both the forward and backward phases of computation.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
X is the set of inputs and the matrix
y is the set of outputs . The number of nodes in the hidden layer can be customized by setting the value of the variable
num_hidden. The learning rate is controlled by the variable
alpha. The number of iterations of gradient descent is controlled by the variable
By changing these variables and comparing the output of the program to the target values
y, one can see how these variables control how well backpropagation can learn the dataset
y. For example, more nodes in the hidden layer and more iterations of gradient descent will generally improve the fit to the training dataset. However, using too large or too small a learning rate can cause the model to diverge or converge too slowly, respectively.