Feedforward neural networks are artificial neural networks where the connections between units do not form a cycle. Feedforward neural networks were the first type of artificial neural network invented and are simpler than their counterpart, recurrent neural networks. They are called feedforward because information only travels forward in the network (no loops), first through the input nodes, then through the hidden nodes (if present), and finally through the output nodes.
Feedforward neural networks are primarily used for supervised learning in cases where the data to be learned is neither sequential nor time-dependent. That is, a feedforward neural network computes a function $f$ on a fixed-size input $x$ such that $f(x) \approx y$ for training pairs $(x, y)$. On the other hand, recurrent neural networks learn sequential data, computing a function $g$ on variable-length input $(x_1, \dots, x_t)$ such that $g(x_1, \dots, x_i) \approx y_i$ for training pairs $(x_i, y_i)$ for all $1 \le i \le t$.
The simplest type of feedforward neural network is the perceptron, a feedforward neural network with no hidden units. Thus, a perceptron has only an input layer and an output layer. The output units are computed directly from the sum of the product of their weights with the corresponding input units, plus some bias.
Historically, the perceptron's output has been binary, meaning it outputs a value of $0$ or $1$. This is achieved by passing the aforementioned product sum into the step function $H(x)$. This is defined as
$$H(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases}$$
For a binary perceptron with $n$-dimensional input $x$, $n$-dimensional weight vector $w$, and bias $b$, the $1$-dimensional output $o$ is
$$o = H(w \cdot x + b).$$
Since the perceptron divides the input space into two classes, $w \cdot x + b \ge 0$ and $w \cdot x + b < 0$, depending on the values of $w$ and $b$, it is known as a linear classifier. The diagram to the right displays one such linear classifier. The line separating the two classes is known as the classification boundary or decision boundary. In the case of a two-dimensional input (as in the diagram) it is a line, while in higher dimensions this boundary is a hyperplane. The weight vector $w$ defines the slope of the classification boundary, while the bias $b$ defines the intercept.
More general single-layer perceptrons can use activation functions other than the step function $H(x)$. Typical choices are the identity function $f(x) = x$, the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, and the hyperbolic tangent $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$. Use of any of these functions ensures the output is a continuous number (as opposed to binary), and thus not every activation function yields a linear classifier.
Generally speaking, a perceptron with activation function $f$ has output
$$o = f(w \cdot x + b).$$
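This forward computation can be sketched in a few lines of NumPy; the function and variable names here are illustrative, not from the text:

```python
import numpy as np

def step(x):
    # Heaviside step function H: 1 if x >= 0, else 0
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # Sigmoid activation: 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def perceptron_output(x, w, b, f=step):
    # o = f(w . x + b)
    return f(np.dot(w, x) + b)

x = np.array([1.0, -2.0])
w = np.array([0.5, 0.5])
b = 0.1
print(perceptron_output(x, w, b))            # step activation: 0.0
print(perceptron_output(x, w, b, sigmoid))   # sigmoid activation: ~0.401
```

With the step function the output is binary; with the sigmoid it is a continuous value in $(0, 1)$.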
In order for a perceptron to learn to correctly classify a set of input-output pairs $(x, y)$, it has to adjust the weights $w$ and bias $b$ in order to learn a good classification boundary. The figure below shows several possible classification boundaries, some of which separate the two classes better than others. If the perceptron uses an activation function other than the step function (e.g. the sigmoid function), then the weights and bias should be adjusted so that the output $o$ is close to the true label $y$.
Typically, the learning process requires the definition of an error function $E$ that quantifies the difference between the computed output of the perceptron and the true value for an input, over a set of multiple input-output pairs. Historically, this error function is the mean squared error (MSE), defined for a set of $N$ input-output pairs $X = \{(x_1, y_1), \dots, (x_N, y_N)\}$ as
$$E(X) = \frac{1}{2N} \sum_{i=1}^{N} \left(o_i - y_i\right)^2,$$
where $o_i = f(w \cdot x_i + b)$ denotes the output of the perceptron on input $x_i$ with activation function $f$. The factor of $\frac{1}{2}$ is included in order to simplify the calculation of the derivative later. $E(X) = 0$ when $o_i = y_i$ for all input-output pairs $(x_i, y_i)$ in $X$, so a natural objective is to attempt to change $w$ and $b$ such that $E(X)$ is as close to zero as possible. Thus, minimizing $E(X)$ with respect to $w$ and $b$ should yield a good classification boundary.
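A minimal sketch of the mean squared error in NumPy (the function name is illustrative):

```python
import numpy as np

def mse(outputs, targets):
    # E(X) = 1/(2N) * sum_i (o_i - y_i)^2
    # The factor of 1/2 simplifies the derivative later.
    outputs = np.asarray(outputs)
    targets = np.asarray(targets)
    return np.mean((outputs - targets) ** 2) / 2.0

print(mse([0.8, 0.2, 0.6], [1.0, 0.0, 1.0]))   # 0.04
print(mse([1.0, 0.0], [1.0, 0.0]))             # 0.0 when all outputs match
```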
$E(X)$ is typically minimized using gradient descent, meaning the perceptron adjusts $w$ and $b$ in the direction of the negative gradient of the error function. Gradient descent works for any error function, not just the mean squared error. This iterative process reduces the value of the error function until it converges on a value, usually a local minimum. The values of $w$ and $b$ are typically set randomly and then updated using gradient descent. If the random initializations of $w$ and $b$ are denoted $w_0$ and $b_0$, respectively, then gradient descent updates $w$ and $b$ according to the equations
$$w_{t+1} = w_t - \alpha \frac{\partial E}{\partial w_t}, \qquad b_{t+1} = b_t - \alpha \frac{\partial E}{\partial b_t},$$
where $w_t$ and $b_t$ are the values of $w$ and $b$ after the $t^\text{th}$ iteration of gradient descent, and $\frac{\partial E}{\partial w_t}$ is the partial derivative of $E$ with respect to $w_t$. $\alpha$ is known as the learning rate, which controls the step size gradient descent takes each iteration, typically chosen to be a small value, e.g. $\alpha = 0.01$. Values of $\alpha$ that are too large cause learning to be suboptimal (by failing to converge), while values of $\alpha$ that are too small make learning slow (by taking too long to converge).
The weight delta $\Delta w$ and bias delta $\Delta b$ are calculated using the delta rule. The delta rule is a special case of backpropagation for single-layer perceptrons, and is used to calculate the updates (or deltas) of the perceptron parameters. The delta rule can be derived by consistent application of the chain rule and power rule for calculating the partial derivatives $\frac{\partial E}{\partial w}$ and $\frac{\partial E}{\partial b}$. For a perceptron with a mean squared error function and activation function $f$, the delta rules for the weight vector $w$ and bias $b$ are
$$\Delta w = \frac{\alpha}{N} \sum_{i=1}^{N} (y_i - o_i) f'(s_i)\, x_i, \qquad \Delta b = \frac{\alpha}{N} \sum_{i=1}^{N} (y_i - o_i) f'(s_i),$$
where $s_i = w \cdot x_i + b$ and $o_i = f(s_i)$. Note that the presence of $x_i$ on the right side of the delta rule for $\Delta w$ implies that the weight vector's update delta is also a vector, which is expected.
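One gradient-descent step under the delta rule can be sketched as follows, assuming a sigmoid activation (for which $f'(s_i) = o_i(1 - o_i)$); the names and toy data are illustrative:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def delta_rule_step(X, y, w, b, alpha):
    # Delta rule for a sigmoid perceptron with mean squared error:
    #   delta_w = (alpha / N) * sum_i (y_i - o_i) * f'(s_i) * x_i
    #   delta_b = (alpha / N) * sum_i (y_i - o_i) * f'(s_i)
    s = X @ w + b                 # product sums s_i for all N inputs
    o = sigmoid(s)                # outputs o_i = f(s_i)
    fprime = o * (1.0 - o)        # sigmoid derivative: o_i * (1 - o_i)
    err = (y - o) * fprime
    delta_w = alpha / len(y) * (err @ X)
    delta_b = alpha / len(y) * err.sum()
    return w + delta_w, b + delta_b

X = np.array([[0.0, 0.0], [1.0, 1.0]])   # two training inputs
y = np.array([0.0, 1.0])                  # their target outputs
w, b = np.zeros(2), 0.0
w, b = delta_rule_step(X, y, w, b, alpha=0.1)
```

Each call moves $w$ and $b$ a small step in the direction that reduces $E(X)$.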
Training the Perceptron
Thus, given a set of input-output pairs $X = \{(x_1, y_1), \dots, (x_N, y_N)\}$, learning consists of iteratively updating the values of $w$ and $b$ according to the delta rules. This consists of two distinct computational phases:
1. Calculate the forward values. That is, for each input $x_i$, calculate $s_i$ and $o_i$ for all $i = 1, \dots, N$.
2. Calculate the backward values. Using the partial derivatives of the error function, update the weight vector and bias according to gradient descent. Specifically, use the delta rules and the values calculated in the forward phase to calculate the deltas $\Delta w$ and $\Delta b$ for each.
The backward phase is so named because in multi-layer perceptrons there is no simple delta rule, so calculation of the partial derivatives proceeds backwards from the error between the target output and actual output (this is where backpropagation gets its name). In a single-layer perceptron, there is no backward flow of information (there is only one "layer" of parameters), but the naming convention applies all the same.
Once the backward values are computed, they may be used to update the values of the weight vector $w$ and bias $b$. The process repeats until the error function converges. Once the error function has converged, the weight vector and bias can be fixed, and the forward phase used to calculate predicted values $o(x)$ of the true output $y$ for any input $x$. If the perceptron has learned the underlying function mapping inputs to outputs (i.e. not just memorized every pair in $X$), it will even predict the correct values for input-output pairs it was not trained on; this is known as generalization. Ultimately, generalization is the primary goal of supervised learning, since it is desirable and a practical necessity to learn an unknown function based on a small sample of the set of all possible input-output pairs.
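Putting the two phases together, a full training loop can be sketched as below, here learning the linearly separable AND function with a sigmoid perceptron (the function names, learning rate, and iteration count are illustrative choices, not from the text):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_perceptron(X, y, alpha=1.0, iters=10000):
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])   # random initialization w_0
    b = rng.normal()                  # random initialization b_0
    for _ in range(iters):
        o = sigmoid(X @ w + b)               # forward phase: s_i and o_i
        err = (y - o) * o * (1.0 - o)        # backward phase: delta rule
        w += alpha / len(y) * (err @ X)
        b += alpha / len(y) * err.sum()
    return w, b

# Logical AND is linearly separable, so a single perceptron can learn it.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])
w, b = train_perceptron(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(float)
print(preds)   # [0. 0. 0. 1.]
```

Thresholding the sigmoid output at $0.5$ recovers a binary classification that matches the AND truth table.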
It was mentioned earlier that single-layer perceptrons are linear classifiers. That is, they can only learn linearly separable patterns. Linearly separable patterns are datasets or functions that can be separated by a linear boundary (a line or hyperplane). Marvin Minsky and Seymour Papert showed in their seminal 1969 book Perceptrons that it was impossible for a perceptron to learn even simple non-linearly separable functions such as the XOR function. The XOR, or "exclusive or", function is a simple function on two binary inputs and is often found in bit twiddling hacks. The truth table for XOR is below.

x1 | x2 | XOR(x1, x2)
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 0
Notice that, when the four XOR inputs are plotted in the plane, it is impossible to find a linear boundary that separates the inputs labeled $0$ from the inputs labeled $1$. This is because XOR is not a linearly separable function, and by extension, perceptrons cannot learn the XOR function. Similar analogs exist in higher dimensions, i.e. for more than two inputs.
Many other (indeed, most other) functions are not linearly separable, so what is needed is an extension to the perceptron. The obvious extension is to add more layers of units so that there are nonlinear computations in between the input and output. For a long time, it was assumed by many in the field that adding more layers of units would fail to solve the linear separability problem (even though Minsky and Papert knew that such an extension could learn the XOR function), so research in the field of artificial neural networks stagnated for a good decade. Indeed, this assumption turned out to be very wrong, as multi-layer perceptrons, covered in the next section, can learn practically any function of interest.
The multi-layer perceptron (MLP) is an artificial neural network composed of many perceptrons. Unlike single-layer perceptrons, MLPs are capable of learning to compute non-linearly separable functions. Because they can learn nonlinear functions, they are one of the primary machine learning techniques for both regression and classification in supervised learning.
MLPs are usually organized into something called layers. As discussed in the sections on neural networks as graphs and neural networks as layers, the generalized artificial neural network consists of an input layer, some number (possibly zero) of hidden layers, and an output layer. In the case of a single-layer perceptron, there are no hidden layers, so the total number of layers is two. MLPs, on the other hand, have at least one hidden layer, each composed of multiple perceptrons. An example of a feedforward neural network with two hidden layers is below.
It was mentioned in the introduction that feedforward neural networks have the property that information (i.e. computation) flows forward through the network, i.e. there are no loops in the computation graph (it is a directed acyclic graph, or DAG). Another way of saying this is that the layers are connected in such a way that no layer's output depends on itself. In the above figure, this is clear since each layer (other than the input layer) is only connected to the layer directly to its left. Feedforward neural networks, by having DAGs for their computation graphs, have a greatly simplified learning algorithm compared to recurrent neural networks, which have cycles in their dependency graphs.
Generally speaking, if one is given a graph representing a feedforward network, it can always be grouped into layers such that each layer depends only on layers to its left. This can be done by performing a topological sort on the nodes and grouping nodes with the same depth into the same layer, where layer $k$ consists of all nodes in the graph with depth $k$ in the topological sort. Then, arranging layers from left to right, each layer's nodes will only depend on nodes from layers to its left.
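This layer-grouping procedure can be sketched as follows, taking a node's depth to be the length of the longest path reaching it (the function name and example graph are illustrative):

```python
from collections import defaultdict

def group_into_layers(edges, nodes):
    # Depth of a node = length of the longest path reaching it;
    # nodes of equal depth are grouped into the same layer.
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    depth = {}
    def get_depth(n):
        if n not in depth:
            ps = preds[n]
            depth[n] = 0 if not ps else 1 + max(get_depth(p) for p in ps)
        return depth[n]
    layers = defaultdict(list)
    for n in nodes:
        layers[get_depth(n)].append(n)
    return [layers[d] for d in sorted(layers)]

# A small DAG: two input nodes, one hidden node, one output node.
edges = [("x1", "h"), ("x2", "h"), ("h", "o"), ("x1", "o")]
print(group_into_layers(edges, ["x1", "x2", "h", "o"]))
# [['x1', 'x2'], ['h'], ['o']]
```

Note that the skip connection from `x1` to `o` does not change the grouping: `o` still lands in the rightmost layer, and every node depends only on layers to its left.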
The following defines a prototypical MLP with $m + 1$ layers (meaning $m - 1$ hidden layers) that computes a one-dimensional output $o$ on an $n$-dimensional input $x = (x_1, \dots, x_n)$. Layer $0$ is the input layer, layers $1$ through $m - 1$ are the hidden layers, and layer $m$ is the output layer.
- The output perceptron has an activation function $f_m$ and the perceptrons in hidden layer $k$ have activation function $f_k$.
- Every perceptron in layer $k$ is connected to every perceptron in layer $k - 1$; layers are "fully connected." Thus, every perceptron depends on the outputs of all the perceptrons in the previous layer (this is without loss of generality, since the weight connecting two perceptrons can still be zero, which is the same as no connection being present).
- There are no connections between perceptrons in the same layer.
and uses the following notation:
- $w_{ij}^k$: weight for perceptron $j$ in layer $k$ for incoming node $i$ (the output of perceptron $i$ in layer $k - 1$)
- $b_j^k$: bias for perceptron $j$ in layer $k$
- $s_j^k$: product sum plus bias for perceptron $j$ in layer $k$
- $o_j^k$: output for node $j$ in layer $k$
- $r_k$: number of nodes in layer $k$
- $w_j^k$: weight vector for perceptron $j$ in layer $k$, i.e. $w_j^k = \left(w_{1j}^k, \dots, w_{r_{k-1}j}^k\right)$
- $o^k$: output vector for layer $k$, i.e. $o^k = \left(o_1^k, \dots, o_{r_k}^k\right)$
Then, computation of the MLP's output proceeds according to the following steps:
1. Initialize the input layer
Set the values of the outputs $o_i^0$ for nodes in the input layer to their associated inputs in the vector $x$, i.e. $o_i^0 = x_i$ for $i = 1, \dots, n$.
2. Calculate the product sums and outputs of each hidden layer, in order from $k = 1$ to $k = m - 1$
For $k$ from $1$ to $m - 1$:
a. compute $s_j^k = b_j^k + w_j^k \cdot o^{k-1} = b_j^k + \sum_{i=1}^{r_{k-1}} w_{ij}^k\, o_i^{k-1}$ for $j = 1, \dots, r_k$
b. compute $o_j^k = f_k\left(s_j^k\right)$ for $j = 1, \dots, r_k$
3. Compute the output for output layer $m$
Compute $o = o_1^m = f_m\left(s_1^m\right)$, where $s_1^m = b_1^m + \sum_{i=1}^{r_{m-1}} w_{i1}^m\, o_i^{m-1}$.
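The steps above can be sketched in NumPy, with each layer's product sums and outputs computed as vector operations (the weight shapes, sizes, and names here are illustrative assumptions):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def mlp_forward(x, weights, biases):
    # weights[k] has shape (r_{k-1}, r_k); biases[k] has shape (r_k,).
    o = np.asarray(x, dtype=float)       # step 1: o^0 = x
    for W, b in zip(weights, biases):
        s = o @ W + b                    # step 2a: product sums s^k
        o = sigmoid(s)                   # step 2b: outputs o^k = f(s^k)
    return o                             # step 3: the final o is the output

# A 3-input MLP with one hidden layer of 4 perceptrons and 1 output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
biases = [rng.normal(size=4), rng.normal(size=1)]
print(mlp_forward([0.5, -1.0, 2.0], weights, biases))
```

Since the last layer is handled by the same loop, the output step falls out of the final iteration.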
The error function is once again the mean squared error: for a set of $N$ input-output pairs $X = \{(x_1, y_1), \dots, (x_N, y_N)\}$,
$$E(X) = \frac{1}{2N} \sum_{i=1}^{N} \left(o(x_i) - y_i\right)^2,$$
where $o(x_i)$ denotes the output (the result of step 3 in the computation above) of the MLP on input $x_i$. Analogous to the single-layer perceptron, minimizing $E(X)$ with respect to all the weights $w_{ij}^k$ and biases $b_j^k$ will yield a good classification boundary. Using gradient descent to adjust the parameters $w_{ij}^k$ and $b_j^k$ with a learning rate of $\alpha$ yields the following delta equations for each iteration:
$$\Delta w_{ij}^k = -\alpha \frac{\partial E}{\partial w_{ij}^k}, \qquad \Delta b_j^k = -\alpha \frac{\partial E}{\partial b_j^k}.$$
The expansion of the right-hand side of the delta rules can be found using backpropagation, so called because the gradient information flows backwards through the network (i.e. in the direction opposite the output computation flow). This gradient flow originates in the final layer $m$, proportional to the difference between the target output $y_i$ and actual output $o(x_i)$.
Thus, one iteration of training for MLPs consists of two distinct computational phases:
- Calculate the forward values. That is, for each input $x_i$, calculate all $s_j^k$ and $o_j^k$ for every layer $k$.
- Calculate the backward values. Using the partial derivatives of the error function, update the weights and biases according to gradient descent. Specifically, use backpropagation and the values calculated in the forward phase to calculate the deltas $\Delta w_{ij}^k$ and $\Delta b_j^k$ for each.
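One full training iteration (forward phase, then backward phase) for a one-hidden-layer MLP can be sketched as below, using sigmoid activations throughout; the array shapes, learning rate, and XOR example are illustrative assumptions, not from the text:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_step(X, y, W1, b1, W2, b2, alpha):
    N = len(y)
    # Forward phase: compute all product sums and outputs.
    s1 = X @ W1 + b1
    o1 = sigmoid(s1)
    s2 = o1 @ W2 + b2
    o2 = sigmoid(s2)[:, 0]
    # Backward phase: propagate the output error back through the layers.
    d2 = ((o2 - y) * o2 * (1 - o2))[:, None]   # deltas at the output layer
    d1 = (d2 @ W2.T) * o1 * (1 - o1)           # deltas at the hidden layer
    W2 -= alpha / N * (o1.T @ d2)
    b2 -= alpha * d2.mean(axis=0)
    W1 -= alpha / N * (X.T @ d1)
    b1 -= alpha * d1.mean(axis=0)
    return np.mean((o2 - y) ** 2) / 2          # error before this update

# XOR is not linearly separable, but a one-hidden-layer MLP can
# reduce its error on it.
rng = np.random.default_rng(1)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
errors = [train_step(X, y, W1, b1, W2, b2, alpha=1.0) for _ in range(2000)]
print(errors[0], errors[-1])   # the error decreases as training proceeds
```

Repeating these two phases until the error converges is exactly the training procedure described above, applied to every layer at once.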