Perceptron
The perceptron is a machine learning algorithm used to determine whether an input belongs to one class or another. For example, the perceptron algorithm can learn the AND operator: given binary inputs \(x_1\) and \(x_2\), is (\(x_1\) AND \(x_2\)) equal to 0 or 1?
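For reference, the truth table of the AND operator:

| \(x_1\) | \(x_2\) | \(x_1\) AND \(x_2\) |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |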
The perceptron algorithm was one of the first artificial neural networks to be produced and is the building block for one of the most commonly used neural networks, the multilayer perceptron.
Properties
The perceptron algorithm is frequently used in supervised learning, a machine learning setting in which the algorithm is trained on labeled data. This contrasts with unsupervised learning, in which the training data is unlabeled. Specifically, the perceptron algorithm focuses on binary classified data: objects that are members of one of exactly two classes. Additionally, it supports online learning, which simply means that it processes elements in the training dataset one at a time (which can be useful for large datasets).
Furthermore, the perceptron algorithm is a type of linear classifier, which classifies data points by using a linear combination of the variables used. As seen in the graph above, a linear classifier uses lines (e.g. \(H_1\), \(H_2\), or \(H_3\)) to classify data points: any object on one side of the line is part of one class and any object on the other side is part of the other class. In this example, a successful linear classifier could use \(H_1\) or \(H_2\) to discriminate between the two classes, whereas \(H_3\) would be a poor decision boundary.
An interesting consequence of the perceptron's properties is that it is unable to learn an XOR function! As we see above, OR and AND functions are linearly separable, which means that there exists a line that can separate all data points of one class from all data points of the other. However, the XOR function is not linearly separable, and therefore the perceptron algorithm (a linear classifier) cannot successfully learn the concept. This is a principal reason why the perceptron algorithm by itself is not used for complex machine learning tasks, but is rather a building block for a neural network that can handle linearly inseparable classifications.
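To make this concrete, suppose some line did compute XOR under the decision rule made precise in the next section (output 1 exactly when \(w_1 x_1 + w_2 x_2 + b > 0\)). The four input-output pairs of XOR would impose the constraints
\[
b \le 0, \qquad w_2 + b > 0, \qquad w_1 + b > 0, \qquad w_1 + w_2 + b \le 0.
\]
Adding the middle two inequalities gives \(w_1 + w_2 + 2b > 0\), hence \(w_1 + w_2 + b > -b \ge 0\), contradicting the last constraint. No such line exists.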
Definition
The perceptron is an algorithm used to produce a binary classifier. That is, the algorithm takes binary classified input data, along with their class membership, and outputs a line that attempts to separate data of one class from data of the other: data points on one side of the line are of one class and data points on the other side are of the other.
Specifically, given an input with \(k\) variables \(x_1, x_2, \ldots, x_k\), a line is a linear combination of these variables: \(w_1 x_1 + w_2 x_2 + \cdots + w_k x_k + b = 0\), where \(w_1, w_2, \ldots, w_k\) and \(b\) are constants. Note that this can also be written as \(\boldsymbol{w} \cdot \boldsymbol{x} + b = 0\), where \(\cdot\) is the dot product between the two vectors \(\boldsymbol{w}\) and \(\boldsymbol{x}\).
The perceptron algorithm returns values of \(w_1, w_2, \ldots, w_k\) and \(b\) such that data points on one side of the line are of one class and data points on the other side are of the other. Mathematically, the values of \(\boldsymbol{w}\) and \(b\) are used by the binary classifier in the following way: if \(\boldsymbol{w} \cdot \boldsymbol{x} + b > 0\), the classifier returns 1; otherwise, it returns 0. Note that 1 represents membership of one class and 0 represents membership of the other. This can be seen more clearly with the AND operator, whose truth table is replicated below for convenience.
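| \(x_1\) | \(x_2\) | \(x_1\) AND \(x_2\) |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

As a minimal sketch of the decision rule in Python (the values \(\boldsymbol{w} = [1, 1]\) and \(b = -1.5\) are one choice that happens to realize AND, used here purely for illustration; the algorithm itself finds such values):

```python
def classify(w, x, b):
    """Return 1 if x lies on the positive side of the line w . x + b = 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# One choice of weights and bias that realizes the AND operator.
w, b = [1, 1], -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, classify(w, x, b))  # prints 0, 0, 0, 1
```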
So what do \(\boldsymbol{w}\) and \(b\) stand for? \(\boldsymbol{w}\) represents the weights of the \(k\) variables. Simply put, a variable's weight determines how steep the line is relative to that variable. A weight is needed for every variable; otherwise, the line would be flat relative to that variable, which may prevent the line from successfully classifying the data. Furthermore, \(b\) represents the bias of the data. Without a bias, the line would be forced to pass through the origin \((0, 0)\); the bias shifts the line up or down to better classify the data.
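In fact, the AND operator itself shows why the bias is needed. A line through the origin classifies via \(w_1 x_1 + w_2 x_2 > 0\), and the AND outputs impose incompatible constraints:
\[
(0,1) \mapsto 0 \implies w_2 \le 0, \qquad (1,0) \mapsto 0 \implies w_1 \le 0, \qquad (1,1) \mapsto 1 \implies w_1 + w_2 > 0.
\]
No weights satisfy all three, so a nonzero bias is required to learn AND.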
Supervised Learning
The perceptron algorithm learns to separate data by changing its weights and bias over time, where time is measured as the number of iterations the algorithm has run. As such, \(\boldsymbol{w}(t)\) represents the value of the weights at time \(t\) and \(b(t)\) represents the value of the bias at time \(t\).
Additionally, \(\alpha\) represents the learning rate, that is, how quickly the algorithm responds to errors. This value is bounded by \(0 < \alpha \le 1\): \(\alpha\) cannot be 0, as this would mean that no learning occurs, while a large \(\alpha\) gives the algorithm a tendency to oscillate around the solution, as illustrated later.
To better elucidate these concepts, the formal steps of the perceptron algorithm are detailed below. In the following, \(d_i\) represents the correct output value for input \(\boldsymbol{x_i}\): \(d_i = 1\) if \(\boldsymbol{x_i}\) is a member of the designated class and \(d_i = 0\) otherwise.
- Begin by setting \(\boldsymbol{w}(0) = \boldsymbol{0}\), \(b(0) = 0\), and \(t = 0\).
- For each input \(\boldsymbol{x_i}\), determine whether \(\boldsymbol{w}(t) \cdot \boldsymbol{x_i} + b(t) > 0\). Let \(y_i\) be the output for input \(\boldsymbol{x_i}\) (1 if the inequality holds, 0 otherwise).
- The weights and bias are now updated for the next iteration of the algorithm: \(\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \alpha(d_i - y_i)\boldsymbol{x_i}\) and \(b(t+1) = b(t) + \alpha(d_i - y_i)\), applied for all inputs \(\boldsymbol{x_i}\); a single update of this form is sketched in code after this list.
- If the learning is offline (i.e. the inputs can be scanned multiple times), steps 2 and 3 can be repeated until the errors are minimized. Note: \(t\) is incremented on every iteration.
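As a minimal sketch of steps 2 and 3 in Python, here is one online update for a single labeled input (the function name and calling convention are illustrative, not part of any standard API):

```python
def perceptron_step(w, b, x, d, alpha):
    """Apply one perceptron update for a single labeled input.

    w, x  -- lists of k numbers (weights and input variables)
    b     -- current bias
    d     -- correct class for x (0 or 1)
    alpha -- learning rate, with 0 < alpha <= 1
    """
    # Step 2: classify x with the current line.
    y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
    # Step 3: move the line toward x if the classification was wrong.
    error = d - y
    w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    b = b + alpha * error
    return w, b
```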
An example is as follows:
Suppose we are attempting to learn the AND operator for the following input-class pairs \(\big((x_1, x_2), d_i\big):\) \(\big((0, 0), 0\big), \big((0, 1), 0\big), \big((1, 0), 0\big),\) and \(\big((1, 1), 1\big).\) Let us use a learning rate of \(\alpha=0.5\) and run through the algorithm until we can classify all four points correctly.
Starting from \(\boldsymbol{w}(0) = [0, 0]\) and \(b(0) = 0\):

| Iteration | \(\boldsymbol{w}(t)\), \(b(t)\) | \(y\) | \(\boldsymbol{w}(t+1)\), \(b(t+1)\) |
| --- | --- | --- | --- |
| 1 | \(w(0) = [0, 0]\), \(b(0) = 0\) | [0, 0, 0, 0] | \(w(1) = [0.5, 0.5]\), \(b(1) = 0.5\) |
| 2 | \(w(1) = [0.5, 0.5]\), \(b(1) = 0.5\) | [1, 1, 1, 1] | \(w(2) = [0, 0]\), \(b(2) = -1\) |
| 3 | \(w(2) = [0, 0]\), \(b(2) = -1\) | [0, 0, 0, 0] | \(w(3) = [0.5, 0.5]\), \(b(3) = -0.5\) |
| 4 | \(w(3) = [0.5, 0.5]\), \(b(3) = -0.5\) | [0, 0, 0, 1] | SUCCESS! |
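For instance, the update in the first row comes entirely from the point \((1, 1)\), the only misclassified input \(\big(d_i - y_i = 1 - 0 = 1\) there, and \(d_i - y_i = 0\) for the other three points\(\big)\):
\[
\boldsymbol{w}(1) = [0, 0] + 0.5\,(1 - 0)\,[1, 1] = [0.5, 0.5], \qquad b(1) = 0 + 0.5\,(1 - 0) = 0.5.
\]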
In the previous example, the perceptron algorithm converges to a correct classifier fairly quickly. One reason this occurs is the well-chosen learning rate \(\alpha\). With a smaller \(\alpha\), the algorithm would take more iterations to finish, whereas a larger \(\alpha\) could result in the algorithm oscillating forever.
Implementation
An implementation of the perceptron algorithm in Python is sketched below.
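This is a minimal sketch, assuming the batch-style update used in the worked example above (all inputs are classified with the current line, then every error contributes an update); the function name train_perceptron and its calling convention are illustrative:

```python
def train_perceptron(inputs, targets, alpha=0.5, max_iters=100):
    """Learn w and b such that (w . x + b > 0) matches each target.

    inputs  -- list of k-tuples (the x_i)
    targets -- list of 0/1 class labels (the d_i)
    alpha   -- learning rate, with 0 < alpha <= 1
    """
    k = len(inputs[0])
    w = [0.0] * k  # w(0)
    b = 0.0        # b(0)

    for _ in range(max_iters):
        # Step 2: classify every input with the current line.
        y = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
             for x in inputs]
        if y == targets:  # every point is classified correctly
            return w, b
        # Step 3: update the weights and bias using the errors on all inputs.
        for x, d, yi in zip(inputs, targets, y):
            error = d - yi
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
            b += alpha * error
    return w, b  # may still misclassify some points (e.g. XOR)


# Learn the AND operator, as in the worked example above.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
w, b = train_perceptron(inputs, targets)
print(w, b)  # [0.5, 0.5] -0.5
```

Running this on the AND data reproduces the trace in the table above, returning \(\boldsymbol{w} = [0.5, 0.5]\) and \(b = -0.5\).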
Summary
The perceptron algorithm is one of the most commonly used machine learning algorithms for binary classification. Machine learning tasks that have used the perceptron include determining gender, classifying patients as low or high risk for a disease, and virus detection. Essentially, any task that involves separating objects into two linearly separable groups is a candidate for the perceptron. Furthermore, the multilayer perceptron builds on the perceptron algorithm to distinguish classes that are not linearly separable, which increases the number of tasks in which the perceptron can be used!
Overall, the perceptron algorithm (and the ideas behind it) is one of the main building blocks of neural networks, and understanding it is crucial for the development of more complex networks.
References
- cyc. Graphic showing 3 hyperplanes in 2D: H3 doesn't separate the 2 classes; H1 does, with a small margin, and H2 with the maximum margin. Retrieved May 26, 2016, from https://en.wikipedia.org/wiki/Linear_classifier#/media/File:Svm_separating_hyperplanes.png