Perceptron

The perceptron is a machine learning algorithm used to determine whether an input belongs to one class or another. For example, the perceptron algorithm can determine the AND operator—given binary inputs $x_1$ and $x_2$, is ($x_1$ AND $x_2$) equal to 0 or 1?

$The AND operation between two numbers. A red dot represents one class $(x_1$ AND $x_2 = 0)$ and a blue dot represents the other class $(x_1$ AND $x_2 = 1).$ The line is the result of the perceptron algorithm, which separates all data points of one class from those of the other.$ The AND operation between two numbers. A red dot represents one class $(x_1$ AND $x_2 = 0)$ and a blue dot represents the other class $(x_1$ AND $x_2 = 1).$ The line is the result of the perceptron algorithm, which separates all data points of one class from those of the other.

The perceptron algorithm was one of the first artificial neural networks to be produced and is the building block for one of the most commonly used neural networks, the multilayer perceptron.

Properties

The perceptron algorithm is frequently used in supervised learning, which is a machine learning task that has the advantage of being trained on labeled data. This is contrasted with unsupervised learning, which is trained on unlabeled data. Specifically, the perceptron algorithm focuses on binary classified data, objects that are either members of one class or another. Additionally, it allows for online learning, which simply means that it processes elements in the training dataset one at a time (which can be useful for large datasets).

An example of binary classified data and decision boundaries used by classifiers ^[1]

Furthermore, the perceptron algorithm is a type of linear classifier, which classifies data points by using a linear combination of the variables used. As seen in the graph above, a linear classifier uses lines $\big($e.g. $H_1, H_2$, or $H_3\big)$ to classify data points—any object on one side of the line is part of one class and any object on the other side is part of the other class. In this example, a successful linear classifier could use $H_1$ or $H_2$ to discriminate between the two classes, whereas $H_3$ would be a poor decision boundary.

An interesting consequence of the perceptron's properties is that it is unable to learn an XOR function! As we see above, OR and AND functions are linearly separable, which means that there exists a line that can separate all data points of one class from all data points of the other. However, the XOR function is not linearly separable, and therefore the perceptron algorithm (a linear classifier) cannot successfully learn the concept. This is a principal reason why the perceptron algorithm by itself is not used for complex machine learning tasks, but is rather a building block for a neural network that can handle linearly inseparable classifications.

Definition

The perceptron is an algorithm used to produce a binary classifier. That is, the algorithm takes binary classified input data, along with their class membership, and outputs a line that attempts to separate data of one class from data of the other: data points on one side of the line are of one class and data points on the other side are of the other.

Specifically, given an input with $k$ variables $x_1, x_2, ..., x_k$, a line is a linear combination of these variables: $w_1 x_1 + w_2 x_2 + \cdots + w_k x_k + b = 0$, where $w_0, w_1, ..., w_k$ and $b$ are constants. Note that this can also be written as $\boldsymbol{w} \cdot \boldsymbol{x} + b = 0$, where $\text{}\cdot\text{}$ is the dot product between the two vectors $\boldsymbol{w}$ and $\boldsymbol{x}$.

The perceptron algorithm returns values of $w_0, w_1, ..., w_k$ and $b$ such that data points on one side of the line are of one class and data points on the other side are of the other. Mathematically, the values of $\boldsymbol{w}$ and $b$ are used by the binary classifier in the following way: If $\boldsymbol{w} \cdot \boldsymbol{x} + b > 0$, the classifier returns 1; otherwise, it returns 0. Note that 1 represents membership of one class and 0 represents membership of the other. This can be seen more clearly with the AND operator, replicated below for convenience.

$The AND operation between two numbers: A red dot represents one class $(x_1$ AND $x_2 = 0)$ and a blue dot represents the other class $(x_1$ AND $x_2 = 1).$ The line is the result of the perceptron algorithm, which separates all data points of one class from those of the other.$ The AND operation between two numbers: A red dot represents one class $(x_1$ AND $x_2 = 0)$ and a blue dot represents the other class $(x_1$ AND $x_2 = 1).$ The line is the result of the perceptron algorithm, which separates all data points of one class from those of the other.

So what do $\boldsymbol{w}$ and $b$ stand for? $\boldsymbol{w}$ represents the weights of the $k$ variables. Simply, a variable's weight determines how steep the line is relative to that variable. A weight is needed for every variable; otherwise, the line would be flat relative to that variable, which may prevent the line from successfully classifying the data. Furthermore, $b$ represents the bias of the data. Essentially, this prevents the line from being dependent on the origin $($the point (0,0)$)$—the bias shifts the line up or down to better classify the data.

Supervised Learning

The perceptron algorithm learns to separate data by changing weights and bias over time, where time is denoted as the number of times the algorithm has been run. As such, $\boldsymbol{w(t)}$ represents the value of the weights at time $t$ and $b(t)$ represents the value of the bias at time $t$.

Additionally, $\alpha$ represents the learning rate, that is, how quickly the algorithm responds to changes. This value has the bound $0 < \alpha \le 1 $. $\alpha$ cannot be 0, as this would mean that no learning occurs. If $\alpha$ is a large value, the algorithm has a propensity of oscillating around the solution, as illustrated later.

To better elucidate these concepts, the formal steps of the perceptron algorithm are detailed below. In the following, $d_i$ represents the correct output value for input $x_i$; one class is given $d_i = 1$ if $x_i$ is a member of that class and $d_i = 0$ otherwise.

Begin by setting $\boldsymbol{w(0)}, b(0), t = 0$.
For each input $\boldsymbol{x_i}$, determine whether $\boldsymbol{w(t)} \cdot \boldsymbol{x_i} + b > 0$. Let $y_i$ be the output for input $\boldsymbol{x_i}$ (1 if true, 0 if false).
The weights and bias are now updated for the next iteration of the algorithm: $\boldsymbol{w(t+1)} = \boldsymbol{w(t)} + \alpha(d_i - y_i)\boldsymbol{x_i}$ and $b(t+1) = b(t) + \alpha(d_i - y_i)$ for all inputs.
If the learning is offline (if the inputs can be scanned multiple times), steps 2 and 3 can be repeated until errors are minimized. Note: $t$ is incremented on every iteration.

An example is as follows:

Suppose we are attempting to learn the AND operator for the following input-class pairs $\big((x_1, x_2), d_i\big):$ $\big((0, 0), 0\big), \big((0, 1), 0\big), \big((1, 0), 0\big),$ and $\big((1, 1), 1\big).$ Let us use a learning rate of $\alpha=0.5$ and run through the algorithm until we can classify all four points correctly.

w(0) = [0, 0], b(0) = 0

1 w(0) = [0, 0], b(0) = 0 y = [0, 0, 0, 0] w(1) = [0.5, 0.5], b(1) = 0.5

2 w(1) = [0.5, 0.5], b(1) = 0.5 y = [1, 1, 1, 1] w(2) = [0, 0]; b(2) = -1

3 w(2) = [0, 0], b(2) = -1 y = [0, 0, 0, 0] w(3) = [0.5, 0.5], b(3) = -0.5

4 w(3) = [0.5, 0.5], b(3) = -0.5 y = [0, 0, 0, 1] SUCCESS!

The perceptron algorithm over time. The green line represents the result of the perceptron algorithm after the second iteration and the black line represents the final results of the perceptron algorithm (after iteration 4).

In the previous example, the perceptron algorithm terminates to the correct value fairly quickly. One reason this occurs is due to a well-chosen learning rate ($\alpha$). With a smaller $\alpha$, the algorithm would take more iterations to finish, whereas a larger $\alpha$ could result in the algorithm oscillating forever.

Implementation

An implementation of the perceptron algorithm is provided below (in Python):

# Example of AND operator, as described above
alpha = 0.5
input_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights = [0, 0]
bias = 0

# Begin algorithm
def perceptron():
    # Repeat until we minimize error
    while True:
        # Start with the weights from t-1
        new_weights = [i for i in weights]
        new_bias = bias

        # For each input data point
        for input_datum in input_data: 
            # Add bias (intercept) to line
            comparison = bias
            list_of_vars = input_datum[0]

            # For each variable, compute the value of the line
            for index in range(len(list_of_vars)):
                comparison += weights[index] * list_of_vars[index]

            # Obtain the correct classification and the classification of the algorithm
            correct_value = input_datum[1]
            classified_value = int(comparison > 0)

            # If the values are different, add an error to the weights and the bias
            if classified_value != correct_value:
                for index in range(len(list_of_vars)):
                    new_weights[index] += alpha * (correct_value - classified_value) * list_of_vars[index]
                bias += alpha * (correct_value - classified_value)

        # If there is no change in weights or bias, return
        if new_weights == weights and new_bias == bias:
            return (new_weights, bias)

Summary

The perceptron algorithm is one of the most commonly used machine learning algorithms for binary classification. Some machine learning tasks that use the perceptron include determining gender, low vs. high risk for diseases, and virus detection. Basically, any task that involves classification into two groups can use the perceptron! Furthermore, the multilayer perceptron uses the perceptron algorithm to distinguish classes that are not linearly separable, which increases the number of tasks in which the perceptron can be used!

Overall, the perceptron algorithm (and the ideas behind it) is one of the main building blocks of neural networks, and its understanding is crucial for the development of more complex networks.

References

cyc, . Graphic showing 3 Hyperplanes in 2D. H3 doesn't separate the 2 classes. H1 does, with a small margin and H2 with the maximum margin.. Retrieved May 26, 2016, from https://en.wikipedia.org/wiki/Linear_classifier#/media/File:Svm_separating_hyperplanes.png

1	w(0) = [0, 0], b(0) = 0	y = [0, 0, 0, 0]	w(1) = [0.5, 0.5], b(1) = 0.5
2	w(1) = [0.5, 0.5], b(1) = 0.5	y = [1, 1, 1, 1]	w(2) = [0, 0]; b(2) = -1
3	w(2) = [0, 0], b(2) = -1	y = [0, 0, 0, 0]	w(3) = [0.5, 0.5], b(3) = -0.5
4	w(3) = [0.5, 0.5], b(3) = -0.5	y = [0, 0, 0, 1]	SUCCESS!

Contents