K-nearest Neighbors
k-nearest neighbors (or k-NN for short) is a simple machine learning algorithm that categorizes an input by using its k nearest neighbors.
For example, suppose a k-NN algorithm was given an input of data points of specific men and women's weight and height, as plotted below. To determine the gender of an unknown input (green point), k-NN can look at the nearest k neighbors (suppose $k=3$) and will determine that the input's gender is male. This method is a very simple and logical way of marking unknown inputs, with a high rate of success.
k-NN is used in a variety of machine learning tasks; for example, in computer vision, k-NN can help identify handwritten letters and in gene expression analysis, the algorithm is used to determine which genes contribute to a certain characteristic. Overall, k-nearest neighbors provides a combination of simplicity and effectiveness that makes it an attractive algorithm to use for many machine learning tasks.
Properties
k-nearest neighbors, as described above, has various properties that differentiate it from other machine learning algorithms. First, k-NN is non-parametric, which means that it does not make any assumptions about the probability distribution of the input. This is useful for applications with input properties that are unknown and therefore makes k-NN more robust than algorithms that are parametric. The contrast is that parametric machine learning algorithms tend to produce fewer errors than non-parametric ones, since taking input probabilities into account can influence decision making.
Furthermore, k-NN is a type of lazy learning, which is a learning method that generalizes data in the testing phase, rather than during the training phase. This is contrasted with eager learning, which generalizes data in the training phase rather than the testing phase. A benefit of lazy learning is that it can quickly adapt to changes, since it is not expecting a certain generalized dataset. However, a major downside is that a huge amount of computation occurs during testing (actual use) rather than pre-computation during training.
Classification and Regression
k-nearest neighbors can be used in classification or regression machine learning tasks. Classification involves placing input points into appropriate categories whereas regression involves establishing a relationship between input points and the rest of the data. In either of these cases, determining a neighbor can be performed using many different notions of distance, with the most common being Euclidean and Hamming distance. Euclidean distance is the most popular notion of distance--the length of a straight line between two points. Hamming distance is the same concept, but for strings distance is calculated as the number of positions where two strings differ. Furthermore, for certain multivariable tasks, distances must be normalized (or weighted) to accurately represent the correlation between variables and their strength of correlation.
For k-NN classification, an input is classified by a majority vote of its neighbors. That is, the algorithm obtains the class membership of its k neighbors and outputs the class that represents a majority of the k neighbors.
k-NN regression works in a similar manner. The value returned is the average value of the input's k neighbors.
Suppose we are trying to classify the green circle. Let us begin with $k=3$ (the solid line). In this case, the algorithm would return a red triangle, since it constitutes a majority of the 3 neighbors. Likewise, with $k=5$ (the dotted line), the algorithm would return a blue square.
If no majority is reached with the k neighbors, many courses of action can be taken. For example, one could use a plurality system or even use a different algorithm to determine the membership of that data point.
Suppose we have data points from a sine wave above (with some variance, of course) and our task is to produce a y value for a given x value. When given an input data point to classify, k-NN would return the average y value of the input's k neighbors. For example, if k-NN were asked to return the corresponding y value for $x=0$, the algorithm would find the k nearest points to $x=0$ and return the average y value corresponding to these k points. This algorithm would be simple, but very successful for most x values.
Parameter Selection
Parameter selection is performed for most machine learning algorithms, including k-NN. To determine the number of neighbors to consider when running the algorithm (k), a common method involves choosing the optimal k for a validation set (the one that reduces the percentage of errors) and using that for the test set. In general, a higher k reduces noise and localized anomalies but allows for more error near decision boundaries, and vice versa.
Pros and Cons
k-NN is one of many algorithms used in machine learning tasks, in fields such as computer vision and gene expression analysis. So why use k-NN over other algorithms? The following is a list of pros and cons k-NN has over alternatives.
Pros:
- Very easy to understand and implement. A k-NN implementation does not require much code and can be a quick and simple way to begin machine learning datasets.
- Does not assume any probability distributions on the input data. This can come in handy for inputs where the probability distribution is unknown and is therefore robust.
- Can quickly respond to changes in input. k-NN employs lazy learning, which generalizes during testing--this allows it to change during real-time use.
Cons:
- Sensitive to localized data. Since k-NN gets all of its information from the input's neighbors, localized anomalies affect outcomes significantly, rather than for an algorithm that uses a generalized view of the data.
- Computation time. Lazy learning requires that most of k-NN's computation be done during testing, rather than during training. This can be an issue for large datasets.
- Normalization. If one type of category occurs much more than another, classifying an input will be more biased towards that one category (since it is more likely to be neighbors with the input). This can be mitigated by applying a lower weight to more common categories and a higher weight to less common categories; however, this can still cause errors near decision boundaries.
- Dimensions. In the case of many dimensions, inputs can commonly be "close" to many data points. This reduces the effectiveness of k-NN, since the algorithm relies on a correlation between closeness and similarity. One workaround for this issue is dimension reduction, which reduces the number of working variable dimensions (but can lose variable trends in the process).
References
- Ajanki, A. Example of k-nearest neighbour classificationnb. Retrieved May 28, 2016, from https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm#/media/File:KnnClassification.svg