kMeans Clustering
Kmeans clustering is a traditional, simple machine learning algorithm that is trained on a test data set and then able to classify a new data set using a prime, \(k\) number of clusters defined a priori.
An objective of machine learning is to derive techniques for unsupervised learning on data. This kind of data analysis is very helpful in many applications that require classification of data, such as identifying cancerous cells within a large sample, clustering words with similar definitions for better search engine accuracy, identifying outliers in student’s academic performance for better refinement of habits, or even for detecting landmines in a battlefield ^{[2]}.
Building a Baseball Team Using Classification
Imagine a high school baseball coach wants to use data analysis to predict whether potential new recruits will be spectacular, mediocre, or dismal.
He has data on the players currently on his team: position, years of experience, batting average, onbase percentage, and number of stolen bases per games. He also has some data on the players he is considering recruiting. How might he go about selecting a new player that fills any gaps of skill currently on his team?
Contents
Kmeans algorithm
This clustering algorithm separates data into the best suited group based on the information the algorithm already has. Data is separated in \(k\) different clusters, which are usually chosen to be far enough apart from each other spatially, in Eucledian Distance, to be able to produce effective data mining results. Each cluster has a center, called the centroid, and a data point is clustered into a certain cluster based on how close the features are to the centroid.
Kmeans algorithm iteratively minimizes the distances between every data point and its centroid in order to find the most optimal solution for all the data points.
 \(k\) random points of the data set are chosen to be centroids.
 Distances between every data point and the \(k\) centroids are calculated and stored.
 Based on distance calculates, each point is assigned to the nearest cluster
 New cluster centroid positions are updated: similar to finding a mean in the point locations
 If the centroid locations changed, the process repeats from step 2, until the calculated new center stays the same, which signals that the clusters' members and centroids are now set.
Finding the minimal distances between all the points implies that data points have been separated in order to form the most compact clusters possible, with the least variance within them. In other words, no other iteration could have a lower average distance between the centroids and the data points found within them.
Technical Analysis
The Kmeans algorithm defined above aims at minimizing an objective function, which in this case is the squared error function.
The objective function for the Kmeans clustering algorithm is the squared error function:
\(J = \sum_{i=1}^{k} \sum_{j=1}^{n} (x_ i – v_j)^2 = 1 \)
where,
\(x_i – v_j \) is the Eucledian distance between a point, \(x_i\), and a centroid, \(v_j\), iterated over all k points in the \(i^{th}\) cluster, for all n clusters.
^{[5]}
In simpler terms, the objective function attempts to pick centroids that minimize the distance to all points belonging to its respective cluster so that the centroids are more symbolic of the surrounding cluster of data points.
Pseudocode
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 

Why use the squared error function rather than, say, the absolute error? The reason is because the squared error has nicer mathematical properties than absolute error. For further reference on these properties, check out this blog post by Ben Khun.
Complexity
Although the algorithm seems quite simple, finding the optimal solution to the problem for observations in either d dimensions or for k clusters is NPHard. However, if k and d are fixed, the problem can be solved in time \(O(n^{dk+1}log(n))\) using Lloyd’s Algorithm, a common kclustering algorithm, where n is the number of entities to be clustered.^{[6]}. However, the running time for this algorithm on nicely clustered data can be quite small, as minimization of the objective function will occur quickly.
Because the algorithm computes the distances between each of the k cluster centers and their respective data points every single iteration, a few iterations are enough to make further adjustments not worth the time complexity tradeoff. In other words, because further iterations won’t change the assignments of the majority of data points but their distances still need to be calculated, the algorithm becomes inefficient if convergence is the goal. For this reason, several variations of Lloyd’s kmeans clustering algorithm have been developed in order to speed up the process at later stages; where these variations include using the triangleinequality, amongst others.
Example
Before applying kmeans clustering to a data set, data has to go from characteristics of an object to numerical data that can be analyzed.
Categorizing Baseball Players
How would the baseball coach use kmeans clustering to predict whether new recruits will be good? Each trait of a player can be represented as a feature vector. By converting characteristics into numbers in a feature vector, the players become comparable in a vector space so that their differences can be better quantified.
The coach has the same type of information on both current players and new potential ones. Using kmeans clustering on the entire set of data points, he can figure out which of his current, knownlevel (remarkable, mediocre, dismal) players are closest to the new ones, and which of the new players would fill the most voids within his team.
When to Use KMeans Clustering
KMeans clustering is a fast, robust, and simple algorithm that gives reliable results when data sets are distinct or well separated from each other in a linear fashion. It is best used when the number of cluster centers, is specified due to a welldefined list of types shown in the data. However, it is important to keep in mind that KMeans clustering may not perform well if it contains heavily overlapping data, if the Euclidean distance does not measure the underlying factors well, or if the data is noisy or full of outliers ^{[7]}.
References
 Brickley, D. 1k_overview. Retrieved from https://www.flickr.com/photos/danbri/6233990550/in/photolistauSQcG87Yxj7EmVCpsptEDsu7m1WFX5EFMsei4x1vegRfqk81wUmZa4D73j87Yxiytty5TDbAhHzE5tMAuK7MfnED7rbggn7rfbmmrq7j4fa1hTYEgxCMpH57XUAWa6Nx348hAC1DounzPddybf7LfCTKB3dybeZddy5Mutdybfdbow6s3XwcCF9scqjRDSrrXXNDdybfjmfCBeqBdy5MGxfCTM13nouGAk71tzae4xwxcCk2tddi8kfyKH8hABZB7y4a3You3ZstftDTEuosZ261oun8p5cSNeMy4eJiRW
 Naik, A. Clustering Algorithm Applications. Retrieved from https://sites.google.com/site/dataclusteringalgorithms/clusteringalgorithmapplications
 Petersen, J. Kmeans. Retrieved from http://pypr.sourceforge.net/kmeans.html#kmeansexample
 None, N. kmeans clustering. Retrieved May 02, 2016, from https://en.wikipedia.org/wiki/Kmeans_clustering
 Matteucci, M. KMeans Clustering. Retrieved June 14, 2016, from http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
 Inaba, M. (1994). Applications of weighted Voronoi diagrams and randomization to variancebased kclustering. ACM Symposium on Computational Geometry, 10th, 332339.
 Naik, a. kMeans Clustering Algorithm. Retrieved June 14,2016, from https://sites.google.com/site/dataclusteringalgorithms/kmeansclusteringalgorithm