If you’ve ever taken a class in statistics before, linear regression is probably a familiar concept. This is no coincidence. At its core, machine learning is about taking in information and expanding on it, so it's natural that techniques from statistics play an important role in machine learning.
It is possible to use statistical techniques to find a best-fit line by first calculating five values about our data. If we represent our data set as a collection of points $(x, y)$ on a scatter plot, these values are the means of $x$ and $y$, the standard deviations of $x$ and $y$, and the correlation coefficient.
If there are $n$ data points, then the mean of $x$ is simply the sum of all $x$ values divided by $n$. Correspondingly, the mean of $y$ is the sum of all $y$ values divided by $n$.
After calculating the means, typically denoted as $\bar{x}$ and $\bar{y}$, we can find the (sample) standard deviations for the data set through the following formulae:
$$\sigma_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \sigma_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
The standard deviation of a data set gives a good idea of how close a typical data point will be to the mean. A low standard deviation means that data points tend to cluster around the mean, while a large standard deviation usually means that they will be more spread out.
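As a rough illustration of the mean and sample standard deviation, here is a minimal Python sketch. The function names and the two small data sets are made up for this example, and the standard deviation is assumed to use the $n - 1$ divisor implied by the word "sample".

```python
import math


def mean(values):
    """Sum of all values divided by how many there are."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation, dividing by n - 1 (an assumption here)."""
    m = mean(values)
    squared_deviations = sum((v - m) ** 2 for v in values)
    return math.sqrt(squared_deviations / (len(values) - 1))


# Hypothetical data: both sets have the same mean but different spread.
tight = [9, 10, 11]
spread = [0, 10, 20]

print(mean(tight), sample_std(tight))
print(mean(spread), sample_std(spread))
```

Both sets share the same mean, but the values in the second set sit much farther from it, so its standard deviation is larger.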
Out of the following sets of numbers, which will have the highest standard deviation? Use your intuition!
Say that we are studying the correlation between voltage and a light bulb's brightness. Amazingly, we always know exactly what voltage we are using, but we haven’t been so lucky when measuring bulb brightness. Here are our options:
If we always run the light bulb with the same voltage and measure the brightness ten times, which of these devices will collect data with the highest standard deviation, all else being equal? Assume that the light bulb is perfectly consistent and not a source of any fluctuations.
After we have $\sigma_x$ and $\sigma_y$, we only have the correlation coefficient, usually denoted by $r$, left to calculate. This is tedious to calculate by hand, but the process for doing so is actually quite simple.
The first step is to convert $x$ and $y$ to standard units. For $x$, we must put each value $x_i$ through the formula $x_i' = \frac{x_i - \bar{x}}{\sigma_x}$, which outputs the number of standard deviations $x_i$ is above the mean. We will call the updated value for $x_i$, $x_i'$. Each $y$ value should be put through the analogous process with $\bar{y}$ and $\sigma_y$.
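Now that we’ve changed our units, we can find our correlation coefficient by taking the average of the product of the updated values $x_i'$ and $y_i'$ for each of our points. Because the standard deviations are sample standard deviations, this average divides by $n - 1$ rather than $n$.
This is given by
$$r = \frac{1}{n-1}\sum_{i=1}^{n} x_i' \, y_i'$$
The formula for the correlation coefficient may seem inscrutable, but it’s actually quite easy to interpret. Its value ranges from $-1$ to $1$, and it indicates how linearly correlated the data is.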
If $r$ is close to zero, then the data is barely correlated at all, at least not with a linear relationship. However, if $r$ is close to $1$, then the data is correlated and can be approximated well by a best-fit line with a positive slope. Conversely, if $r$ is close to $-1$, the data is correlated and can be approximated well by a best-fit line with a negative slope.
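The two steps above (convert to standard units, then average the products) could be sketched in Python roughly as follows. The helper names and the sample data are hypothetical, and the sketch assumes sample standard deviations, so the final average divides by $n - 1$.

```python
import math


def standard_units(values):
    """Convert each value to the number of standard deviations above the mean."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]


def correlation(xs, ys):
    """Average (with an n - 1 divisor) of the products of standardized values."""
    zx = standard_units(xs)
    zy = standard_units(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical data sets for illustration only.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # perfectly linear, positive slope
falling = [10, 8, 6, 4, 2]   # perfectly linear, negative slope
noisy = [2, 1, 4, 3, 5]      # roughly increasing, not perfectly linear

print(correlation(xs, rising))
print(correlation(xs, falling))
print(correlation(xs, noisy))
```

The perfectly linear sets land at the two ends of the range, while the noisier set gives a value strictly between them.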
Which of the following values would be closest to the correlation coefficient of the graph below?
Now, with all this information, we can finally calculate a best-fit line. In standard units, this is simply the line $y' = r x'$, where $r$ is the correlation coefficient. A little algebra will show that in the original units, this translates to the equation
$$y - \bar{y} = r \, \frac{\sigma_y}{\sigma_x} \, (x - \bar{x})$$
In other words, this version of the best-fit line has a slope of $r \frac{\sigma_y}{\sigma_x}$ and must pass through the point $(\bar{x}, \bar{y})$.
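The "little algebra" can be spelled out as a short derivation. The notation is an assumption made for this sketch: $x'$ and $y'$ are the values in standard units, $\bar{x}$ and $\bar{y}$ are the means, and $\sigma_x$ and $\sigma_y$ are the standard deviations.

$$
\begin{aligned}
y' &= r\,x' \\
\frac{y - \bar{y}}{\sigma_y} &= r\,\frac{x - \bar{x}}{\sigma_x} \\
y - \bar{y} &= r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}) \\
y &= \bar{y} + r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})
\end{aligned}
$$

Setting $x = \bar{x}$ in the last line gives $y = \bar{y}$, which is why the line passes through the point of means.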
Although this line cannot go through every point in the data set, it does do a good job of representing them as a whole. It usually gives a good estimate for the expected value of $y$ at a given value of $x$, at least if the relationship between the two is somewhat linear.
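Putting the five values together, a best-fit line could be computed along these lines. This is only a sketch under the same assumptions as the text (sample standard deviations, slope equal to the correlation coefficient times the ratio of standard deviations); the function name and the data are invented for illustration.

```python
import math


def best_fit_line(xs, ys):
    """Return (slope, intercept) built from means, standard deviations, and r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 divisor, an assumption of this sketch).
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation coefficient: averaged product of the values in standard units.
    r = sum(((x - x_bar) / sd_x) * ((y - y_bar) / sd_y)
            for x, y in zip(xs, ys)) / (n - 1)
    slope = r * sd_y / sd_x
    # The line must pass through the point of means (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

slope, intercept = best_fit_line(xs, ys)
estimate_at_6 = slope * 6 + intercept  # estimated y for a new x value
print(slope, intercept, estimate_at_6)
```

The returned slope and intercept describe the same line as the point-slope form, just rearranged so that an estimate for any $x$ is a single multiplication and addition.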
Alfred has worked hard analyzing his data, but he needs help with the last step. He has calculated that and Which of the following equations gives the best-fit line he needs?