# Machine Learning

Advanced quantitative techniques to analyze data where humans fall short.

Least squares linear regression is probably the most well-known type of regression, but there are many other variants which can minimize the problems associated with it.

A common one is known as ridge regression. This method is very similar to least squares regression but modifies the error function slightly.

Previously, we used the sum of squared errors (SSE) of the regression line as a measure of error, but in ridge regression we also seek to keep the squared values of the coefficients small. This gives the error function \[\small \text{Error} = \sum_{i=1}^{n} (y_i - m_1x_{1i} - m_2x_{2i} - \dots - m_px_{pi} - b)^2 + \lambda \sum_{i=1}^{p} m_i^2.\] Here, the value of \(\lambda\) controls how aggressively the coefficients are dampened. Notice that this error function does not penalize the size of the \(y\)-intercept \(b\).

It is not at all obvious why lasso, which penalizes the absolute values of the coefficients rather than their squares, would behave so differently from ridge regression, but there is an interesting geometric reason for the difference. To demonstrate this, we must first change the way we view both techniques.

In ridge regression, it turns out that for any value of \(\lambda\) we pick, it’s possible to find a value for \(\lambda_2\) such that minimizing \[\sum_{i=1}^{n} (y_i - m_1x_{1i} - m_2x_{2i} - \dots - m_px_{pi} - b)^2 + \lambda \sum_{i=1}^{p} m_i^2\] is equivalent to minimizing the SSE subject to \[\sum_{i=1}^{p} m_i^2 \leq \lambda_2.\] Similarly, for any value of \(\lambda\) there is some value of \(\lambda_2\) such that using lasso is equivalent to minimizing the SSE subject to \[\sum_{i=1}^{p} |m_i| \leq \lambda_2.\] A useful way to view the SSE when there are two predictor variables is shown in the pictures below. Here, the \(x\)- and \(y\)-axes represent the values of the two coefficients in a best-fit plane, and the ellipses show all pairs of coefficients which produce a certain value of the SSE for a data set. As the SSE increases, the ellipses get larger.

Also in the pictures are two areas. The diamond represents the coefficient values allowed by lasso. The disk represents the possible coefficient values in ridge regression. Viewing these pictures, which form of linear regression will most likely lead to coefficients of zero?

A group of scientists wants to analyze bacterial growth in Petri dishes. They have done a dozen tests, and each time they have recorded every single detail of the environment. The pH levels of the dishes, sugar content of the food, and even the light levels in the room have been recorded. A total of fourteen variables have been taken into account.

In a classic example of overzealous testing, a rogue scientist has added another variable to the mix: his average mood on a scale from zero to ten. When a best-fit equation is generated with this variable included, how will the SSE most likely change? How will the average error on new data change?

The previous question is a good example of a case where ridge regression or lasso would be very useful. Each of these techniques will penalize a best-fit line for having large coefficients, so they are likely to produce equations that make minimal use of predictor variables that have little sway over the result. Because the equation has a limited “budget,” it can only afford to give large weights to variables which are important.

In the case shown previously, this means that something with as little predictive power as the scientist’s mood on the day of the test will be largely ignored. In fact, if lasso is used, the variable will probably be ignored entirely.
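To make the contrast concrete, here is a minimal sketch for the simplest possible case: a single predictor and no intercept, which is a simplification of the setting above. It assumes the usual lasso penalty \(\lambda \sum |m_i|\). In this one-variable case both minimizers have closed forms, so no fitting library is needed. The function names and the data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - m*x)^2) + lam * m^2 (one predictor, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_slope(xs, ys, lam):
    """Minimizer of sum((y - m*x)^2) + lam * |m| (one predictor, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft-thresholding: the slope is pulled toward zero and clipped at zero.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


# Hypothetical weak predictor, standing in for a variable with little sway.
xs = [1.0, -1.0, 2.0, -2.0]
ys = [0.5, 0.0, 0.5, 0.0]

for lam in (0.0, 1.0, 4.0):
    print(lam, ridge_slope(xs, ys, lam), lasso_slope(xs, ys, lam))
```

In this sketch the ridge slope shrinks as \(\lambda\) grows but never reaches zero, while the lasso slope becomes exactly zero once \(\lambda\) is large enough.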

Of course, linear regression is just one of many techniques. One with comparable simplicity is known as K-nearest neighbors regression.

To use K-nearest neighbors regression, or KNN regression for short, we must start with a data set. Say we have an arbitrary point \(\vec{x}\) which holds values for all of our predictor variables. We want to estimate the corresponding value of the resultant variable, using the data set and \(\vec{x}\).

Now, we plot the predictor variables for the points in our data set, ignoring the resultants, and pick out the \(k\) points geometrically closest to \(\vec{x}\). The estimate KNN regression provides is simply the average of the resultant values for these points.
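The procedure just described can be sketched in a few lines of Python. This version assumes that "geometrically closest" means smallest Euclidean distance; the function name `knn_regress` and the sample points are hypothetical.

```python
import math


def knn_regress(points, targets, query, k):
    """Average the targets of the k points nearest to query (Euclidean distance)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Pair each distance with its resultant value, then sort by distance.
    by_distance = sorted(
        ((math.dist(p, query), t) for p, t in zip(points, targets)),
        key=lambda pair: pair[0],
    )
    nearest = by_distance[:k]
    return sum(t for _, t in nearest) / len(nearest)


# Hypothetical data set: two predictor variables and one resultant per point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
targets = [1.0, 2.0, 3.0, 10.0]

print(knn_regress(points, targets, query=(0.1, 0.1), k=3))
```

Sorting every point by distance is the simplest approach, not the fastest; it is used here only because it mirrors the description step by step.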

One useful property of KNN regression is that it makes very few assumptions about the data sets it builds on. Unlike linear regression, which assumes linear relationships, KNN regression can accommodate nearly any shape of relationship.

Additionally, by adjusting the value of \(k,\) we can change the flexibility of KNN regression. If we want to account for even the smallest trends in our data set, we can pick a very small \(k\)-value. On the other hand, larger values of \(k\) will eliminate smaller deviations in favor of larger trends.
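A small one-dimensional sketch shows this trade-off. The data below are made up, with one deliberately unusual value at \(x = 3\), and the helper `knn_1d` is a hypothetical name used only here.

```python
def knn_1d(xs, ys, query, k):
    """KNN regression with one predictor: average y over the k nearest x-values."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical data: roughly y = x, except for a spike at x = 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 9.0, 4.0, 5.0]

for k in (1, 3, 5):
    print(k, knn_1d(xs, ys, query=3.0, k=k))
```

With \(k = 1\) the estimate at \(x = 3\) follows the spike exactly; with \(k = 5\) every point is averaged and the estimate is simply the mean of all the resultant values.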

Below, we have three data sets (A, B, and C) represented by either tables or scatter plots. We want to reorder them and analyze the first one with K-nearest neighbors regression, the second with lasso, and the third with ordinary least squares linear regression. Which line-up will give the best results?

Data Set A: \[\begin{array}{c|c|c|c|c} x_1 & x_2 & x_3 & x_4 & y \\ \hline 5 & 8 & 97 & 2 & 3 \\ \hline 2 & 7 & 0 & 2 & 4\\ \hline 2 & 6 & 4 & 12 & 14 \\ \hline 15 & 6 & -20 & 5 & 6 \\ \hline 4 & 8 & 2 & 6 & 5\\ \end{array}\]

Data Set B:

Data Set C:
