# Linear Regression

One of the benefits of least squares regression is that it is easy to generalize from its use on scatter plots to 3D or even higher dimensional data.

Previously, we learned that when least squares regression is used on two-dimensional data, the SSE is given by the formula \[SSE = \sum_{i=1}^{n} (y_i - mx_i - b)^2.\] This gives us a good idea of what a higher dimensional error function will look like.

If there are \(p\) predictor variables \(\{x_1, x_2,\ldots, x_p\}\) and one response variable \(y,\) then a linear equation which outputs \(y\) will take the form \[y = m_1x_1+m_2x_2+\cdots+m_px_p+b.\]

**Given this information, what is a reasonable formula for the error when there is more than one predictor variable?**

Remember, the squared error of a single point is the squared difference between the \(y\)-value and the predicted \(y\)-value at that point. The SSE for the best-fit function is the sum of the squared errors for each point.
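One reasonable answer, following the pattern of the two-dimensional formula, is sketched below. The double-subscript notation \(x_{ij}\) (the value of the \(j^\text{th}\) predictor at the \(i^\text{th}\) data point) is an assumption introduced here so that each point can carry several predictor values: \[SSE = \sum_{i=1}^{n} \left(y_i - m_1x_{i1} - m_2x_{i2} - \cdots - m_px_{ip} - b\right)^2.\] When \(p = 1,\) this reduces to the familiar \(\sum_{i=1}^{n} (y_i - mx_i - b)^2.\)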

Earlier, we derived a formula for a best-fit line. Now, we will attempt to modify this formula so that it works for higher dimensional linear regression. Instead of outputting a best-fit line, this formula will now output a best-fit hyperplane, which is described by a linear equation in higher dimensions.

In the last chapter, we started our derivation by representing our best-fit equation with a vector. We can do so again with \[\vec{x} = \begin{bmatrix} m_1 \\ m_2 \\ \vdots \\ m_p \\ b \\ \end{bmatrix}.\] Note that \(\vec{x}\) holds the \(p\) unknown coefficients and the intercept, not the predictor values. Now, we must create a matrix \(A\) which, when multiplied with \(\vec{x},\) outputs a vector containing the predicted value of \(y\) for each data point in the set.

Previously, we did this by making \(A\)’s first column the \(x\)-values of all data points and its second column a column of ones. Now, we can achieve the same results for higher dimensions by adding another column to \(A\) for each additional predictor variable. This is shown below for a data set with \(n\) points and \(p\) predictor variables, where \(x_{ij}\) is the value of the \(j^\text{th}\) predictor at the \(i^\text{th}\) data point: \[A = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} & 1 \\ x_{21} & x_{22} & \cdots & x_{2p} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} & 1 \\ \end{bmatrix}. \] Once again, we will also initialize the vector \(\vec{b}\) to contain the \(y\)-values of every data point.

As it turns out, from this point on the derivation is exactly the same as before. We have to find the vector \(\vec{x}\) for which \(A\vec{x}\) is as close as possible to \(\vec{b},\) and once again we can do this by solving the equation \[A^T\vec{b} = A^TA\vec{x}.\] After that, we have our answer. The elements of \(\vec{x}\) will give the coefficient values for the best-fit hyperplane.
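The procedure above can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than a definitive implementation: the function name `fit_hyperplane` and the small data set are hypothetical, and the sketch assumes \(A^TA\) is invertible.

```python
import numpy as np

def fit_hyperplane(X, y):
    """Return [m_1, ..., m_p, b] by solving A^T A x = A^T b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # One column per predictor, plus a final column of ones for the intercept b.
    A = np.column_stack([X, np.ones(len(X))])
    # Solve the normal equations (assumes A^T A is invertible).
    return np.linalg.solve(A.T @ A, A.T @ y)

# Hypothetical data generated exactly from y = 2*x1 - x2 + 3,
# so the fit should recover those coefficients.
X = np.array([[1, 2], [2, 0], [3, 1], [0, 4], [5, 5]])
y = 2 * X[:, 0] - X[:, 1] + 3

coeffs = fit_hyperplane(X, y)
print(coeffs)  # m_1, m_2, b
```

In practice, `np.linalg.lstsq(A, y, rcond=None)` solves the same least squares problem and tends to be more numerically robust than forming \(A^TA\) explicitly.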

Alfred is back and this time he’s remembered there are multiple types of trees. He’s managed to compile a table of the seeds he planted each spring as well as the number of new sprouts each fall. Using this information, identify the matrix \(A\) which he needs to create in the process of calculating a best-fit linear equation.

\[\begin{array}{c|c|c} \text{Oak Seeds} & \text{Maple Seeds} & \text{New Growths} \\ \hline 10&5&9 \\ \hline 4 & 8&7\\ \hline 4 & 3& 5 \\ \hline 6 & 2&4\\ \end{array}\]

At this point, we can find a best-fit hyperplane for any data set, as long as there are more data points than predictors and the columns of \(A\) are linearly independent (so that \(A^TA\) can be inverted). But there’s one major problem. What if the points in a data set are very predictable, but not in a linear fashion?

As it turns out, there is a simple way to expand on our previous model. We can just add new, nonlinear terms to our function and update the rest of our math accordingly.

Generally, this is done by adding powers of the predictor variables, in which case this process is known as polynomial regression.

For instance, say we have a simple data set in which there is one predictor variable \(x\) and one response variable \(y\). The only twist is that we suspect \(y\) to be best represented by a second degree polynomial of \(x\).

Instead of representing the data with a best-fit line \[y = mx + b,\] we should now represent it with a best-fit polynomial \[y = m_1x^2+m_2x+b.\] In many ways, this is the same as creating another predictor variable. We have taken each point in our data set and added another value, \(x^2.\) After this step, we can calculate the coefficients as we normally would in higher dimensional linear regression.
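To make the idea concrete, here is a minimal sketch of a second degree fit using NumPy. The function name `fit_quadratic` and the sample points are hypothetical. The only change from ordinary linear regression is the extra \(x^2\) column in \(A\).

```python
import numpy as np

def fit_quadratic(x, y):
    """Return (m_1, m_2, b) for the best-fit curve y = m_1 x^2 + m_2 x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Treat x^2 as a second predictor: the columns are x^2, x, and ones.
    A = np.column_stack([x**2, x, np.ones_like(x)])
    # Same normal equations as before: A^T A x = A^T b.
    return np.linalg.solve(A.T @ A, A.T @ y)

# Hypothetical points lying exactly on y = 0.5 x^2 - 2x + 1.
x = np.array([-2, -1, 0, 1, 2, 3])
y = 0.5 * x**2 - 2 * x + 1

m1, m2, b = fit_quadratic(x, y)
print(m1, m2, b)
```

The model is still linear in the unknowns \(m_1, m_2,\) and \(b,\) which is why the same solution method applies even though the curve itself is not a line.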

Franklin is in the business of building toy race cars and is analyzing the relationship between the weight and top speed of a car when all else is held equal. So far he’s managed to collect just five data points, but he’s convinced that the relationship should be modeled with a cubic polynomial.

Given the table below, which matrix \(A\) must he construct in the process of calculating the best-fit curve? \[\begin{array}{c|c} x & y \\ \hline 5&30 \\ \hline 4 & 26\\ \hline 6 & 20 \\ \hline 3 & 18\\ \hline 7 & 15 \end{array}\]

1: \( A = \begin{bmatrix} 155 & 1 \\ 84 & 1\\ 258 & 1\\ 39 & 1 \\ 399 & 1 \end{bmatrix} \hspace{1cm} \) 2: \( A = \begin{bmatrix} 5 & 5 & 5 & 1 \\ 4 & 4 & 4 & 1\\ 6 & 6 & 6 & 1\\ 3 & 3 & 3 & 1 \\ 7 & 7 & 7 & 1 \end{bmatrix} \)

3: \( A = \begin{bmatrix} 125 & 25 & 5 & 1 \\ 64 & 16 & 4 & 1\\ 216 & 36 & 6 & 1\\ 27 & 9 & 3 & 1 \\ 343 & 49 & 7 & 1 \end{bmatrix} \)
