# Least Squares

In most real-world cases, systems of linear equations won’t have a solution because they are overdetermined, i.e. there are more equations than variables. This happens most often in modeling, such as when one tries to find a relationship between two variables (say, height and weight). It is very rare for such variables to have a perfectly linear (or quadratic, or exponential, etc.) relationship, but there is often a “close”-to-linear relationship. This is especially important in machine learning, where a perfect fit is often actually undesirable because it may not generalize well to new data.

Throughout this chapter, we’ll thus explore finding approximate solutions to the matrix equation $$Ax = b$$.

First of all, we need to formally define what we mean by “approximate” solutions. We want $$Ax$$ to be close to $$b$$, meaning we want $$Ax - b$$ to be “small” (ideally 0). Thus a natural choice is to minimize the squared loss $R(x) = \lVert Ax - b \rVert^2.$ Let’s start with a small example. Suppose $A = \begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 3\end{pmatrix}, b = \begin{pmatrix}5 \\ 0 \\ 0\end{pmatrix}.$ Obviously we can’t satisfy all three equations. If we tried to satisfy just the first two, we’d get $$x_1 = \begin{pmatrix}10\\-5\end{pmatrix}$$. Alternatively, we could try $$x_2 = \begin{pmatrix}7 \\ -3\end{pmatrix}$$. What are $$R(x_1)$$ and $$R(x_2)?$$
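Candidate solutions like these can be checked numerically by evaluating the loss directly. Here is a minimal NumPy sketch; the `squared_loss` helper is just for illustration:

```python
import numpy as np

# The small example from the text: an overdetermined 3x2 system.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([5.0, 0.0, 0.0])

def squared_loss(x):
    """R(x) = ||Ax - b||^2, the squared loss of a candidate x."""
    r = A @ x - b
    return float(r @ r)

loss1 = squared_loss(np.array([10.0, -5.0]))  # R(x_1)
loss2 = squared_loss(np.array([7.0, -3.0]))   # R(x_2)
```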

Let’s keep looking at our example $A = \begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 3\end{pmatrix}, b = \begin{pmatrix}5 \\ 0 \\ 0\end{pmatrix}.$ Suppose we’ll use $$\begin{pmatrix}c \\ -2 \end{pmatrix}$$ as our solution, for some $$c$$. Which $$c$$ minimizes the squared loss?

Now let’s generalize this. Again we are looking at $A = \begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 3\end{pmatrix}, b = \begin{pmatrix}5 \\ 0 \\ 0\end{pmatrix}.$ Suppose we’ll use $$\begin{pmatrix}c_1 \\ c_2\end{pmatrix}$$ as our solution, for some $$c_1, c_2$$. Which $$c_1$$ minimizes the squared loss, in terms of $$c_2?$$

Now let’s extend this further. Again our setup is $A = \begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 3\end{pmatrix}, b = \begin{pmatrix}5 \\ 0 \\ 0\end{pmatrix},$ and we use the solution $$\begin{pmatrix}c_1\\c_2\end{pmatrix}$$. Which $$c_2$$ minimizes the squared loss, in terms of $$c_1?$$

We’ve now found two equations for the least squares solution $$\begin{pmatrix}c_1\\c_2\end{pmatrix}$$: \begin{align} c_1 &= -2c_2 + \frac{5}{3}\\ c_2 &= -\frac{3}{7}c_1 + \frac{5}{14}. \end{align} What is the resulting solution?
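Substituting one equation into the other is a short computation by hand, but the pair can also be solved numerically by first rewriting it in standard form (a sketch using NumPy):

```python
import numpy as np

# The two conditions c1 = -2*c2 + 5/3 and c2 = -(3/7)*c1 + 5/14,
# rewritten with both unknowns on the left:
#   c1       + 2*c2 = 5/3
#   (3/7)*c1 +   c2 = 5/14
M = np.array([[1.0,     2.0],
              [3.0/7.0, 1.0]])
v = np.array([5.0/3.0, 5.0/14.0])

c = np.linalg.solve(M, v)  # the unique (c1, c2) satisfying both
```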

It turns out that the system we derived is exactly $A^TA\begin{pmatrix}c_1\\c_2\end{pmatrix} = A^Tb.$ Notice that we got the system $\begin{pmatrix}3&6\\6&14\end{pmatrix}\begin{pmatrix}c_1\\c_2\end{pmatrix} = \begin{pmatrix}5\\5\end{pmatrix},$ so if we can just show that $$A^{T} A = \begin{pmatrix}3&6\\6&14\end{pmatrix}$$ and $$A^Tb = \begin{pmatrix}5\\5\end{pmatrix},$$ we'd be finished. Consider the columns of $$A$$ $\begin{pmatrix}1\\1\\1\end{pmatrix}, \begin{pmatrix}1\\2\\3\end{pmatrix}$ and note that \begin{align} \begin{pmatrix}1\\1\\1\end{pmatrix} \cdot \begin{pmatrix}1\\1\\1\end{pmatrix} &= 3, \begin{pmatrix}1\\1\\1\end{pmatrix} \cdot \begin{pmatrix}1\\2\\3\end{pmatrix} = 6\\ \begin{pmatrix}1\\2\\3\end{pmatrix} \cdot \begin{pmatrix}1\\1\\1\end{pmatrix} &= 6, \begin{pmatrix}1\\2\\3\end{pmatrix} \cdot \begin{pmatrix}1\\2\\3\end{pmatrix} = 14, \end{align} which are precisely the entries of $$A^TA$$! Similarly, dotting each column with $$b$$ gives $\begin{pmatrix}1\\1\\1\end{pmatrix} \cdot \begin{pmatrix}5\\0\\0\end{pmatrix} = 5, \begin{pmatrix}1\\2\\3\end{pmatrix} \cdot \begin{pmatrix}5\\0\\0\end{pmatrix} = 5,$ so $$A^Tb = \begin{pmatrix}5\\5\end{pmatrix}.$$ Putting everything together, we have $A^TA\begin{pmatrix}c_1\\c_2\end{pmatrix} = A^Tb.$
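These dot products are exactly what gets computed when forming $$A^TA$$ and $$A^Tb$$ directly, which makes for a quick numerical check (a NumPy sketch):

```python
import numpy as np

A = np.array([[1, 1],
              [1, 2],
              [1, 3]])
b = np.array([5, 0, 0])

# Entry (i, j) of A^T A is the dot product of column i of A
# with column j of A; A^T b dots each column with b.
AtA = A.T @ A
Atb = A.T @ b
print(AtA)  # [[ 3  6]
            #  [ 6 14]]
print(Atb)  # [5 5]
```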

We’ve now seen how to get the “best” solution to a system of linear equations given by $$Ax = b$$: solve $$A^TAx = A^Tb$$. This finds its main application in modeling data.
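In code, the recipe “solve $$A^TAx = A^Tb$$” is only a few lines. The sketch below assumes $$A$$ has linearly independent columns, so that $$A^TA$$ is invertible; NumPy's `np.linalg.lstsq` solves the same minimization with a more numerically robust method:

```python
import numpy as np

def least_squares(A, b):
    """Solve the normal equations A^T A x = A^T b.

    Assumes the columns of A are linearly independent,
    so that A^T A is invertible.
    """
    return np.linalg.solve(A.T @ A, A.T @ b)

# The running example from this quiz:
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([5.0, 0.0, 0.0])

x = least_squares(A, b)

# lstsq minimizes ||Ax - b|| directly (and handles
# rank-deficient A, unlike the normal-equations approach):
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```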

For instance, suppose we have some height to weight data in the form of $$n$$ ordered pairs $$(h_i, w_i)$$, and want to model this as a line (so we want to write weight as a linear function of height). The first step is to formulate this as a least-squares problem. Which of the following makes the most sense for $$A$$?

Again, suppose we have $$n$$ data points $$(h_i, w_i)$$ that we want to fit to a line. Which of the following makes the most sense for $$b?$$

Now we’ve seen that the $$n$$ data points can be succinctly described by the matrix equation $\begin{pmatrix}1&h_1\\1&h_2\\\vdots&\vdots\\1&h_n\end{pmatrix}\begin{pmatrix}c_1\\c_2\end{pmatrix} = \begin{pmatrix}w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}.$ Let’s look at the case where $$\sum h_i = 0$$, since we can always shift the heights to make this true (by subtracting the average height from them all). Recall that getting the least squares approximation involves solving the matrix equation $A^TA\begin{pmatrix}c_1\\c_2\end{pmatrix} = A^Tb.$ What is $$c_1$$?

Again, in the matrix equation $\begin{pmatrix}1&h_1\\1&h_2\\\vdots&\vdots\\1&h_n\end{pmatrix}\begin{pmatrix}c_1\\c_2\end{pmatrix} = \begin{pmatrix}w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix},$ we’ll assume $$\sum h_i = 0$$ for convenience. Recall that getting the least squares approximation involves solving the matrix equation $A^TA\begin{pmatrix}c_1\\c_2\end{pmatrix} = A^Tb.$ What is $$c_2?$$

In this quiz, we looked at the problem of finding the “best” solution when a perfect solution doesn’t exist. The metric we used was the least squares distance, i.e. finding the $$x$$ that minimizes $$\lVert Ax - b\rVert^2$$ (of course, this is 0 if $$Ax = b$$, so a perfect solution would still be found). As we saw, finding the least squares solution is equivalent to solving the matrix equation $A^TAx = A^Tb.$ This finds its most direct application in fitting data to a line, which is extremely common in the real world. As we saw, this can be solved directly: if we have $$n$$ points $$(x_i, y_i)$$, we first shift so that $$\sum x_i = 0$$, and then the solution to $$y = c_1 + c_2x$$ is $c_1 = \frac{\sum y_i}{n}, \quad c_2 = \frac{\sum x_iy_i}{\sum x_i^2}.$
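The closed-form line fit above can be sketched in a few lines of NumPy (`fit_line` is an illustrative helper, not a library function; the shift by the mean is undone at the end so the intercept refers to the original coordinates):

```python
import numpy as np

def fit_line(x, y):
    """Fit y ~ c1 + c2*x by least squares, using the closed form:
    shift x so its sum is 0, read off the coefficients,
    then undo the shift."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar = x.mean()
    xs = x - xbar                 # now sum(xs) == 0
    c2 = (xs @ y) / (xs @ xs)     # slope: sum(x_i y_i) / sum(x_i^2)
    c1 = y.mean() - c2 * xbar     # intercept, shifted back
    return c1, c2

# Points lying exactly on y = 2 + 3x are recovered exactly:
c1, c2 = fit_line([0, 1, 2, 3], [2, 5, 8, 11])
```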
