Linear regression is a technique used to model the relationships between observed variables. The idea behind simple linear regression is to "fit" the observations of two variables into a linear relationship between them. Graphically, the task is to draw the line that is "best-fitting" or "closest" to the points where and are observations of the two variables which are expected to depend linearly on each other.
Regression is a common process used in many applications of statistics in the real world. There are two main types of applications:
Predictions: After a series of observations of variables, regression analysis gives a statistical model for the relationship between the variables. This model can be used to generate predictions: given two variables and the model can predict values of given future observations of This idea is used to predict variables in countless situations, e.g. the outcome of political elections, the behavior of the stock market, or the performance of a professional athlete.
Correlation: The model given by a regression analysis will often fit some kinds of data better than others. This can be used to analyze correlations between variables and to refine a statistical model to incorporate further inputs: if the model describes certain subsets of the data points very well, but is a poor predictor for other data points, it can be instructive to examine the differences between the different types of data points for a possible explanation. This type of application is common in scientific tests, e.g. of the effects of a proposed drug on the patients in a controlled study.
Although many measures of best fit are possible, for most applications the best-fitting line is found using the method of least squares. That is, viewing as a linear function of the method finds the linear function which minimizes the sum of the squares of the errors in the approximations of the by
Here is an example to illustrate the process.
Find the best-fitting line for the data points
To find the line of best fit through these five points, the goal is to minimize the sum of the squares of the differences between the -coordinates and the predicted -coordinates based on the line and the -coordinates. This is This is a quadratic polynomial in and and is minimized by taking partial derivatives with respect to using the chain rule, and setting them equal to This gives which reduces to which has the unique solution and So the best-fitting line is
Note that it was not necessary to use a line to model the data. For instance, a quadratic curve with the five points plugged into the expression for the sum of the squares of the errors, and with partial derivatives as in the example, would give three equations in the three unknowns As long as there are "enough" points, the resulting equations will have a unique solution; see below for a more rigorous discussion of the linear algebra involved in general.
The above derivation can be carried out in general: given points the slope and intercept of the least-squares line satisfy the equations
where are given by the following formulas:
In the example above, these are
Solving for and leads to the equations
for the best-fit line
The choice of quantity to minimize when finding a best-fit line is by no means unique. The sum of the errors, or the sum of the absolute values of the errors, often seems more natural. Why is least squares the standard?
One reason is that the equations involved in solving for the best-fit line are straightforward, as can be seen in the above example. Equations involving absolute value functions are more difficult to work with than polynomial equations. Another qualitative reason is that it is generally preferred to penalize a single large error rather than many "medium-sized" errors. But this does not necessarily explain why the exponent is preferred to, say, or
The most convincing justification of least squares is the following result due to Gauss:
Suppose We measure values and compute errors If these errors (e.g. errors made in measurement) are independent and normally distributed, then consider, for any possible linear function the probability of getting the measurements if were the correct model. The least-squares line is the line for which is maximized.
That is, the least-squares line gives the model that is most likely to be correct, under natural assumptions about sampling errors.
The following theorem generalizes the least squares process and shows how to find the best-fitting line using matrix algebra:
Suppose is an matrix, where Suppose is an column vector. The vector that minimizes equals as long as has rank
This is a standard theorem of linear algebra. The idea is to split as a sum of a vector in the column space of and a vector perpendicular to the column space of which is a vector in the null space of Then is solvable, and is the projection of onto the column space of so it is the vector that minimizes the distance to as desired. Now so the result folows.
The previous example can be rewritten in matrix language: we seek a least-squares approximation to the equation This equation has no solutions (since no line goes through all five points), but the least squares solution is given by multiplying both sides by and solving which is the same system of equations we got by taking partial derivatives, and leads again to the unique solution and
- Sewaqu, . Linear regression. Retrieved November 5, 2010, from https://en.wikipedia.org/wiki/Least_squares#/media/File:Linear_regression.svg