A multivariate normal distribution is the distribution of a random vector whose components are normally distributed variables, such that any linear combination of the variables is also normally distributed. It is mostly useful in extending the central limit theorem to multiple variables, but it also has applications to Bayesian inference and thus machine learning, where the multivariate normal distribution is used to approximate the distribution of features in the data; for instance, in detecting faces in pictures.
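Equivalently, a multivariate normal distribution can be viewed as a linear transformation of a collection of independent standard normal random variables: a random vector $\mathbf{X} = (X_1, \ldots, X_n)$ is multivariate normal if, for some random vector $\mathbf{Z}$ whose components are independent standard normal random variables, there exist a matrix $A$ and a vector $\boldsymbol{\mu}$ such that
$$\mathbf{X} = A\mathbf{Z} + \boldsymbol{\mu}.$$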
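The construction $\mathbf{X} = A\mathbf{Z} + \boldsymbol{\mu}$ can be tried out numerically. The following is a minimal sketch using NumPy; the particular matrix `A`, vector `mu`, seed, and sample size are illustrative choices and not part of the definition. It relies on the standard fact that a vector built this way has mean $\boldsymbol{\mu}$ and covariance $AA^T$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Illustrative choices (assumptions, not from the text): a 2x2 matrix A and a shift mu.
A = np.array([[2.0, 0.0],
              [1.0, 0.5]])
mu = np.array([1.0, -1.0])

# Each row of z holds two independent standard normal components.
z = rng.standard_normal(size=(200_000, 2))

# Apply x = A z + mu to every sample (samples are rows, hence the transpose).
x = z @ A.T + mu

sample_mean = x.mean(axis=0)
sample_cov = np.cov(x, rowvar=False)

print("sample mean:", sample_mean)          # should be close to mu
print("sample covariance:\n", sample_cov)   # should be close to A @ A.T
```

With a large number of samples, the printed mean and covariance should land close to `mu` and `A @ A.T`, up to sampling noise.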
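If $\mathbf{X} = (X_1, \ldots, X_n)$ is multivariate normal, it also has a mean vector $\boldsymbol{\mu}$ such that
$$\boldsymbol{\mu} = \big(E[X_1], \ldots, E[X_n]\big),$$
where $E[X_i]$ is the expected value (or mean) of $X_i$. $\mathbf{X}$ also has a covariance matrix $\Sigma$ satisfying
$$\Sigma_{ij} = \operatorname{Cov}(X_i, X_j),$$
where $\operatorname{Cov}(X_i, X_j)$ is the covariance of $X_i$ and $X_j$.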
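Furthermore, the distribution of $\mathbf{X}$ is completely determined by $\boldsymbol{\mu}$ and $\Sigma$, so it is convenient to write
$$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma).$$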
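The p.d.f. of a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and positive-definite covariance matrix $\Sigma$ is given by
$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right),$$
where $n$ is the number of components of $\mathbf{x}$ and $|\Sigma|$ is the determinant of $\Sigma$.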
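In understanding this equation, it is useful to compare it with the p.d.f. of the univariate normal distribution (the normal distribution in one variable) with mean $\mu$ and variance $\sigma^2$, which is given by
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$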
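In this univariate case, the exponent $-\frac{(x-\mu)^2}{2\sigma^2}$ is a quadratic function of $x$, whose graph is a parabola that opens downward due to the negative leading coefficient. In the multivariate case, the exponent $-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})$ is a quadratic form in the vector $\mathbf{x}-\boldsymbol{\mu}$. Since $\Sigma$ is positive definite, so is $\Sigma^{-1}$, and this quadratic form is therefore negative definite: its graph is a "bowl" that opens downward, analogous to the downward-opening parabola of the univariate case.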
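The leading coefficient $\frac{1}{\sqrt{2\pi\sigma^2}}$ in the univariate case does not depend on $x$, and is chosen in such a way that the univariate p.d.f. $f$ integrates to 1:
$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$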
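Similarly, the leading coefficient $\frac{1}{\sqrt{(2\pi)^n |\Sigma|}}$ in the multivariate case does not depend on $\mathbf{x}$, and is chosen in such a way that the multivariate p.d.f. $f_{\mathbf{X}}$ integrates to 1:
$$\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x} = 1.$$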
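It is also worth noting that the multivariate formula reduces to the univariate one in the case $n = 1$, as in this case $\Sigma$ is the $1 \times 1$ matrix $(\sigma^2)$, so that $|\Sigma| = \sigma^2$ and $\Sigma^{-1} = 1/\sigma^2$.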
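The density formula, its reduction to the univariate case, and the normalization property can all be checked numerically. The following is a minimal sketch using NumPy; the helper names `mvn_pdf` and `normal_pdf`, the test point, and the example parameters are illustrative assumptions. The integral is approximated with a plain Riemann sum on a grid, which is only a rough check.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at the point x, for a positive-definite sigma."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_2d(np.asarray(sigma, dtype=float))
    n = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T sigma^{-1} (x - mu), via a linear solve
    # rather than an explicit matrix inverse.
    quad = diff @ np.linalg.solve(sigma, diff)
    norm_const = 1.0 / np.sqrt((2.0 * np.pi) ** n * np.linalg.det(sigma))
    return norm_const * np.exp(-0.5 * quad)

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# n = 1: the 1x1 covariance matrix [[var]] has determinant var,
# so both functions should agree.
p_multi = mvn_pdf([0.3], [1.0], [[2.0]])
p_uni = normal_pdf(0.3, 1.0, 2.0)
print(p_multi, p_uni)

# n = 2: Riemann-sum check that the density integrates to about 1.
mu2 = np.array([0.0, 1.0])
sigma2 = np.array([[1.0, 0.6],
                   [0.6, 2.0]])
grid = np.linspace(-8.0, 8.0, 161)
step = grid[1] - grid[0]
total = sum(mvn_pdf([a, b], mu2, sigma2) for a in grid for b in grid) * step ** 2
print(total)
```

The grid is wide enough that the mass outside it is negligible for these parameters; a distribution with larger variances would need a wider grid.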
One of the main reasons the multivariate normal distribution is important is that it extends the central limit theorem to multiple variables:
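Suppose $\mathbf{X}_1, \mathbf{X}_2, \ldots$ is a sequence of independent, identically distributed random vectors with common mean vector $\boldsymbol{\mu}$ and positive-definite covariance matrix $\Sigma$. Then
$$\frac{1}{\sqrt{m}} \sum_{i=1}^{m} (\mathbf{X}_i - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma) \quad \text{as } m \to \infty,$$
where $\xrightarrow{d}$ denotes convergence in distribution.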
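A small simulation can illustrate the multivariate central limit theorem, in the standard form where the normalized sum $\frac{1}{\sqrt{m}} \sum_{i=1}^{m} (\mathbf{X}_i - \boldsymbol{\mu})$ is approximately $\mathcal{N}(\mathbf{0}, \Sigma)$ for large $m$. The sketch below uses NumPy and deliberately non-normal vectors built from exponential variables; the choice of distribution, the number of terms, and the number of trials are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

num_terms = 500   # vectors per normalized sum
trials = 4000     # number of independent normalized sums

# Non-normal i.i.d. vectors X = (U, U + V), with U, V independent Exponential(1).
u = rng.exponential(1.0, size=(trials, num_terms))
v = rng.exponential(1.0, size=(trials, num_terms))
x = np.stack([u, u + v], axis=-1)           # shape (trials, num_terms, 2)

mu = np.array([1.0, 2.0])                   # E[U] = 1, E[U + V] = 2
sigma = np.array([[1.0, 1.0],
                  [1.0, 2.0]])              # Var(U) = 1, Cov(U, U+V) = 1, Var(U+V) = 2

# Normalized sums (1 / sqrt(num_terms)) * sum_i (X_i - mu), one per trial.
s = (x - mu).sum(axis=1) / np.sqrt(num_terms)

print("mean of normalized sums:", s.mean(axis=0))
print("covariance of normalized sums:\n", np.cov(s, rowvar=False))

# A linear combination of the limit should itself look normal: the fraction
# within one standard deviation should be near 0.683.
c = np.array([1.0, -1.0])
proj = s @ c
sd = np.sqrt(c @ sigma @ c)
frac_within_one_sd = np.mean(np.abs(proj) <= sd)
print("fraction within one s.d.:", frac_within_one_sd)
```

Matching the covariance alone does not show normality, since the normalized sum has covariance $\Sigma$ for every number of terms; the last check, on a linear combination of the components, is what probes the shape of the limiting distribution.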
The multivariate normal distribution is useful in analyzing the relationship between multiple normally distributed variables, and thus is applied heavily in biology and economics, where the relationship between approximately normal variables is of great interest. For instance, one of the earliest uses of the multivariate normal distribution was Galton's analysis of the relationship between a father's height and the height of his eldest son, resolving a question Darwin posed in On the Origin of Species. That work revealed that:
- Both fathers' and sons' heights were normally distributed with mean 68 inches and variance 3 square inches.
- For any given height $h$, the heights of sons whose fathers were of height $h$ were also normally distributed, and in fact their average height was a linear function of $h$.
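In essence, Galton had discovered the conditional distribution of the multivariate normal: if $\mathbf{X}$ is partitioned into $(\mathbf{X}_1, \mathbf{X}_2)$, $\boldsymbol{\mu}$ into $(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$, and $\Sigma$ into $\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, then
$$(\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{a}) \sim \mathcal{N}\left(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{a} - \boldsymbol{\mu}_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$
The main importance of this result is that the conditional distribution of one block of a multivariate normal vector given another is again a normal distribution, whose parameters are functions of $\mathbf{a}$, $\boldsymbol{\mu}$, and $\Sigma$; in particular, the conditional mean is a linear function of $\mathbf{a}$.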
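The conditional distribution can be computed directly from the blocks of the mean vector and covariance matrix, using the standard formulas: conditional mean $\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{a} - \boldsymbol{\mu}_2)$ and conditional covariance $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. The following is a minimal sketch using NumPy. The function name `condition_mvn` is hypothetical, and in the father/son example only the mean 68 and variance 3 come from the text above; the correlation of 0.5 is an assumed value chosen purely for illustration.

```python
import numpy as np

def condition_mvn(mu, sigma, k, a):
    """Mean and covariance of (X1 | X2 = a), where X1 is the first k coordinates."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    a = np.atleast_1d(np.asarray(a, dtype=float))
    mu1, mu2 = mu[:k], mu[k:]
    s11 = sigma[:k, :k]
    s21, s22 = sigma[k:, :k], sigma[k:, k:]
    # gain = Sigma_12 Sigma_22^{-1}; since Sigma_22 is symmetric and
    # Sigma_12 = Sigma_21^T, this equals (Sigma_22^{-1} Sigma_21)^T.
    gain = np.linalg.solve(s22, s21).T
    cond_mean = mu1 + gain @ (a - mu2)
    cond_cov = s11 - gain @ s21
    return cond_mean, cond_cov

# Illustrative father/son setup: X1 = son's height, X2 = father's height.
rho = 0.5  # assumed correlation, not taken from the text
mu = [68.0, 68.0]
sigma = [[3.0, rho * 3.0],
         [rho * 3.0, 3.0]]

for father in (64.0, 68.0, 72.0):
    mean, cov = condition_mvn(mu, sigma, k=1, a=father)
    print(f"father {father}: son mean {mean[0]}, son variance {cov[0, 0]}")
```

Under these assumed numbers the conditional mean moves linearly with the father's height (half of the father's deviation from 68), while the conditional variance is the same for every father and smaller than the unconditional variance of 3.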
In modern times, the multivariate normal distribution plays an important role in machine learning, whose purpose is (very roughly speaking) to categorize input data $\mathbf{x}$ into labels $y$, based on some training pairs $(\mathbf{x}_i, y_i)$. One major approach involves analyzing the class-conditional distribution $P(\mathbf{x} \mid y)$ and approximating it with a multivariate normal distribution, the validity of which can be checked using various normality tests; paradoxically, however, classifying based on multivariate normal distributions has been successful in practice even when the normal distribution is known to be a poor model for the data.
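One way to make this approach concrete is to fit a separate multivariate normal distribution to the inputs of each label and then classify a new input by the label with the largest prior-weighted density (this is the idea behind quadratic discriminant analysis). The following is a minimal sketch using NumPy. The synthetic training data, the function names `fit`, `log_density`, and `predict`, and the two-class setup are all assumptions made for illustration, not a description of any particular library or system.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic training pairs (x_i, y_i); the two class distributions are made up.
x0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=500)
x1 = rng.multivariate_normal([3.0, 3.0], [[1.0, -0.2], [-0.2, 1.5]], size=500)
x_train = np.vstack([x0, x1])
y_train = np.array([0] * 500 + [1] * 500)

def fit(x, y):
    """Estimate a prior, mean vector, and covariance matrix for each label."""
    params = {}
    for label in np.unique(y):
        xs = x[y == label]
        prior = len(xs) / len(x)
        params[int(label)] = (prior, xs.mean(axis=0), np.cov(xs, rowvar=False))
    return params

def log_density(x, mu, sigma):
    """Log of the multivariate normal density at x."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (len(mu) * np.log(2.0 * np.pi) + logdet + quad)

def predict(x, params):
    """Pick the label maximizing log P(y) + log p(x | y)."""
    scores = {label: np.log(prior) + log_density(x, mu, sigma)
              for label, (prior, mu, sigma) in params.items()}
    return max(scores, key=scores.get)

params = fit(x_train, y_train)
print(predict(np.array([0.2, -0.1]), params))  # point near the class-0 mean
print(predict(np.array([2.8, 3.1]), params))   # point near the class-1 mean
```

Working with log-densities avoids numerical underflow when the density values are very small, which becomes more likely as the number of features grows.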