The central limit theorem is a theorem about independent random variables, which says roughly that the probability distribution of the average of independent random variables converges to a normal distribution as the number of observations increases. The somewhat surprising strength of the theorem is that, beyond a few natural conditions such as finite variance, essentially no assumption is made about the probability distribution of the variables themselves; the theorem remains true no matter what the individual probability distributions are.
Let $X_i$ be the random variable obtained by rolling a fair die and recording the number showing on the $i$th roll.
Then $X_i$ is not normally distributed; it has a discrete probability density function, with expected value $\mu = 3.5$ and variance $\sigma^2 = \frac{35}{12}$. The central limit theorem says that the distribution of $\frac{X_1 + X_2 + \cdots + X_n}{n}$ for large $n$ is very close to a normal distribution, with expected value $3.5$ and variance $\frac{35}{12n}$.
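This convergence is easy to see empirically. The sketch below (parameter choices and names are ours, not from the text) simulates many averages of $n = 100$ die rolls and compares the sample mean and variance of those averages to the theoretical values $3.5$ and $\frac{35}{12n}$:

```python
import random
import statistics

random.seed(0)

def die_average(n):
    """Average of n fair die rolls, each uniform on 1..6."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

n = 100          # rolls averaged per trial
trials = 10000   # number of averages collected
averages = [die_average(n) for _ in range(trials)]

mean = statistics.fmean(averages)
var = statistics.variance(averages)

# Theory: mean 3.5, variance (35/12)/n
print(f"sample mean     : {mean:.4f}  (theory: 3.5)")
print(f"sample variance : {var:.5f}  (theory: {35 / 12 / n:.5f})")
```

A histogram of `averages` would also show the familiar bell shape, even though each individual roll is uniformly distributed.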
In plain language: given a population with any distribution with mean $\mu$ and finite variance $\sigma^2$, the sampling distribution of the mean approaches a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$ as the sample size $n$ increases.
The central limit theorem most often applies to a situation in which the variables being averaged have identical probability distribution functions, so the distribution in question is an average measurement over a large number of trials: for example, flipping a coin, rolling a die, or observing the output of a random number generator. There are generalizations of the theorem to other situations, but this wiki will concentrate on the standard applications.
First, the formal statement requires a definition of "converging in distribution," which formalizes the qualitative behavior that the averages get closer and closer to the normal distribution as $n$ increases:
A sequence $Y_1, Y_2, \ldots$ of random variables converges in distribution to a random variable $Y$ if
$$\lim_{n \to \infty} F_{Y_n}(x) = F_Y(x)$$
for any real number $x$ at which the function $F_Y$ is continuous. Here $F_Z(x) = P(Z \le x)$ denotes the cumulative distribution function of a random variable $Z$.
Classical Central Limit Theorem
Let $X_1, X_2, \ldots$ be independent, identically distributed ("i.i.d.") random variables with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2 < \infty$. Let $S_n = \frac{X_1 + X_2 + \cdots + X_n}{n}$. Then the variables $\sqrt{n}\,(S_n - \mu)$ converge in distribution to the normal distribution with mean $0$ and variance $\sigma^2$.
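As a rough numerical check of this statement, the sketch below (our own illustration; the exponential distribution and all parameters are our choices) standardizes averages of exponential variables with mean $1$ and variance $1$, which are strongly skewed and far from normal, and verifies that $\sqrt{n}(S_n - \mu)$ nevertheless behaves like a normal variable with variance $\sigma^2$:

```python
import random
import statistics

random.seed(1)

# Underlying variables: exponential with rate 1 (mean 1, variance 1),
# deliberately far from normal; the CLT applies regardless.
mu, sigma = 1.0, 1.0
n = 250        # summands per trial
trials = 4000  # number of standardized averages collected

zs = []
for _ in range(trials):
    s_n = sum(random.expovariate(1.0) for _ in range(n)) / n  # the average S_n
    zs.append((n ** 0.5) * (s_n - mu))                        # sqrt(n) * (S_n - mu)

within_one_sd = sum(abs(z) <= sigma for z in zs) / trials

print(f"std of sqrt(n)(S_n - mu): {statistics.stdev(zs):.3f}  (theory: {sigma})")
print(f"fraction within one sd  : {within_one_sd:.3f}  (normal: ~0.683)")
```

Swapping in any other finite-variance distribution for `expovariate` leaves the two printed diagnostics essentially unchanged, which is the point of the theorem.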
I roll a fair die $450$ times. Estimate the probability that at least $160$ of the rolls show a $5$ or $6$.
Replace the $X_i$ in the introduction by the indicator variables $Y_i$, where $Y_i = 1$ if the $i$th roll shows a $5$ or $6$ and $Y_i = 0$ otherwise. Then $\mu = E[Y_i] = \frac{1}{3}$ and $\sigma^2 = \frac{1}{3} \cdot \frac{2}{3} = \frac{2}{9}$, so $\sqrt{n}\left(\frac{Y_1 + \cdots + Y_n}{n} - \frac{1}{3}\right)$ converges in distribution to the normal distribution with mean $0$ and variance $\frac{2}{9}$. For large $n$, the average $\frac{Y_1 + \cdots + Y_n}{n}$ should be very nearly normal with mean $\frac{1}{3}$ and variance $\frac{2}{9n}$. Multiplying by $n$ gives that $Y_1 + \cdots + Y_n$ should be roughly normal with mean $\frac{n}{3}$ and variance $\frac{2n}{9}$.
Note that $Y_1 + \cdots + Y_n$, the sum of the $Y_i$, is the number of dice rolls showing a $5$ or $6$. For $n = 450$, the mean is $150$ and the variance is $100$, so a value of $160$ or more is one standard deviation to the right of the mean. Since roughly $68\%$ of the area under the normal curve lies within one standard deviation of the mean, the answer is $\frac{100 - 68}{2} = 16$ percent.
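The answer above uses the rounded $68\%$ rule; the exact one-sided normal tail $P(Z \ge 1)$ is about $15.9\%$. A minimal sketch of that computation using only the standard library (the helper name `normal_tail` is ours):

```python
import math

def normal_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

mean, sd = 150.0, 10.0   # n = 450, p = 1/3: mean n*p, sd sqrt(n*p*(1-p))
z = (160 - mean) / sd    # exactly one standard deviation above the mean
tail = normal_tail(z)

print(f"z = {z}, P(S >= 160) is approximately {tail:.4f}")  # about 0.1587
```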
It was not necessary to manipulate the variables; we could have worked with the average $\frac{Y_1 + \cdots + Y_n}{n}$ instead of the sum $Y_1 + \cdots + Y_n$. The relevant computation would be
$$\frac{160}{450} - \frac{1}{3} = \frac{1}{45} = \sqrt{\frac{2}{9 \cdot 450}},$$
which is exactly one standard deviation for a normal distribution with variance $\frac{2}{9n} = \frac{2}{4050}$. But it is often more natural to work with the sum instead of the normalized average.
Note also that the facts about the mean and the variance stated in the theorem are elementary, as long as the variables are independent: the mean of a sum is the sum of the means, and the variance of a sum of independent variables is the sum of the variances, so the average $S_n$ has mean $\mu$ and variance $\frac{\sigma^2}{n}$. The above example required the central limit theorem only in order to get a good estimate for the probability that the sum would exceed its mean by one standard deviation; the central limit theorem gives an assurance that using the relevant estimate for a normally distributed variable will be roughly accurate.
For finer approximations involving discrete variables, the standard convention is to employ a continuity correction: adjust the bounds of the normally distributed limit variable by half of a unit. For instance, in the example in the previous section, the estimate that the dice showed at least $160$ $5$s and $6$s used $P(S \ge 160)$, where $S$ was the count of $5$s and $6$s, approximating $S$ as a continuous normally distributed variable.
But in fact $S$ is discrete; its values are always integers. If instead we note that the event $S \ge 160$ also equals $S > 159$ or $S \ge 159.01$, we get (slightly) different answers when approximating $S$ by a continuous variable. The solution is to adjust by $\frac{1}{2}$, or half of the minimum increment of $S$, which tends to give the most accurate approximation in real-world situations. So the most accurate answer to the above exercise would use $P(S \ge 159.5)$, which necessitates a computation using the normal distribution for $z = \frac{159.5 - 150}{10} = 0.95$ instead of $z = \frac{160 - 150}{10} = 1$. This gives a value of $17.1\%$ instead of $15.9\%$, which is closer to the correct answer.
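For the die problem ($n = 450$, threshold $160$), the exact binomial tail and both normal estimates can be compared directly. A short sketch (helper names are ours):

```python
import math

n, p = 450, 1 / 3
mean, sd = n * p, math.sqrt(n * p * (1 - p))  # 150, 10

def normal_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Exact binomial tail P(S >= 160), summed term by term
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(160, n + 1))

plain = normal_tail((160 - mean) / sd)        # no continuity correction: z = 1
corrected = normal_tail((159.5 - mean) / sd)  # continuity correction: z = 0.95

print(f"exact binomial tail : {exact:.4f}")
print(f"plain normal        : {plain:.4f}")      # about 0.1587
print(f"with correction     : {corrected:.4f}")  # about 0.1711
```

Running this confirms that the continuity-corrected estimate lands closer to the exact binomial probability than the uncorrected one.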
The central limit theorem can be used to answer questions about sampling procedures. It can be used in reverse, to approximate the size of a sample needed to achieve a desired probability, and it can be used to examine and evaluate assumptions about the initial variables.
A scientist discovers a potentially harmful compound present in human blood. A study shows that the distribution of levels of the compound among adult men has a mean value of $\mu$ with standard deviation $\sigma$. The scientist wishes to take a sample of adult men for another study. How many men must she sample so that the probability that the mean value of the level of the compound in her sample is between $\mu - d$ and $\mu + d$ is at least $95\%$?
The sampling average will have mean $\mu$ and variance $\frac{\sigma^2}{n}$, where $\mu$ and $\sigma^2$ are the population mean and variance and $n$ is the number of samples. A probability of $95\%$ requires us to be within $1.96$ standard deviations of the mean, for a normal distribution. We want $1.96$ standard deviations of the sampling average to be at most the allowed deviation $d$, so the standard deviation should be at most $\frac{d}{1.96}$, and hence the variance at most $\frac{d^2}{1.96^2}$. This gives $\frac{\sigma^2}{n} \le \frac{d^2}{1.96^2}$, or $n \ge \frac{1.96^2 \sigma^2}{d^2}$.
So the sample should have at least $\left\lceil \frac{1.96^2 \sigma^2}{d^2} \right\rceil$ men.
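The bound can be packaged as a one-line computation. In the sketch below, the function name and the values $\sigma = 2$, $d = 0.5$ are illustrative assumptions of ours, not numbers from the original problem:

```python
import math

def required_sample_size(sigma, d, z=1.96):
    """Smallest n with z * sigma / sqrt(n) <= d, i.e. the sample mean falls
    within d of the population mean with probability about 95% when z = 1.96."""
    return math.ceil((z * sigma / d) ** 2)

# Illustrative values only: sigma = 2, d = 0.5
print(required_sample_size(2, 0.5))  # (1.96 * 2 / 0.5)^2 = 61.4656, so 62
```

Passing a different `z` (e.g. $2.576$ for $99\%$) adapts the same formula to other confidence levels.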
A coin is tossed 200 times. It comes up heads 120 times. Is the coin fair?
The central limit theorem says that the number of heads is approximately normally distributed, with mean $200 \cdot \frac{1}{2} = 100$ and variance $200 \cdot \frac{1}{2} \cdot \frac{1}{2} = 50$. Two standard deviations above the mean is $100 + 2\sqrt{50} \approx 114.1$, and $120$ heads is $\frac{120 - 100}{\sqrt{50}} \approx 2.83$ standard deviations above the mean, so this is nearly a 3-sigma event. The probability that the coin comes up heads at least $120$ times is approximately $0.23\%$, which comes from looking up $2.83$ in a $z$-table of values of the function $P(Z \ge z)$ for a normally distributed variable $Z$. At a confidence level of $95\%$, or even $99\%$, we reject the null hypothesis that the coin is fair.
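The z-score and tail probability in this argument can be reproduced with the standard library (the helper name `normal_tail` is ours):

```python
import math

def normal_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, heads = 200, 120
mean = n * 0.5             # 100
sd = math.sqrt(n * 0.25)   # sqrt(50), about 7.07
z = (heads - mean) / sd    # about 2.83
p_value = normal_tail(z)   # P(at least 120 heads) under the fair-coin hypothesis

print(f"z = {z:.2f}, P(heads >= 120) is approximately {p_value:.4f}")
```

Since the tail probability is far below $5\%$ (and below $1\%$), the fair-coin hypothesis is rejected at those confidence levels.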