Confidence Intervals
A confidence interval is a range constructed from sampled data, of fixed size, drawn from a population (sample space) that follows a certain probability distribution. The interval is constructed so that it contains a chosen population parameter with a prescribed probability. Simplified, a confidence interval is a range of values that is expected to contain some unknown value with a stated level of confidence.
For example, take the value \(\mu\) in the statement: "given a sample of size \(n\), assume a 95% confidence interval \((a, b)\) is constructed to contain \(\mu\)." This means that if all possible samples of size \(n\) from the population are considered, then 95% of the confidence intervals constructed from those samples will contain \(\mu\). Constructing a confidence interval with a certain probability does not guarantee that the particular interval obtained contains the true parameter of the population. It only means that, if the sample was chosen at random, the procedure that produced \((a, b)\) captures \(\mu\) for 95% of samples. For confidence intervals built from random samples, the proportion of all such intervals that contain the population mean is called the confidence level (\(0.95\) in this case), not the confidence interval.
Suppose we wanted to know the IQ of the users on Brilliant.org. We draw a random sample of \(1,000\) users from a population of \(1,000,000\) (making this number up) and conclude, with a \(95\%\) confidence level, that the average IQ of the whole population is 115. What does this mean for the whole population?
What it does not mean is that \(95\%\) of the \(1,000,000\) people have an IQ of 115.
Instead, it means that if we took many random samples from the whole population and built a confidence interval from each one, about \(95\%\) of those intervals would contain the true average IQ. For the worked example, we will frame the question as a proportion: some proportion \(p\) of the population, plus or minus a margin of error, has an IQ of \(115\) or higher. Below we'll work through how to find that \(p\).
Definitions
Estimators and Standard Errors
Given a sample \(\{x_1, x_2,\ldots , x_n\}\) of size \(n\) from a population with mean \(\mu \) and variance \(\sigma^2 \), the sample mean is \(\bar X =\frac{x_1+x_2+\ldots+ x_n}{n}\), which is also known as the population mean estimator. The sample variance is \(s^2=\frac{(x_1 -\bar X)^2 +(x_2 -\bar X)^2 +\ldots+ (x_n -\bar X)^2 }{n-1}\). The standard deviation of the sampling distribution of the mean, \(\frac{\sigma}{\sqrt{n}}\), is known as the standard error.
Going back to the example above, we have a sample of \(1,000\) from the population of \(1,000,000\). Instead of \(1,000\) individual data points, let's suppose we were given the mean IQ of the sample, 115, and were told that \(600\) of the sampled users have an IQ of 115 or higher and 400 of the sampled users have a lower IQ. Ultimately we're trying to find our confidence interval.
In this sample we can assign every user with an IQ of 115 or above a score of 1, and everyone below it a score of 0: a binary yes/no answer to the question "are they at 115 or higher?" This means that our sample mean is
\[\bar X =\frac{x_1+x_2+\ldots+ x_n}{n} = \frac{600(1)+400(0)}{1000} = \frac{600}{1000} = 0.6 \]
This is called the population mean estimator because it is the sample's estimate of the proportion of the population (60%) with an IQ of 115 or higher. The goal is to say, with 95% confidence, that 60% of the entire population, plus or minus a margin of error, has an IQ of 115 or higher. The sample variance is
\[s^2=\frac{(x_1 -\bar X)^2 +(x_2 -\bar X)^2 +\ldots+ (x_n -\bar X)^2 }{n-1} = \frac{600(1-0.6)^2 + 400(0-0.6)^2}{1000-1}=\frac{96+144}{999}=0.\overline{240}\]
Since the population standard deviation \(\sigma\) is unknown, we estimate it with the sample standard deviation:
\[\sigma \approx s = \sqrt{s^2} = \sqrt{0.\overline{240}} \approx 0.490\]
Therefore the standard error is
\[\frac{\sigma}{\sqrt{n}} \approx \frac{0.490}{\sqrt{1000}} \approx 0.0155\]
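The arithmetic above can be reproduced with a short script. This is a minimal sketch that rebuilds the binary sample from the counts in the text (600 ones, 400 zeros); the variable names are illustrative, not part of the original example.

```python
import math

# Binary sample from the text: 600 users scored 1 (IQ >= 115), 400 scored 0.
sample = [1] * 600 + [0] * 400
n = len(sample)

# Sample mean: the estimate of the population proportion.
sample_mean = sum(sample) / n

# Sample variance with the n - 1 denominator.
sample_variance = sum((x - sample_mean) ** 2 for x in sample) / (n - 1)

# Sample standard deviation, used here in place of the unknown sigma.
sample_sd = math.sqrt(sample_variance)

# Standard error of the sample mean.
standard_error = sample_sd / math.sqrt(n)

print(f"mean           = {sample_mean:.3f}")
print(f"variance       = {sample_variance:.6f}")
print(f"std deviation  = {sample_sd:.3f}")
print(f"standard error = {standard_error:.4f}")
```

Building the full list of 0s and 1s is wasteful for a proportion, but it keeps the code term-by-term parallel to the formulas for \(\bar X\) and \(s^2\).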
z-score
The z-score (also called the standard score) is the number of standard deviations that a data point lies from the mean. In the case of confidence intervals, the z-score shows how many standard errors on either side of the estimate the interval must extend to reach the desired confidence level. For instance, if a problem asks for a \(95\%\) confidence interval, looking up the z-score gives \(1.96\) standard deviations on either side of the mean. This is consistent with the chart at the top, which shows that 2 standard deviations on either side of the mean correspond to a \(95.44\%\) confidence level.
\(100(1-\alpha)\)% Confidence Interval for the Population Mean (known variance)
Given a sample \(\{x_1, x_2,\ldots , x_n\}\) of size \(n\) from a population with mean \(\mu \) and variance \(\sigma^2 \), the Confidence Interval for the population mean with confidence level \(1-\alpha\) associated with the sample is \((\bar X -z_{\frac{\alpha}{2}}\cdot \frac{\sigma}{\sqrt{n}}, \bar X +z_{\frac{\alpha}{2}}\cdot \frac{\sigma}{\sqrt{n}})\).
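The known-variance formula translates directly into a few lines of code. The sketch below is one possible implementation: the function name `z_confidence_interval` and the input numbers are hypothetical and chosen only for illustration, and the critical value comes from `statistics.NormalDist` in the Python standard library rather than from a printed table.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Return (lower, upper) for the population mean when sigma is known."""
    alpha = 1.0 - level
    # z_{alpha/2}: the point with upper-tail probability alpha/2 under N(0, 1).
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical inputs: sample mean 100, known sigma 15, sample size 36.
lower, upper = z_confidence_interval(sample_mean=100.0, sigma=15.0, n=36)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Because the interval is symmetric, the sample mean always sits at its midpoint, and quadrupling \(n\) halves the width.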
\(100(1-\alpha)\)% Confidence Interval for the Population Mean (unknown variance)
For large samples, \(z\)-values:
Given a sample \(\{x_1, x_2,\ldots , x_n\}\) of size \(n\) from a population with mean \(\mu \) and variance \(\sigma^2 \), the Confidence Interval for the population mean with confidence level \(1-\alpha\) associated with the sample is \((\bar X -z_{\frac{\alpha}{2}}\cdot \frac{s}{\sqrt{n}}, \bar X +z_{\frac{\alpha}{2}}\cdot \frac{s}{\sqrt{n}})\), where \(s\) is the sample standard deviation, i.e. the square root of \(s^2\).
For small samples, \(t\)-values with \(n-1\) degrees of freedom:
Given a sample \(\{x_1, x_2,\ldots , x_n\}\) of size \(n\) from a population with mean \(\mu \) and variance \(\sigma^2 \), the Confidence Interval for the population mean with confidence level \(1-\alpha\) associated with the sample is \((\bar X -t_{\frac{\alpha}{2},n-1}\cdot \frac{s}{\sqrt{n}}, \bar X +t_{\frac{\alpha}{2},n-1}\cdot \frac{s}{\sqrt{n}})\), where \(s\) is the sample standard deviation, i.e. the square root of \(s^2\).
Picking up the example again, we want a \(95\%\) confidence interval, so we need to look up the z-score that corresponds to 95%. Counter-intuitively, we don't look up \(95\%\); instead we look up \(50\% + \frac{95\%}{2} = 97.5\%\), because a standard table lists the cumulative area to the left of \(z\) and the remaining \(5\%\) is split equally between the two tails. For more on why, check out the wikis on standard deviations and z-scores. To look up \(97.5\%\) we need a standard z-score table, like the one here. We find the table entry closest to \(0.975\) and then add its row and column headers. In this case the z-score is \(1.96\). It's also possible to compute the z-score directly, but for the standard normal distribution the value for a given confidence level is always the same.
The margin of error is \(z\sigma_{\bar X} = 1.96 \cdot 0.0155 = 0.0304\), which is the half-width of our confidence interval. We can now say, with \(95\%\) confidence, that \(60\% \pm 3.04\%\) of our total population has an IQ of 115 or higher! In terms of a group of 1000 Brilliant users, the number \(p\) with an IQ of 115 or higher is estimated as \(569.6 \le p \le 630.4\). If we took 100 random samples of 1000 Brilliant users and built an interval from each, we would expect about 95 of those intervals to contain the true proportion.
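The table lookup and the final multiplication can be checked numerically. In this sketch, `NormalDist().inv_cdf` from the Python standard library stands in for the printed z-table, and the proportion \(0.6\) and standard error \(0.0155\) are taken from the example; the margin is expressed as a proportion, so \(0.0304\) corresponds to about 3.04 percentage points.

```python
from statistics import NormalDist

level = 0.95
# Table lookup value: 50% + 95%/2 = 97.5% of the area lies below the z-score.
cumulative = 0.5 + level / 2
z = NormalDist().inv_cdf(cumulative)

p_hat = 0.6              # sample proportion from the example
standard_error = 0.0155  # standard error from the example
margin = z * standard_error

lower, upper = p_hat - margin, p_hat + margin
print(f"z = {z:.2f}, margin = {margin:.4f}")
print(f"interval = ({lower:.4f}, {upper:.4f})")
print(f"per 1000 users: {1000 * lower:.1f} to {1000 * upper:.1f}")
```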
Sampling Distributions and the Central Limit Theorem
As referenced in the introduction, if repeated random samples (with replacement and of the same size) are taken from a population, different statistical measures can be computed from each of these samples. For a given statistical measure, for instance the mean, the probability distribution of the means of all samples of size \(n\) is called the sampling distribution of the mean. If the population has mean \(\mu \) and variance \(\sigma^2\), then for large values of \(n\) the central limit theorem implies that the sampling distribution of the mean for samples of size \(n\) is approximately normal with mean \(\mu \) and variance \(\frac{\sigma^2}{n}\).
Additional notation simplifies calculations with the normal distribution. The probability that a random variable \(X\) takes a value of at most a fixed value \(x\) is denoted \(P(X\leq x)\). If the random variable \(X\sim \mathcal{N}(\mu,\sigma^2)\) represents the distribution of interest, the simple transformation \(Z=\frac{X-\mu}{\sigma}\) converts \(X\) into \(Z\sim \mathcal{N}(0,1)\). For a fixed real number \(\alpha\in (0,1) \), \(z_\alpha\) denotes the solution \(z\) of the equation \(P(Z\geq z)=\alpha\). Similarly, for a random variable \(T\) following the t-distribution with \(n-1\) degrees of freedom, \(t_{\alpha,n-1}\) denotes the solution \(t\) of the equation \(P(T\geq t)=\alpha\).
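A small simulation can make the sampling distribution concrete. The sketch below assumes an exponential population with rate 1 (so \(\mu = 1\) and \(\sigma^2 = 1\)), a choice made here only because it is clearly non-normal; the sample size, number of repetitions, and seed are likewise arbitrary.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

n = 50              # size of each sample
num_samples = 5000  # number of repeated samples

# Assumed population: exponential with rate 1, which has mu = 1 and
# sigma^2 = 1 and is clearly not normal.
mu, sigma = 1.0, 1.0

sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.pvariance(sample_means)

# Fraction of sample means within 1.96 standard errors of mu.
standard_error = sigma / math.sqrt(n)
within = sum(abs(m - mu) < 1.96 * standard_error for m in sample_means) / num_samples

print(f"mean of sample means     = {mean_of_means:.4f} (theory: {mu})")
print(f"variance of sample means = {variance_of_means:.4f} (theory: {sigma**2 / n})")
print(f"fraction within 1.96 SE  = {within:.3f} (theory: about 0.95)")
```

The observed mean and variance of the sample means should land close to \(\mu\) and \(\frac{\sigma^2}{n}\), and roughly 95% of the sample means should fall within 1.96 standard errors of \(\mu\), even though the underlying population is skewed.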
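The notation can be checked numerically with the standard library's `statistics.NormalDist`. In this sketch the helper `z_upper` and the values of \(\mu\), \(\sigma\), and \(x\) are hypothetical; it verifies that standardizing leaves probabilities unchanged and that \(z_\alpha\) is the point with upper-tail probability \(\alpha\).

```python
from statistics import NormalDist

standard_normal = NormalDist()  # Z ~ N(0, 1)


def z_upper(alpha):
    """Solve P(Z >= z) = alpha for z."""
    return standard_normal.inv_cdf(1.0 - alpha)


# Hypothetical X ~ N(mu, sigma^2) used to check the standardization.
mu, sigma, x = 10.0, 2.0, 13.0
X = NormalDist(mu, sigma)  # NormalDist takes the standard deviation, not the variance
z = (x - mu) / sigma

p_direct = X.cdf(x)                      # P(X <= x)
p_standardized = standard_normal.cdf(z)  # P(Z <= (x - mu) / sigma)

alpha = 0.025
z_alpha = z_upper(alpha)
tail = 1.0 - standard_normal.cdf(z_alpha)  # should recover alpha

print(f"P(X <= {x}) = {p_direct:.6f}, via Z: {p_standardized:.6f}")
print(f"z_alpha for alpha = {alpha}: {z_alpha:.3f}, tail probability: {tail:.4f}")
```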
Examples and Problems
How does one measure the growth, in height, of a person at \(\pi\) seconds after midnight, the day after their 21st birthday?
In a theoretical setting, it is possible to deal with this type of question. In practice, however, precise measurement becomes difficult. Since giving up is not an option, the next best thing is to approximate, minimizing measurement errors along the way. It is assumed that these errors follow a known probability distribution.
Construction-Known Variance
Suppose we are given a sample of size \(n\) from a probability distribution with mean \(\mu\) and variance \(\sigma^2\). Since the sample mean \(\bar X\) is an estimator for the population mean \(\mu\), and is approximately normally distributed with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\), we are interested in the probability that our estimator is close to the population mean:
\[P(\lvert\bar X - \mu \rvert < \delta)=1-\alpha \]
We must determine the value \(\delta\) that satisfies this equation under our assumptions. Since \(\bar X \) is approximately normally distributed with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\), it can be standardized to follow \(\mathcal{N}(0,1)\):
\[\begin{align} 1-\alpha &= P\left (\left | \frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}}\right | < \frac{\delta}{\frac{\sigma}{\sqrt{n}}}\right ) \\ &= P\left (\left | Z \right | < \frac{\delta}{\frac{\sigma}{\sqrt{n}}}\right ) \\ &= P\left (-\frac{\delta}{\frac{\sigma}{\sqrt{n}}} < Z < \frac{\delta}{\frac{\sigma}{\sqrt{n}}}\right ) \\ \end{align}\]
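Since the two tails outside this range must together have probability \(\alpha\), each tail has probability \(\frac{\alpha}{2}\). Following the previously defined notation, we have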
\[ \frac{\delta}{\frac{\sigma}{\sqrt{n}}}=z_{\frac{\alpha}{2}} \]
Which implies,
\[ \delta=z_{\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}} \]
Therefore, in order to obtain an interval with confidence level \( 1 - \alpha \) we must have,
\[ - \delta < \bar X - \mu < \delta \] \[ - z_{\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}} < \bar X - \mu < z_{\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}} \] \[ \bar X - z_{\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar X + z_{\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}} \]
The following table shows the most common confidence levels together with their corresponding z-values, used to construct confidence intervals.
Confidence Level \(100(1-\alpha)\%\) | \[\alpha\] | \[\frac{\alpha}{2}\] | \[z_{\frac{\alpha}{2}}\] |
\[90\%\] | \[0.10\] | \[0.050\] | \[1.645\] |
\[95\%\] | \[0.05\] | \[0.025\] | \[1.960\] |
\[99\%\] | \[0.01\] | \[0.005\] | \[2.576\] |
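The z-values for these confidence levels can be recomputed instead of read from a table. The following sketch uses `statistics.NormalDist` from the Python standard library; the helper name `critical_value` is an illustrative choice.

```python
from statistics import NormalDist


def critical_value(level):
    """Return (alpha, alpha/2, z_{alpha/2}) for a given confidence level."""
    alpha = 1.0 - level
    return alpha, alpha / 2.0, NormalDist().inv_cdf(1.0 - alpha / 2.0)


rows = {level: critical_value(level) for level in (0.90, 0.95, 0.99)}
for level, (alpha, half_alpha, z) in rows.items():
    print(f"{level:.0%}  alpha={alpha:.2f}  alpha/2={half_alpha:.3f}  z={z:.3f}")
```

The same function works for any other confidence level strictly between 0 and 1, which is convenient when a table does not list the level needed.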
Tables for t-values are more complicated, because the values also depend on the degrees of freedom.
A veterinarian is studying a particular side effect of a new heartworm medication for dogs: patches of hair loss in the subject. The manufacturer offered data for a sample of size 50 from a population of 1,000 subjects. The side effect became prevalent if the dosage was over 330 mcg of active ingredient, and the drug became ineffective if the dosage was under 260 mcg of active ingredient. Treat \((260, 330)\) as a 95% confidence interval obtained from the sample of 50 subjects. What is the population variance?
Solution
According to the problem, the interval \((260,330)\) is a 95% confidence interval for the population mean of the drug effectiveness. Since this confidence interval is constructed with the sample mean at its center, \(\bar X=\frac{260+330}{2}=295\). Also, we must have \(\bar X + z_{0.025} \cdot \frac{\sigma}{\sqrt{50}}=330 \), which implies \(\sigma^2=50\cdot\left ( \frac{330-\bar X}{z_{0.025}} \right )^2 \). Therefore the population variance is \(\sigma^2=50\cdot\left ( \frac{330-295}{1.96} \right )^2\approx 15943.88 \).
According to data from NOAA (National Oceanic and Atmospheric Administration), the monthly sea level fluctuation measured in the San Francisco Bay Area had a \(95\%\) confidence interval of \((1.75,2.13)\) mm/year. If the standard deviation of the fluctuation is \(3.6478\) mm/year, how many years did the data cover?
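The solution can be double-checked by running the interval formula in reverse. This sketch uses the rounded table value \(z_{0.025} = 1.96\), as the solution does; the variable names are illustrative.

```python
import math

lower, upper = 260.0, 330.0  # reported 95% confidence interval (mcg)
n = 50                       # sample size
z = 1.96                     # table value of z_{0.025} used in the solution

sample_mean = (lower + upper) / 2  # the interval is centered on the sample mean
margin = (upper - lower) / 2       # half-width, equal to z * sigma / sqrt(n)

# Solve margin = z * sigma / sqrt(n) for sigma, then square it.
sigma = margin * math.sqrt(n) / z
variance = sigma ** 2

print(f"sample mean = {sample_mean}")
print(f"sigma       = {sigma:.2f}")
print(f"variance    = {variance:.2f}")
```

Using an unrounded critical value instead of 1.96 would shift the variance slightly, so small differences from the figure in the solution are a rounding effect rather than an error.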