Hypergeometric Distribution
The hypergeometric distribution, intuitively, is the probability distribution of the number of red marbles drawn from a set of red and blue marbles, without replacement of the marbles. In contrast, the binomial distribution describes the number of red marbles drawn with replacement. The hypergeometric distribution is useful for situations in which an observed outcome cannot recur, such as poker (and other card games), where a card that has been seen cannot be drawn again in the same hand. It is also applicable to many of the same situations as the binomial distribution, including risk management and statistical significance testing.
Formal Definition
Consider a population and an attribute, where the attribute takes one of two mutually exclusive states and every member of the population is in one of those two states. For example, the attribute might be "over/under 30 years old," "is/isn't a lawyer," "passed/failed a test," and so on. Furthermore, the population will be sampled without replacement, meaning that the draws are not independent: each draw affects the next since each draw reduces the size of the population.
Given the size of the population \(N\) and the number of people \(K\) that have a desired attribute, the hypergeometric distribution gives the probability of drawing exactly \(k\) people with the desired attribute in \(n\) draws.
For example, if a bag of marbles is known to contain 10 red and 6 blue marbles, the hypergeometric distribution can be used to find the probability that exactly 2 of 3 drawn marbles are red.
Finding the Hypergeometric Distribution
If the population size is \(N\), the number of people with the desired attribute is \(K\), and there are \(n\) draws, the probability of drawing exactly \(k\) people with the desired attribute is
\[\text{Pr}(X = k) = f(k; N, K, n) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}.\]
This formula can be derived by selecting \(k\) of the \(K\) possible successes in \(\binom{K}{k}\) ways, then selecting \((n-k)\) of the \((N-K)\) possible failures in \(\binom{N-K}{n-k}\) ways, and finally dividing by the total number \(\binom{N}{n}\) of possible \(n\)-person draws, all of which are equally likely.
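As a quick check of the formula, consider the earlier bag of 10 red and 6 blue marbles, so that \(N=16\), \(K=10\), \(n=3\), and \(k=2\):
\[\text{Pr}(X = 2) = f(2; 16, 10, 3) = \frac{\binom{10}{2} \binom{6}{1}}{\binom{16}{3}} = \frac{45 \cdot 6}{560} = \frac{27}{56} \approx .482.\]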
A bag of marbles contains 13 red marbles and 8 blue marbles. If five marbles are drawn from the bag, what is the resulting hypergeometric distribution?
Here, the population size is \(13+8=21\), there are \(13\) objects with the desired attribute (redness), and there are 5 draws. The above formula then applies directly:
\[\begin{align} \text{Pr}(X = 0) = f(0; 21, 13, 5) = \frac{\binom{13}{0} \binom{8}{5}}{\binom{21}{5}} &\approx .003\\ \text{Pr}(X = 1) = f(1; 21, 13, 5) = \frac{\binom{13}{1} \binom{8}{4}}{\binom{21}{5}} &\approx .045\\ \text{Pr}(X = 2) = f(2; 21, 13, 5) = \frac{\binom{13}{2} \binom{8}{3}}{\binom{21}{5}} &\approx .215\\ \text{Pr}(X = 3) = f(3; 21, 13, 5) = \frac{\binom{13}{3} \binom{8}{2}}{\binom{21}{5}} &\approx .394\\ \text{Pr}(X = 4) = f(4; 21, 13, 5) = \frac{\binom{13}{4} \binom{8}{1}}{\binom{21}{5}} &\approx .281\\ \text{Pr}(X = 5) = f(5; 21, 13, 5) = \frac{\binom{13}{5} \binom{8}{0}}{\binom{21}{5}} &\approx .063.\ _\square \end{align}\]
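The values above can be reproduced with a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `hypergeom_pmf` is chosen here for illustration, and `math.comb` assumes Python 3.8 or later.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Pr(X = k): k successes in n draws without replacement
    from a population of size N that contains K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# 13 red and 8 blue marbles, 5 draws: print Pr(X = k) for k = 0..5
for k in range(6):
    print(k, round(hypergeom_pmf(k, 21, 13, 5), 3))
```

Because the six outcomes are exhaustive, the printed probabilities should sum to 1 up to rounding.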
This can be represented pictorially:
A gambler shows you a box with 5 white and 2 black marbles in it. All the marbles are identical except for their color. He invites you to draw without replacement 3 marbles from the box while you are blindfolded, and you lose if you draw a black marble.
If you lose $10 for losing the game, how much should you get paid for winning it for your mathematical expectation to be zero (i.e. to make it a fair game)?
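One way to approach this problem is sketched below. It assumes that drawing at least one black marble loses the game, so winning means all three drawn marbles are white; the variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

# Box: 5 white and 2 black marbles; 3 are drawn without replacement.
# Assumption: drawing at least one black marble loses the game.
p_win = Fraction(comb(5, 3) * comb(2, 0), comb(7, 3))  # all three white
p_lose = 1 - p_win

loss = 10  # dollars lost on a losing game

# Fair game: p_win * payout - p_lose * loss = 0
payout = p_lose * loss / p_win

print(p_win, payout)
```

Under that reading, the probability of winning is \(\frac{2}{7}\), and a payout of $25 makes the expected value zero.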
Properties of the Hypergeometric Distribution
There are several important values that give information about a particular probability distribution. The most important are these:
- The mean, or expected value, of a distribution is the average one would expect over a large number of repeated trials.
- The median of a distribution is another measure of central tendency, useful when the distribution contains outliers (i.e. particularly large/small values) that make the mean misleading.
- The mode of a distribution is the value that has the highest probability of occurring.
- The variance of a distribution measures how "spread out" the data is. Related is the standard deviation, the square root of the variance, useful due to being in the same units as the data.
Three of these values (the mean, mode, and variance) have simple closed forms for a hypergeometric distribution. The median, however, has no simple general formula.
The mean is intuitive, in the same sense that it is for a binomial distribution:
The mean of \(f(k; N, K, n)\) is \(\frac{nK}{N}.\)
The mode is significantly more complex:
The mode of \(f(k; N, K, n)\) is \[\left\lfloor\frac{(n+1)(K+1)}{N+2}\right\rfloor.\]
The variance is even more involved:
The variance of \(f(k; N, K, n)\) is \[n\frac{K}{N}\frac{N-K}{N}\frac{N-n}{N-1}.\]
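These closed forms can be checked against a direct computation from the probabilities. The sketch below does so for the earlier marble example (13 red, 8 blue, 5 draws), using exact fractions to avoid rounding error; it is a check for one set of parameters, not a proof.

```python
from fractions import Fraction
from math import comb

N, K, n = 21, 13, 5  # the marble example: 13 red, 8 blue, 5 draws

# Exact probabilities Pr(X = k) for k = 0..n
pmf = [Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))
       for k in range(n + 1)]

# Mean, variance, and mode computed directly from the probabilities
mean = sum(k * p for k, p in enumerate(pmf))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
mode = max(range(n + 1), key=lambda k: pmf[k])

# Compare with the closed forms stated above
assert mean == Fraction(n * K, N)
assert variance == Fraction(n * K * (N - K) * (N - n), N * N * (N - 1))
assert mode == (n + 1) * (K + 1) // (N + 2)

print(float(mean), mode, float(variance))
```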
It is also worth noting that, as expected, the probabilities of each \(k\) sum up to 1:
\[\sum_{k=0}^{n}f(k; N, K, n) = \sum_{k=0}^{n}\frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}=1,\]
which is a consequence of Vandermonde's identity.
Additionally, the symmetry of the problem gives the following identity:
\[\frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}=\frac{\binom{n}{k}\binom{N-n}{K-k}}{\binom{N}{K}}.\]
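Both facts can be spot-checked numerically. The sketch below uses exact fractions and a small helper, `binom` (a name introduced here), which treats out-of-range binomial coefficients as zero. It verifies the identities for a few parameter choices only.

```python
from fractions import Fraction
from math import comb

def binom(a, b):
    """Binomial coefficient, taken to be 0 when b is outside 0..a."""
    return comb(a, b) if 0 <= b <= a else 0

def check(N, K, n):
    """Verify the symmetry identity for each k and that the pmf sums to 1."""
    total = Fraction(0)
    for k in range(n + 1):
        lhs = Fraction(binom(K, k) * binom(N - K, n - k), binom(N, n))
        rhs = Fraction(binom(n, k) * binom(N - n, K - k), binom(N, K))
        assert lhs == rhs  # symmetry identity
        total += lhs
    assert total == 1      # consequence of Vandermonde's identity
    return True

print(check(21, 13, 5), check(52, 13, 7), check(10, 3, 6))
```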
Practical Applications
As mentioned in the introduction, card games are excellent illustrations of the hypergeometric distribution's use. Here is an example:
In the game of Texas Hold'em, players are each dealt two private cards, and five community cards are dealt face-up on the table. Each player makes the best 5-card hand they can with their two private cards and the five community cards. What is the probability that a particular player can make a flush of spades (i.e. has at least 5 spades among those 7 cards)?
This situation can be modeled by a hypergeometric distribution where the population size is 52 (the number of cards), the number of objects with the desired attribute (spades) is 13, and there are 7 draws. The player needs at least 5 successes, so the probability is
\[\begin{align} f(5; 52, 13, 7)+f(6; 52, 13, 7)+f(7; 52, 13, 7) &=\frac{\binom{13}{5} \binom{39}{2}}{\binom{52}{7}}+\frac{\binom{13}{6} \binom{39}{1}}{\binom{52}{7}}+\frac{\binom{13}{7} \binom{39}{0}}{\binom{52}{7}} \\\\ &\approx 0.0076.\ _\square \end{align}\]
The distribution can also be used once some cards have already been observed. Here is another example:
Bob is playing Texas Hold'em, and his two private cards are both spades. What is the probability he finishes with a flush of spades?
This situation can be modeled by a hypergeometric distribution where the population size is 50 (the number of remaining cards), the number of remaining objects with the desired attribute (spades) is 11, and there are 5 draws. The player needs at least 3 successes, so the probability is
\[\begin{align} f(3; 50, 11, 5)+f(4; 50, 11, 5)+f(5; 50, 11, 5) &=\frac{\binom{11}{3} \binom{39}{2}}{\binom{50}{5}}+\frac{\binom{11}{4} \binom{39}{1}}{\binom{50}{5}}+\frac{\binom{11}{5} \binom{39}{0}}{\binom{50}{5}} \\\\ &\approx 0.064.\ _\square \end{align}\]
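Both flush calculations are upper-tail sums of the same distribution, so they can be computed with one helper. The function name `prob_at_least` is an illustrative choice, not a standard API.

```python
from math import comb

def prob_at_least(k_min, N, K, n):
    """Pr(X >= k_min) for a hypergeometric X with parameters N, K, n."""
    favorable = sum(comb(K, k) * comb(N - K, n - k)
                    for k in range(k_min, n + 1))
    return favorable / comb(N, n)

# Any player: 7 cards from a full deck, at least 5 of the 13 spades
print(round(prob_at_least(5, 52, 13, 7), 4))

# Bob: 5 community cards from the remaining 50, at least 3 of the 11 spades
print(round(prob_at_least(3, 50, 11, 5), 3))
```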
Hypergeometric Test
The hypergeometric test is used to determine the statistical significance of having drawn \(k\) objects with a desired property from a population of size \(N\) with \(K\) total objects that have the desired property. In other words, it tests whether a sample is consistent with random sampling or whether it over-represents (or under-represents) a particular demographic.
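A common way to carry out such a test is to compute a one-sided tail probability under the assumption of random sampling: \(\text{Pr}(X \ge k)\) when checking for over-representation, or \(\text{Pr}(X \le k)\) for under-representation. The sketch below assumes that approach; the function names and the example numbers are hypothetical.

```python
from math import comb

def pmf(k, N, K, n):
    """Pr(X = k) for a hypergeometric X with parameters N, K, n."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def p_over(k, N, K, n):
    """Pr(X >= k): a small value suggests over-representation."""
    return sum(pmf(i, N, K, n) for i in range(k, n + 1))

def p_under(k, N, K, n):
    """Pr(X <= k): a small value suggests under-representation."""
    return sum(pmf(i, N, K, n) for i in range(0, k + 1))

# Hypothetical data: population of 100 with 20 having the property;
# a sample of 10 contains 6 with the property.
N, K, n, k = 100, 20, 10, 6
print(p_over(k, N, K, n), p_under(k, N, K, n))
```

The resulting tail probability is then compared with a chosen significance level, such as 0.05.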