The geometric distribution, intuitively speaking, is the probability distribution of the number of tails one must flip before the first head using a weighted coin. It is useful for modeling situations in which it is necessary to know how many attempts are likely necessary for success, and thus has applications to population modeling, econometrics, return on investment (ROI) of research, and so on.
There are, unfortunately, two widely used definitions of the geometric distribution, and the choice of which to use is a matter of context and convention. Fortunately, they are equivalent in spirit, as will be shown momentarily.
A Bernoulli trial, or Bernoulli experiment, is an experiment satisfying two key properties:
- There are exactly two complementary outcomes, success and failure.
- The probability of success is the same every time the experiment is repeated.
As noted above, two definitions of the geometric distribution are in wide use, with no clear consensus on which is preferred, so the choice is a matter of context and local convention. In both, a series of Bernoulli trials is conducted until a success occurs, and a random variable $X$ is defined as either
- the number of trials in the series, or
- the number of failures in the series.
In either case, the geometric distribution is defined as the probability distribution of $X$.
Fortunately, these definitions are essentially equivalent, as they are simply shifted versions of each other. For this reason, the former is sometimes referred to as the shifted geometric distribution. In accordance with this convention, this article will use the latter definition for the geometric distribution; in particular, $X$ represents the number of failures in the series of trials.
For example, consider rolling a fair die until a 1 is rolled. Rolling the die once is a Bernoulli trial, since there are exactly two possible outcomes (either a 1 is rolled or a 1 is not rolled), and their probabilities stay constant at $\frac{1}{6}$ and $\frac{5}{6}$. The resulting number of times a 1 is not rolled is represented by the random variable $X$, and the geometric distribution is the probability distribution of $X$.
For a geometric distribution with probability $p$ of success, the probability that exactly $k$ failures occur before the first success is
$$(1-p)^k p.$$
This is written as $\Pr(X = k)$, denoting the probability that the random variable $X$ is equal to $k$, or as $G(k; p)$, denoting the geometric distribution with parameters $k$ and $p$.
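The formula for the probability of exactly $k$ failures before the first success can be evaluated directly. The following is a minimal Python sketch, assuming the failures-counting convention used in this article; the function name `geometric_pmf` is illustrative rather than taken from any library.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Probability of exactly k failures before the first success.

    Assumes the failures-counting convention: Pr(X = k) = (1 - p)^k * p.
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    if k < 0:
        return 0.0  # a negative number of failures is impossible
    return (1 - p) ** k * p


# Fair die, where a success is rolling a 1, so p = 1/6.
for k in range(4):
    print(k, round(geometric_pmf(k, 1 / 6), 4))
```

Summing `geometric_pmf(k, p)` over a long range of `k` should give a value very close to 1, which is a quick sanity check on the formula.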
Note that the geometric distribution satisfies the important property of being memoryless, meaning that if a success has not yet occurred at some given point, the probability distribution of the number of additional failures does not depend on the number of failures already observed. For instance, suppose a die is being rolled until a 1 is observed. If the additional information were provided that the die had already been rolled three times without a 1 being observed, the probability distribution of the number of further rolls is the same as it would be without the additional information.
This fact can also be observed from the above formula, as starting $k$ from any particular value does not affect the relative probabilities of the subsequent values of $k$: each probability is $(1-p)$ times the one before it. The successive probabilities therefore form a geometric sequence, which also lends its name to the distribution.
A die is rolled until a 1 occurs. What is the resulting geometric distribution?
The probability of success of a single trial is $\frac{1}{6}$, so the above formula can be used directly:
$$\Pr(X = k) = \left(\frac{5}{6}\right)^k \frac{1}{6}.$$
This can also be represented pictorially, as in the following picture:
There are several important values that give information about a particular probability distribution. The most important are as follows:
- The mean or expected value of a distribution gives useful information about what average one would expect from a large number of repeated trials.
- The median of a distribution is another measure of central tendency, useful when the distribution contains outliers (i.e. particularly large/small values) that make the mean misleading.
- The mode of a distribution is the value that has the highest probability of occurring.
- The variance of a distribution measures how "spread out" the data is. Related is the standard deviation--the square root of the variance--useful due to being in the same units as the data.
Three of these values--the mean, mode, and variance--are generally calculable for a geometric distribution. The median, however, is not generally determined.
The easiest to calculate is the mode, as it is simply equal to 0 in all cases, except for the trivial case $p = 0$ in which every value is a mode. This is due to the fact that $p > (1-p)^k p$ for every $k \geq 1$ when $p > 0$.
The mean is somewhat more difficult to calculate, but it is reasonably intuitive:
The mean of a geometric distribution with parameter $p$ is $\frac{1-p}{p}$, or $\frac{1}{p} - 1$.
The simplest proof involves calculating the mean for the shifted geometric distribution and then applying it to the normal geometric distribution. In the shifted geometric distribution, suppose that the expected number of trials is $E$. There is a probability $p$ that only one trial is necessary, and a probability $1-p$ that the first trial fails and an identical scenario is reached, in which case the expected number of additional trials is again $E$ (this is a consequence of the fact that the distribution is memoryless). As such, the equation
$$E = p \cdot 1 + (1-p)(1 + E)$$
holds, so $E = \frac{1}{p}$.
As a result, the expected number of failures before reaching a success is one less than the expected total number of trials, meaning that the expected number of failures is $\frac{1}{p} - 1$.
Note that this makes intuitive sense: for example, if an event has a probability $\frac{1}{5}$ of occurring per day, it is natural to expect that the event would occur in 5 days.
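The value of the mean can be checked numerically by summing $k \cdot \Pr(X = k)$ over many values of $k$. The sketch below assumes the failures-counting convention and truncates the infinite sum at a large number of terms; the name `truncated_mean` and the default cutoff are illustrative choices.

```python
def truncated_mean(p: float, terms: int = 10_000) -> float:
    """Approximate E[X] by summing k * (1 - p)^k * p for k = 0 .. terms - 1."""
    return sum(k * (1 - p) ** k * p for k in range(terms))


p = 1 / 5
expected_failures = truncated_mean(p)

# Expected failures next to the closed form 1/p - 1.
print(expected_failures, 1 / p - 1)
# Expected trials (failures plus the final success) next to 1/p.
print(expected_failures + 1, 1 / p)
```

For small $p$ the tail of the sum decays slowly, so the cutoff would need to grow accordingly for the approximation to remain accurate.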
A similar strategy can be used for the variance:
The variance of a geometric distribution with parameter $p$ is $\frac{1-p}{p^2}$.
Note that the variance of the geometric distribution and the variance of the shifted geometric distribution are identical, as variance is a measure of dispersion, which is unaffected by shifting.
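The proof of the variance is not written out here. One possible derivation, following the same first-step strategy as for the mean, is sketched below. It assumes $Y$ (a symbol introduced only for this sketch) denotes the number of trials in the shifted geometric distribution, so that $E[Y] = \frac{1}{p}$ and the number of failures is $X = Y - 1$.

```latex
\begin{aligned}
E[Y^2] &= p \cdot 1^2 + (1-p)\,E\!\left[(1+Y)^2\right] \\
       &= p + (1-p)\left(1 + 2E[Y] + E[Y^2]\right) \\
       &= 1 + \frac{2(1-p)}{p} + (1-p)\,E[Y^2], \\
p\,E[Y^2] &= \frac{2-p}{p}
  \quad\Longrightarrow\quad E[Y^2] = \frac{2-p}{p^2}, \\
\operatorname{Var}(Y) &= E[Y^2] - \left(E[Y]\right)^2
  = \frac{2-p}{p^2} - \frac{1}{p^2} = \frac{1-p}{p^2}, \\
\operatorname{Var}(X) &= \operatorname{Var}(Y - 1) = \operatorname{Var}(Y) = \frac{1-p}{p^2}.
\end{aligned}
```

The last line uses the fact that shifting a random variable by a constant leaves its variance unchanged.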
The memoryless property described earlier can be stated precisely. Let $X$ be a geometrically distributed random variable counting failures, and let $m$ and $n$ be two nonnegative integers. Then by this property
$$\Pr(X \geq m + n \mid X \geq m) = \Pr(X \geq n).$$
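The memoryless property can also be checked numerically. The following sketch assumes that $X$ counts failures, so that $\Pr(X \geq k) = (1-p)^k$ (the first $k$ trials must all fail); the function names are illustrative.

```python
def tail(k: int, p: float) -> float:
    """Pr(X >= k): the first k trials are all failures."""
    return (1 - p) ** k


def conditional_tail(m: int, n: int, p: float) -> float:
    """Pr(X >= m + n | X >= m), from the definition of conditional probability."""
    return tail(m + n, p) / tail(m, p)


# Die example: three rolls without a 1 have already been observed.
p = 1 / 6
print(conditional_tail(3, 2, p))  # at least 2 further failures, given 3 so far
print(tail(2, p))                 # at least 2 failures, starting fresh
```

The two printed values agree up to floating-point rounding, reflecting that the failures already observed carry no information about the failures still to come.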
The geometric distribution is useful for determining the likelihood of a success given a limited number of trials, which is highly applicable to the real world in which unlimited (and unrestricted) trials are rare. Therefore, it is unsurprising that a variety of scenarios are modeled well by geometric distributions:
- In sports, particularly in baseball, a geometric distribution is useful in analyzing the probability a batter earns a hit before he receives three strikes; here, the goal is to reach a success within 3 trials.
- In cost-benefit analyses, such as a company deciding whether to fund research trials that, if successful, will earn the company some estimated profit, the goal is to reach a success before the cost outweighs the potential gain.
- In time management, the goal is to complete a task before some set amount of time.
Other applications, similar to the above ones, are easily constructed as well; in fact, the geometric distribution is applied on an intuitive level in daily life on a regular basis.
A baseball player has a 30% chance of getting a hit on any given pitch. Ignoring balls, what is the probability that the player earns a hit before he strikes out (which requires three strikes)?
In this instance, a success is a hit and a failure is a strike. The player needs to have either 0, 1, or 2 failures in order to get a hit before striking out, so the probability of a hit is
$$\Pr(X = 0) + \Pr(X = 1) + \Pr(X = 2) = 0.3 + (0.7)(0.3) + (0.7)^2(0.3) = 0.657.$$
Knowledge of this probability is useful, for instance, in deciding whether to intentionally walk the batter (in the hopes that the next batter, who has a lower batting percentage, will strike out).
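The calculation in this example amounts to summing the geometric probabilities up to a maximum number of allowed failures. A minimal sketch, assuming the failures-counting convention, follows; `prob_success_within` is an illustrative name.

```python
def prob_success_within(max_failures: int, p: float) -> float:
    """Pr(X <= max_failures): a success occurs after at most max_failures failures."""
    return sum((1 - p) ** k * p for k in range(max_failures + 1))


# Batter with a 30% chance of a hit, allowed at most 2 strikes before the hit.
hit_probability = prob_success_within(2, 0.3)
print(round(hit_probability, 3))   # 0.657

# Equivalent shortcut: one minus the probability of three straight strikes.
print(round(1 - 0.7 ** 3, 3))      # 0.657
```

The shortcut works because failing to succeed within $n$ failures is the same event as the first $n + 1$ trials all failing.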
A programmer has a 90% chance of finding a bug every time he compiles his code, and it takes him two hours to rewrite his code every time he discovers a bug. What is the probability that he will finish his program by the end of his workday?
Assume that a workday is 8 hours and that the programmer compiles his code immediately at the beginning of the day.
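In this instance, a success is a bug-free compilation, which has probability $0.1$, and a failure is the discovery of a bug. The programmer needs to have 0, 1, 2, or 3 failures, so his probability of finishing his program is
$$0.1 + (0.9)(0.1) + (0.9)^2(0.1) + (0.9)^3(0.1) = 1 - (0.9)^4 = 0.3439.$$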
This information is useful for determining whether the programmer should spend his day writing the program or performing some other tasks during that time.
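A result like this can also be estimated by simulation, which is useful when the scenario becomes too complicated for a closed form. The sketch below is one possible Monte Carlo approach, not part of the example itself: it assumes a 10% chance of a bug-free compilation and at most 3 failures in the workday, and the function names, number of runs, and seed are arbitrary illustrative choices.

```python
import random


def failures_before_success(p: float, rng: random.Random) -> int:
    """Run Bernoulli trials until the first success and count the failures."""
    failures = 0
    while rng.random() >= p:  # a draw below p counts as a success
        failures += 1
    return failures


def estimate_finish_probability(p: float = 0.1, max_failures: int = 3,
                                runs: int = 100_000, seed: int = 0) -> float:
    """Fraction of simulated workdays with at most max_failures buggy compilations."""
    rng = random.Random(seed)
    finished = sum(
        failures_before_success(p, rng) <= max_failures for _ in range(runs)
    )
    return finished / runs


print(estimate_finish_probability())  # simulated estimate
print(1 - 0.9 ** 4)                   # exact value for comparison
```

With 100,000 runs the estimate typically lands within a few thousandths of the exact value; more runs would narrow the gap further.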