Birthday Problem

The birthday problem (also called the birthday paradox) deals with the probability that in a set of $n$ randomly selected people, at least two people share the same birthday.

Though it is not technically a paradox, it is often referred to as such because the probability is counter-intuitively high.

The birthday problem is an answer to the following question:

In a set of $n$ randomly selected people, what is the probability that at least two people share the same birthday?
What is the smallest value of $n$ where the probability is at least $50$ % or $99$ %?

Let $p(n)$ be the probability that at least two of a group of $n$ randomly selected people share the same birthday. By the pigeonhole principle, since there are 366 possibilities for birthdays (including February 29), it follows that when $n \geq 367$ , $p(n) = 100$ %. The counterintuitive part of the answer is that for smaller $n,$ the relationship between $n$ and $p(n)$ is (very) non-linear.

In fact, the thresholds to surpass $50$ % and $99$ % are quite small: $50$ % probability is reached with just $23$ people and $99$ % with just $70$ people.

Calculating the probability

For simplicity, ignore leap years. Then the probability that two people will share the same birthday out of $n$ is $1-\frac{365 \times 364 \times \cdots \times (366-n)}{365^n} = 1-\frac{365!}{(365-n)! 365^n}.$

As in the introduction, let $p(n)$ be the probability that in a set of $n$ randomly chosen people at least two people share the same birthday. Then $1-p(n)$ is the probability that every single one of them has distinct birthdays.

The number of ways to pick $n$ distinct birthdays from a set of $365$ days (when the order in which you pick the birthdays matters) is $365 \times 364 \times \cdots \times (366-n);$ this is because each successive birthday has one fewer choice of days left. This is the numerator of $1-p(n).$

The number of possibilities for the birthdays for $n$ people (again the order matters) is $365^n$ . This is the denominator of $1-p(n).$

The formula follows immediately:

$1-p(n)=\frac{365 \times 364 \times \cdots \times (366-n)}{365^n}=\frac{365!}{(365-n)! 365^n} \implies p(n)=1-\frac{365!}{(365-n)! 365^n}.\ _\square$

One intuitive explanation of the phenomenon that $p(n)$ is large for small $n$ is that there are $\binom{n}{2}$ pairs of people, all of whom can share a birthday. So the probability of at least one pair matching increases more rapidly than the number of people. (This is only a crude approximation of the true behavior of $p(n),$ because the probability of multiple matches grows rapidly as well, and must be taken into account.)

The probability $p(n)$ should be not confused with the probability that there is at least one person with a fixed birthday, which is much lower. That is, if I go to a party, the probability that two people at the party share a birthday is much higher than the probability that someone at the party shares a birthday with me. The latter probability is computed in the following example.

Suppose that (ignoring leap years) the probability that a person's birthday is any given day is $\frac1{365}.$ (This does not match the real world, as certain birthdays are more common than others, but it is a useful simplification.) What is the probability that at least one of $n$ randomly chosen people was born on April 23? How large does $n$ have to be before this probability exceeds $50$ %?

The probability that no one has that birthday is $\left(\frac{364}{365}\right)^n,$ so the answer to the first question is $p(n) = 1-\left(1-\frac1{365}\right)^n.$ The smallest $n$ for which $p(n)>0.5$ is $n=253.$ (Intuitively, this is larger than $365/2$ because people in the room may share the same birthday; if everyone was guaranteed to have a different birthday, the answer would be $\lceil 365/2 \rceil = 183.)$

Note that this is not hard to obtain approximately, using the Taylor series approximation $\ln(1-x) \approx -x$ for small $x$ : if $p(n) \approx 0.5,$ then $\left(1-\frac1{365}\right)^n \approx 0.5,$ so $\begin{aligned} n \ln\left( 1-\frac1{365}\right) &\approx \ln(0.5) \\ -n/365 &\approx \ln(0.5)\\ n &\approx -365\ln(0.5) = 365\ln(2) \approx 253. \end{aligned}$

Application to cryptography

Most cryptosystems use some kind of hash function to process messages. A hash function $f$ is not injective, but is created so that collisions, or instances where $x \ne y$ but $f(x)=f(y),$ are hard to discover. An attacker who can find collisions can access information or messages that are not meant to be public. The birthday attack is a restatement of the birthday paradox that measures how collision-resistant a well-chosen hash function is. For instance, suppose that a hash function is chosen with a 64-bit range; that is, its image is a nonnegative integer less than $2^{64}.$ How many values does an attacker need to compute before the probability of a collision is greater than $50$ %?

Just as with the birthday paradox, the probability of a collision from $n$ random samples is $p_M(n) = 1-\frac{M!}{(M-n)!M^n}$ where $M = 2^{64}.$

It turns out that $p_M(n) =0.5$ when $n$ is on the order of $\sqrt{M}.$ This represents a significant improvement on the sizes involved: for instance, $2^{32}$ is roughly $4 \times 10^9,$ while $2^{64}$ is greater than $10^{19}.$ This vulnerability necessitates the use of a large hash range in practical applications.

We can derive this result through the use of a Taylor series approximation. The probability of no shared birthday within a group of $n$ people can be expressed:

$\hat{p}(n) = 1\times \left(1-\frac{1}{365} \right)\times \left(1-\frac{2}{365} \right)\cdots \times \left(1-\frac{n-1}{365} \right)$

And more generally, the probability of no two objects in a group of $n$ colliding when there are $M$ possibilities can be expressed:

$\hat{p}_M(n) = 1\times \left(1-\frac{1}{M} \right)\times \left(1-\frac{2}{M} \right)\cdots \times \left(1-\frac{n-1}{M} \right)$

For $x<<1$ , the Taylor series approximation of $e^x = 1+x$ , which means that for $n<<M$ , we can write the terms in the equation above like: $\left(1-\frac{1}{M}\right) = e^{-1/M}$ . This gives us:

$\begin{aligned} \hat{p}_M(n) &\approx 1\times e^{1/M}\times e^{2/M} \cdots \times e^{-(n-1)/M} \\ &\approx 1 \times e^{-(1+2+\cdots+(n-1))/M} \\ &\approx e^{-(n\times (n-1)/2)/M}\end{aligned}$

For $n^2<<M$ , we can use the same Taylor series approximation as before, this time to convert from exponential back to $1-x$ :

$\hat{p}_M(n) \approx 1-\frac{n\times(n-1)}{2M}\approx 1-\frac{n^2}{2M}$

The probability of there being a collision is just the complement of this:

$p_M(n) \approx \frac{n^2}{2M}$

Plugging in $p_M(n) = 0.5$ to this equation, we can see why $n = \sqrt{M}$ corresponds roughly to a $50\%$ chance of a collision occurring:

$0.5 \approx \frac{n^2}{2M} \\ M \approx n^2 \\ n \approx \sqrt{M}$

Contents

Calculating the probability

Application to cryptography

See also