Birthday Problem
The birthday problem (also called the birthday paradox) deals with the probability that in a set of \(n\) randomly selected people, at least two people share the same birthday.
Though it is not technically a paradox, it is often referred to as such because the probability is counter-intuitively high.
The birthday problem is an answer to the following question:
In a set of \(n\) randomly selected people, what is the probability that at least two people share the same birthday?
What is the smallest value of \(n\) where the probability is at least \( 50 \)% or \( 99 \)%?
Let \(p(n)\) be the probability that at least two of a group of \(n\) randomly selected people share the same birthday. By the pigeonhole principle, since there are 366 possibilities for birthdays (including February 29), it follows that when \( n \geq 367 \), \(p(n) = 100\)%. The counterintuitive part of the answer is that for smaller \(n,\) the relationship between \(n\) and \(p(n)\) is (very) non-linear.
In fact, the thresholds to surpass \(50\)% and \(99\)% are quite small: \(50\)% probability is reached with just \(23\) people and \(99\)% with just \(70\) people.
Calculating the probability
For simplicity, ignore leap years. Then the probability that two people will share the same birthday out of \(n\) is \[ 1-\frac{365 \times 364 \times \cdots \times (366-n)}{365^n} = 1-\frac{365!}{(365-n)! 365^n}. \]
As in the introduction, let \(p(n)\) be the probability that in a set of \(n\) randomly chosen people at least two people share the same birthday. Then \( 1-p(n)\) is the probability that every single one of them has distinct birthdays.
The number of ways to pick \(n\) distinct birthdays from a set of \(365\) days (when the order in which you pick the birthdays matters) is \( 365 \times 364 \times \cdots \times (366-n);\) this is because each successive birthday has one fewer choice of days left. This is the numerator of \( 1-p(n).\)
The number of possibilities for the birthdays for \(n\) people (again the order matters) is \(365^n\). This is the denominator of \( 1-p(n).\)
The formula follows immediately:
\[1-p(n)=\frac{365 \times 364 \times \cdots \times (366-n)}{365^n}=\frac{365!}{(365-n)! 365^n} \implies p(n)=1-\frac{365!}{(365-n)! 365^n}.\ _\square\]
One intuitive explanation of the phenomenon that \(p(n)\) is large for small \(n\) is that there are \(\binom{n}{2}\) pairs of people, all of whom can share a birthday. So the probability of at least one pair matching increases more rapidly than the number of people. (This is only a crude approximation of the true behavior of \( p(n),\) because the probability of multiple matches grows rapidly as well, and must be taken into account.)
The probability \(p(n)\) should be not confused with the probability that there is at least one person with a fixed birthday, which is much lower. That is, if I go to a party, the probability that two people at the party share a birthday is much higher than the probability that someone at the party shares a birthday with me. The latter probability is computed in the following example.
Suppose that (ignoring leap years) the probability that a person's birthday is any given day is \( \frac1{365}.\) (This does not match the real world, as certain birthdays are more common than others, but it is a useful simplification.) What is the probability that at least one of \(n\) randomly chosen people was born on April 23? How large does \(n\) have to be before this probability exceeds \(50\)%?
The probability that no one has that birthday is \( \left(\frac{364}{365}\right)^n,\) so the answer to the first question is \[ p(n) = 1-\left(1-\frac1{365}\right)^n. \] The smallest \(n\) for which \(p(n)>0.5\) is \(n=253.\) (Intuitively, this is larger than \(365/2\) because people in the room may share the same birthday; if everyone was guaranteed to have a different birthday, the answer would be \( \lceil 365/2 \rceil = 183.)\)
Note that this is not hard to obtain approximately, using the Taylor series approximation \( \ln(1-x) \approx -x \) for small \( x\): if \(p(n) \approx 0.5,\) then \( \left(1-\frac1{365}\right)^n \approx 0.5,\) so \[ \begin{align} n \ln\left( 1-\frac1{365}\right) &\approx \ln(0.5) \\ -n/365 &\approx \ln(0.5)\\ n &\approx -365\ln(0.5) = 365\ln(2) \approx 253. \end{align} \]
Application to cryptography
Most cryptosystems use some kind of hash function to process messages. A hash function \(f\) is not injective, but is created so that collisions, or instances where \( x \ne y\) but \( f(x)=f(y), \) are hard to discover. An attacker who can find collisions can access information or messages that are not meant to be public. The birthday attack is a restatement of the birthday paradox that measures how collision-resistant a well-chosen hash function is. For instance, suppose that a hash function is chosen with a 64-bit range; that is, its image is a nonnegative integer less than \( 2^{64}.\) How many values does an attacker need to compute before the probability of a collision is greater than \(50\)%?
Just as with the birthday paradox, the probability of a collision from \(n\) random samples is \[ p_M(n) = 1-\frac{M!}{(M-n)!M^n} \] where \( M = 2^{64}.\)
It turns out that \( p_M(n) =0.5 \) when \( n \) is on the order of \( \sqrt{M}.\) This represents a significant improvement on the sizes involved: for instance, \( 2^{32} \) is roughly \( 4 \times 10^9,\) while \( 2^{64} \) is greater than \( 10^{19}.\) This vulnerability necessitates the use of a large hash range in practical applications.
We can derive this result through the use of a Taylor series approximation. The probability of no shared birthday within a group of \(n\) people can be expressed:
\[\hat{p}(n) = 1\times \left(1-\frac{1}{365} \right)\times \left(1-\frac{2}{365} \right)\cdots \times \left(1-\frac{n-1}{365} \right)\]
And more generally, the probability of no two objects in a group of \(n\) colliding when there are \(M\) possibilities can be expressed:
\[\hat{p}_M(n) = 1\times \left(1-\frac{1}{M} \right)\times \left(1-\frac{2}{M} \right)\cdots \times \left(1-\frac{n-1}{M} \right)\]
For \(x<<1\), the Taylor series approximation of \(e^x = 1+x\), which means that for \(n<<M\), we can write the terms in the equation above like: \(\left(1-\frac{1}{M}\right) = e^{-1/M}\). This gives us:
\[\begin{align} \hat{p}_M(n) &\approx 1\times e^{1/M}\times e^{2/M} \cdots \times e^{-(n-1)/M} \\ &\approx 1 \times e^{-(1+2+\cdots+(n-1))/M} \\ &\approx e^{-(n\times (n-1)/2)/M}\end{align}\]
For \(n^2<<M\), we can use the same Taylor series approximation as before, this time to convert from exponential back to \(1-x\):
\[\hat{p}_M(n) \approx 1-\frac{n\times(n-1)}{2M}\approx 1-\frac{n^2}{2M}\]
The probability of there being a collision is just the complement of this:
\[p_M(n) \approx \frac{n^2}{2M}\]
Plugging in \(p_M(n) = 0.5\) to this equation, we can see why \(n = \sqrt{M}\) corresponds roughly to a \(50\%\) chance of a collision occurring:
\[0.5 \approx \frac{n^2}{2M} \\ M \approx n^2 \\ n \approx \sqrt{M}\]