15th August 2023

How accurate is the birthday paradox formula?

Given r random values drawn from a large set of size N, I have been using the formula 1-exp(-r**2/(2N)) to approximate the probability of a collision, meaning that at least two of the values are equal. The formula assumes that r is much smaller than N. It suggests that if you have hundreds of millions of random 64-bit numbers, you will start getting collisions with non-trivial probability. At the one billion range, the probability of a collision is about 3% according to this formula.
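To make the arithmetic concrete, here is a minimal Python sketch of the formula. The function name collision_probability is my own choice for illustration, and I use math.expm1 rather than a plain exp only to keep precision when the probability is tiny.

```python
import math

def collision_probability(r, n):
    """Approximate probability that r random values drawn from n
    possible values contain at least one collision (assumes r << n)."""
    # 1 - exp(-x) computed as -expm1(-x) to stay accurate for small x
    return -math.expm1(-(r * r) / (2 * n))

# One billion random 64-bit values
print(collision_probability(10**9, 2**64))
# Two random 64-bit values
print(collision_probability(2, 2**64))
```

The first call gives roughly 0.027, which is the "about 3%" figure, while the second gives a value on the order of 1e-19.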

It is somewhat unintuitive because if I give you two random 64-bit values, the probability of a collision is so low that it might as well be zero.

Though it is a textbook formula, we should still test it to make sure that it is reasonable. Let us generate 32-bit random values for speed: with a smaller range, collisions show up with far fewer values per set. I use a simple frequentist approximation: I generate many sets of 32-bit random values, I count the number of sets with a collision, and I divide this number by the total number of sets.
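The experiment can be sketched in a few lines of Python. This is not the code I used for the results, only an illustration under assumptions: the function names, the seed, and the use of Python's random module are all choices made for this sketch, and the number of trials is kept small so that it runs quickly.

```python
import math
import random

def has_collision(r, bits, rng):
    """Draw r random values of the given bit width; report whether any repeats."""
    seen = set()
    for _ in range(r):
        value = rng.getrandbits(bits)
        if value in seen:
            return True  # stop at the first repeated value
        seen.add(value)
    return False

def estimate_collision_probability(r, bits, trials, seed=1234):
    """Fraction of trials (sets of r values) that contain a collision."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    hits = sum(has_collision(r, bits, rng) for _ in range(trials))
    return hits / trials

def theory(r, bits):
    """The approximation 1-exp(-r**2/(2N)) with N = 2**bits."""
    return -math.expm1(-(r * r) / (2 * 2**bits))

if __name__ == "__main__":
    r, bits, trials = 51200, 32, 40
    print("theory:  ", theory(r, bits))
    print("measured:", estimate_collision_probability(r, bits, trials))
```

With so few trials the measured value is noisy. Small probabilities, such as the one for 100 values, need millions of sets before the estimate settles.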

My results are as follows. The formula agrees with my results: I get a maximal relative error of 23%. The exact measured output depends on the random number generation and will vary depending on how you set it up. Nevertheless, it looks good! As you can see, if you have even only 51,200 32-bit random values, the probability of a collision reaches about 25%. My code is available.

number of values	theory	measured	relative error
100	0.000001	0.000001	23%
200	0.000005	0.000005	13%
400	0.000019	0.000014	23%
800	0.000075	0.000073	2%
1600	0.000298	0.000254	15%
3200	0.001191	0.001079	9%
6400	0.004757	0.004700	1%
12800	0.018893	0.017570	7%
25600	0.073456	0.071261	3%
51200	0.263006	0.240400	9%