
Written by Sarah | With 9 comments

In today’s forensic science theory lectures we got taught that not only is DNA not unique, but there is an actual chance of two people having the same DNA profile. The lecturer first explained the birthday paradox, and then tried to explain it with DNA and got me terribly confused with what numbers go where in what equations. So I’ve read up on it and will now try and explain the birthday paradox and why there are potentially thousands of doppelgangers in the world.

If you ask someone their birthday, the chance that it will be the same as yours is 1 in 365. (I am ignoring leap years and assuming a uniform distribution etc.) However, the birthday paradox is that in a room of 23 people, there is a 50% chance that two of them will share a birthday. With 57 people the chance is already about 99%. I remember this from school, where in a class of 30 we had two with the same birthday, and lo and behold we had two with the same birthday in the forensics class today. It works because you’re not comparing your one birthday with everyone’s; you’re comparing everyone’s with everyone’s, which improves the odds of finding a match dramatically. You can read more about it on Wikipedia.
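The "everyone's with everyone's" idea can be checked directly. This is a minimal sketch, assuming 365 equally likely birthdays; the function name `exact_birthday_probability` is made up for illustration. It multiplies together the chance that each new person misses all the birthdays already taken, so no huge factorial is needed.

```python
def exact_birthday_probability(people, days=365):
    """Exact chance that at least two of `people` share a birthday,
    assuming every day is equally likely (no leap years)."""
    if people > days:
        return 1.0  # more people than days, so a shared birthday is certain
    p_all_different = 1.0
    for already_taken in range(people):
        # The next person must avoid every birthday already taken
        p_all_different *= (days - already_taken) / float(days)
    return 1.0 - p_all_different


for n in (23, 30, 57):
    print('%d people: %.1f%%' % (n, 100 * exact_birthday_probability(n)))
```

For 23 people this comes out just over 50%, and for 57 people just over 99%.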

The formula involves calculating 365!, which is just enormous, so approximations are used instead, such as one based on the Taylor series, which I’ve implemented here in Python.

    import math

    def computeBirthday(num, total):
        """
        Computes a rough estimate of the probability for the birthday problem,
        based on a Taylor series: p(n) =~ 1 - e^((-num^2)/(2*total))
        """
        denom = total * 2.0
        nom = -(math.pow(num, 2))
        p = nom / denom
        e = math.pow(math.e, p)
        ans = (1 - e) * 100
        # Format the string before printing so this runs on Python 2 and 3
        print('%s%%' % ans)
        return ans

If you call *computeBirthday(23, 365)* you get the answer 51.55% (which is roughly the correct answer of 50.7%).
I’ve also made the reverse, which computes the approximate number of people needed to get a 50% chance of a match. Calling *howManyFor50Percent(365)* gives the answer 22.9999, which is pretty much 23.

    import math

    def howManyFor50Percent(n):
        """
        Computes the approximate number of people needed to get a 50% chance
        of a match: N = 1/2 + squareroot(1/4 - (2*n) * ln(0.5))
        """
        radicand = 0.25 - (2 * n) * math.log(0.5)
        ans = 0.5 + math.sqrt(radicand)
        # Format the string before printing so this runs on Python 2 and 3
        print('%s' % ans)
        return ans
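The post doesn't say where the formula in that docstring comes from, so this is only my reading of it: it looks like the same exponential approximation, with the number of pairs written as N(N-1) instead of N^2, set equal to 50% and solved as a quadratic in N. Here N is the number of people and n is the number of possible outcomes (365 for birthdays).

```latex
\begin{align*}
1 - e^{-N(N-1)/(2n)} &= \tfrac{1}{2} \\
-\frac{N(N-1)}{2n} &= \ln(0.5) \\
N^2 - N + 2n\ln(0.5) &= 0 \\
N &= \tfrac{1}{2} + \sqrt{\tfrac{1}{4} - 2n\ln(0.5)}
\end{align*}
```

Only the positive root makes sense as a head count, and since ln(0.5) is negative the term under the square root is always positive.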

I then applied this to DNA. This article explains that the birthday problem works in just the same way for DNA. I’ve often been told the odds of a random person having the same DNA as you are 1 in a billion. There is a national DNA database in the UK which currently has 3.4 million profiles according to the Home Office. So swapping 1/365 for 1/1,000,000,000 and 23 people for 3,400,000 people, what will the result be? 100%! So there is a 100% chance that there is at least one matching DNA profile in the database. That itself is not so amazing. What is amazing is when you do *howManyFor50Percent(1000000000)* and get the answer 37,233. You only need 37,233 people before you get a 50% chance of a matching DNA profile!
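The DNA numbers can be reproduced with a few lines. This is a rough sketch using the same approximation as the birthday case, assuming the 1 in a billion figure holds for every pair of profiles; the name `match_probability` is made up for illustration. It uses `math.expm1` so that very small probabilities are not lost to rounding.

```python
import math


def match_probability(profiles, odds):
    """Approximate chance of at least one matching pair among `profiles`
    when a random pair matches with probability 1/odds.
    Same approximation as before: 1 - e^(-profiles^2 / (2 * odds))."""
    exponent = -(profiles ** 2) / (2.0 * odds)
    # expm1(x) computes e^x - 1 accurately, so -expm1(x) is 1 - e^x
    return -math.expm1(exponent)


for profiles in (1000, 37233, 3400000):
    percent = 100 * match_probability(profiles, 1e9)
    print('%9d profiles: %.4f%%' % (profiles, percent))
```

With 37,233 profiles the result sits at about 50%, and with 3.4 million it is indistinguishable from 100%.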

I tried it with bigger numbers, and 274,000 seems to be the minimum number of people needed to have a 100% chance of finding a match. Assuming 6,796,000,000 people in the world, that means 24,803 people (6,796,000,000/274,000) in the world with doppelgangers! And 222 people in the UK (assuming a population of 60,943,912). This of course is crude and approximate, and doesn’t consider twins, ethnicity, and family members. But still, I think it's awesome maths! If I have any calculations wrong, please comment :)
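One caveat on the 100% figure, as far as I can tell: under this approximation the chance of no match is e^(-n^2 / (2 * odds)), which gets tiny but never reaches zero. A "100%" around 274,000 appears to be the point where an ordinary floating-point number can no longer tell the result apart from 1. A small sketch, assuming standard double-precision floats (the function name `no_match_probability` is made up for illustration):

```python
import math


def no_match_probability(profiles, odds):
    """Approximate chance of NO matching pair: e^(-profiles^2 / (2 * odds))."""
    return math.exp(-(profiles ** 2) / (2.0 * odds))


q = no_match_probability(274000, 1e9)
print('chance of no match: %.3e' % q)  # tiny, but greater than zero
print('1 - q rounds to exactly 1.0: %s' % (1.0 - q == 1.0))
```

So the answer is better read as "so close to 100% that the computer rounds it" than as a guarantee.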

As you probably know, it's actually impossible for two people to have identical DNA, unless they're identical twins, triplets (etc). So is a DNA profile a set of (lacking the proper biology/forensics terminology) "points of reference" in the DNA structure that are compared, so that two people can have the same profile but not truly the same structure, because some genes (possibly redundant/dormant ones) will be different? Also, using a Taylor expansion, is it actually a 100% guarantee of finding someone else with the same DNA profile, or just 99.99999...%? Because as far as I understand, calculating via these expansion algorithms always has an O(n^x) margin of error, where x is n + 1, where n is equal to the number of terms you calculated. Anyway, regardless, it's pretty scary to think that my DNA profile matches someone else's, and I'm pretty sure my DNA is now on a couple of different governmental databases - yay!

Mark Jones

Tue, 10 Nov 2009 03:36AM

I tend to think of DNA profiles as being analogous to md5 hashes. Make enough comparisons and you'll eventually get a hash collision. That said, I know a lot more about md5 than I do about DNA.

Peter Stewart

Tue, 10 Nov 2009 10:21AM

I see your point, Mark - I think maybe DNA and DNA profiles are different (I don't know enough about the science TBH), so maybe you can have 222 people in the UK with the same DNA profile, but not *identical* DNA, which means they may look different. (I wonder how different? A perfect DNA profile match must mean a lot of common features?) Also, yes, the Taylor series does give 99.99999...%, but it's an approximation. I think the real answer tends to 100%, but it requires crazy large numbers!

Sarah

Tue, 10 Nov 2009 04:05PM

Firstly - feature request! I totally had no idea you responded to my comment, so email notifications? Secondly - keep up the forensics posts, I find them interesting :) Thirdly - Love the way the comment box is clearly separate from the post, I hate it when it's indistinguishable from the post.

Mark Jones

Sun, 15 Nov 2009 04:49AM

I shall have a go at your feature request! :)

Sarah

Sun, 15 Nov 2009 09:29AM

Feature added! I hope it works :P

Sarah

Sun, 15 Nov 2009 04:21PM

Hey, I received that! But the link 404'ed on me, then Nybble disapproved ....

Mark Jones

Sun, 15 Nov 2009 11:11PM

Aha oops! Link now fixed?

Sarah

Mon, 16 Nov 2009 08:42AM

Yeps. That worked :)

Mark Jones

Mon, 16 Nov 2009 02:37PM
