Last year, I wrote a blog post, Noise threshold on atDNA Matches, in which I calculated mathematically that an IBS noise segment cannot run longer than 150 consecutive SNPs. However, that calculation assumes that all possible genotypes occur in the world's population. In reality, populations are not that diverse. One reason I want to investigate this is that I am eager to find IBS compound segments between myself and ancient DNA, but I am unsure which thresholds to use to eliminate noise.
For example, I said:
Every genotype, say AG, will match on either allele (A will match AA, AG, AT, AC; G will match AG, GG, GC, GT). Taking the union, AG will match AA, AG, AT, AC, GG, GC, GT (7 genotypes out of a possible 10).
But in reality, what if only 2 genotypes, say AA and AG, are found in populations at a SNP? They always match each other, so the probability is not 0.7 but always 1. The probability of a matching segment of consecutive SNPs therefore varies drastically and depends purely on the genotypes found in populations.
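The 7-out-of-10 figure and the effect of a restricted genotype pool can be checked with a few lines of Python. This is only a sketch: the function names are mine, the uniform pool is the simplifying assumption from the earlier post, and the 50/50 split between AA and AG is a made-up illustration rather than real frequency data.

```python
from itertools import combinations_with_replacement

ALLELES = "ACGT"


def all_genotypes():
    """The 10 unordered genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT."""
    return ["".join(pair) for pair in combinations_with_replacement(ALLELES, 2)]


def half_match(g1, g2):
    """Two genotypes match if they share at least one allele."""
    return bool(set(g1) & set(g2))


def match_probability(genotype, population):
    """Chance that `genotype` matches a genotype drawn from `population`.

    `population` maps genotype -> frequency, with frequencies summing to 1.
    """
    return sum(freq for g, freq in population.items() if half_match(genotype, g))


# Assumption from the earlier post: every genotype is equally likely.
uniform = {g: 1 / 10 for g in all_genotypes()}
# Hypothetical SNP where only AA and AG are ever observed.
restricted = {"AA": 0.5, "AG": 0.5}

p_uniform = match_probability("AG", uniform)        # about 0.7
p_restricted = match_probability("AG", restricted)  # 1.0

# Chance of 150 matches in a row, treating SNPs as independent (a simplification).
print(p_uniform ** 150, p_restricted ** 150)
```

With the uniform pool, a run of 150 chance matches is vanishingly unlikely; with the restricted pool, the run never breaks, which is why the threshold needs to be re-examined against real genotype frequencies.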
Solution
To solve this problem, I took the genotype frequencies from OpenSNP and created a random file that has exactly the same SNPs as my autosomal file, except that each genotype is randomized based on "what is found in populations". I will create multiple random files and compare them with my autosomal file to see how well they match. This will help us figure out the actual noise threshold.

I started off with a threshold of 150 SNPs / 1 Mb and didn't get any matches. So I reduced the threshold to 100 SNPs, and below are the matching segments with each random file.
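Before the results, here is a minimal sketch of the randomization step. It is not the code from GitHub: the `FREQS` table, the rsids, and the function names are hypothetical, the counts are invented, and skipping SNPs that have no frequency data is my assumption about how such gaps might be handled.

```python
import random

# Hypothetical genotype counts per SNP (rsid -> {genotype: count}).
# Real counts would come from OpenSNP; these are invented for illustration.
FREQS = {
    "rs0001": {"AA": 60, "AG": 35, "GG": 5},
    "rs0002": {"CC": 100},
    "rs0003": {"CT": 50, "TT": 50},
}


def random_genotype(counts, rng):
    """Draw one genotype, weighted by how often it was observed."""
    genotypes = list(counts)
    weights = list(counts.values())
    return rng.choices(genotypes, weights=weights, k=1)[0]


def make_random_file(snps, freqs, rng):
    """Build a random file with the same SNPs as the input file.

    `snps` is a list of (rsid, chromosome, position) taken from a real file.
    SNPs without frequency data are skipped (an assumption of this sketch).
    """
    rows = []
    for rsid, chrom, pos in snps:
        counts = freqs.get(rsid)
        if not counts:
            continue
        rows.append((rsid, chrom, pos, random_genotype(counts, rng)))
    return rows


rng = random.Random(42)  # fixed seed so a given random file can be reproduced
my_snps = [
    ("rs0001", "1", 1000),
    ("rs0002", "1", 2000),
    ("rs0003", "1", 3000),
    ("rs0004", "1", 4000),  # no frequency data, so it is skipped
]
random_file = make_random_file(my_snps, FREQS, rng)
for row in random_file:
    print(*row, sep="\t")
```

Because each genotype is drawn only from what was actually observed at that SNP, a SNP like `rs0002` in this toy table always produces the same genotype, which is exactly the situation that inflates chance matches.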
Autosomal match with random file #1
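| Chr | Start Position | End Position | Len(Mb) | SNPs |
|-----|----------------|--------------|---------|------|
| 4   | 92845437       | 93847904     | 1.00247 | 108  |
| 8   | 78394149       | 79510525     | 1.11638 | 145  |
| 17  | 22034501       | 25874456     | 3.83995 | 103  |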
Largest Segment: 3.83995 Mb
Total Shared: 5.9588 Mb
Autosomal match with random file #2
| Chr | Start Position | End Position | Len(Mb) | SNPs |
|-----|----------------|--------------|---------|------|
| 4   | 48960863       | 53399616     | 4.43875 | 106  |
| 18  | 14550375       | 19351344     | 4.80097 | 132  |
Largest Segment: 4.80097 Mb
Total Shared: 9.23972 Mb
Autosomal match with random file #3
| Chr | Start Position | End Position | Len(Mb) | SNPs |
|-----|----------------|--------------|---------|------|
| 3   | 162511139      | 163554331    | 1.04319 | 108  |
| 6   | 58047654       | 62510310     | 4.46266 | 105  |
Largest Segment: 4.46266 Mb
Total Shared: 5.50585 Mb
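The segments listed above are runs of consecutive matching SNPs that pass both the SNP-count and the length threshold, with the length equal to (end - start) / 1,000,000. A scan of that kind might look roughly like the sketch below. The function names and the row layout are hypothetical, this is not the GitHub code, and treating a no-call as a mismatch is an assumption of the sketch.

```python
def half_identical(g1, g2):
    """True if the two genotypes share at least one allele.

    A no-call such as "--" shares nothing, so it breaks a run here
    (an assumption of this sketch).
    """
    return bool(set(g1) & set(g2))


def find_segments(rows, min_snps=100, min_mb=1.0):
    """Find runs of consecutive matching SNPs that pass both thresholds.

    `rows` is a list of (chromosome, position, genotype_a, genotype_b),
    sorted by chromosome and position.
    Returns a list of (chromosome, start, end, length_mb, snp_count).
    """
    segments = []
    run = []  # (chromosome, position) of each SNP in the current run

    def close_run():
        if len(run) >= min_snps:
            chrom, start = run[0]
            end = run[-1][1]
            length_mb = (end - start) / 1_000_000
            if length_mb >= min_mb:
                segments.append((chrom, start, end, round(length_mb, 5), len(run)))
        run.clear()

    for chrom, pos, mine, other in rows:
        if run and run[0][0] != chrom:
            close_run()  # a run never spans two chromosomes
        if half_identical(mine, other):
            run.append((chrom, pos))
        else:
            close_run()  # a mismatch ends the current run
    close_run()
    return segments


# Synthetic data: one run that qualifies, one with too few SNPs, one too short.
rows = [("1", 1_000_000 + i * 10_000, "AG", "AA") for i in range(120)]
rows.append(("1", 2_200_000, "AA", "GG"))  # mismatch
rows += [("1", 2_300_000 + i * 10_000, "CT", "TT") for i in range(50)]
rows += [("2", 5_000_000 + i * 1_000, "CC", "CC") for i in range(120)]

segments = find_segments(rows)
largest = max((s[3] for s in segments), default=0)
total = sum(s[3] for s in segments)
print(segments, largest, total)
```

Lowering `min_snps` from 150 to 100, as I did above, only changes which runs survive the final filter; the scan itself stays the same.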
The source code used to generate random autosomal files using genotypes found among populations and OpenSNP genotype frequencies can be downloaded from GitHub.