23andMe V4 not compatible with some Autosomal Genetic Genealogy Tools!

June 9, 2014, 4:13 am

I received several complaints that 23andMe data doesn't work with some genetic genealogy tools, so I investigated why and learnt about the new V4 chip from 23andMe. After reading several forum posts and doing my own investigation, I found the following: even though 23andMe V4 has around 596869 SNPs, only 310690 of the 714533 SNPs in Family Tree DNA match (for Chr 1-22 and X). So, to compare a V4 kit with FTDNA, one must assume the remaining 403843 SNPs match, which gives very inaccurate results and makes V4 unsuitable for any reasonable autosomal comparison. This may be the reason why FTDNA does not allow V4 transfers into their database. Hence, I regret to say that 23andMe V4 will not be compatible with the Genetic Genealogy tools below for now.
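The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the helper `overlap_summary` and the toy rsID lists are hypothetical, and only the two counts are taken from the post.

```python
def overlap_summary(reference_snps, other_snps):
    """Return (shared, missing) counts of reference SNPs relative to another chip."""
    reference = set(reference_snps)
    shared = len(reference & set(other_snps))
    missing = len(reference) - shared
    return shared, missing


# Toy example with made-up SNP names: two of three reference SNPs are shared.
print(overlap_summary(["snp_a", "snp_b", "snp_c"], ["snp_b", "snp_c", "snp_d"]))  # (2, 1)

# Counts quoted in the post (Chr 1-22 and X).
FTDNA_TOTAL = 714533
SHARED_WITH_V4 = 310690

missing = FTDNA_TOTAL - SHARED_WITH_V4
print(missing)                                        # 403843
print(round(100 * SHARED_WITH_V4 / FTDNA_TOTAL, 1))   # 43.5
```

In other words, under half of the FTDNA SNP positions can actually be compared, and the rest would have to be treated as matching by assumption.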

Genetic Genealogy Tools affected:

↧

$100 Off Coupon for Big Y Tests

June 9, 2014, 10:00 pm

I received a coupon from Family Tree DNA which gives $100 off a Big-Y test. Unfortunately, I don't intend to do any more Big-Y tests this year. Hence, I am posting my coupon code on my blog. Please note that this coupon code can only be used once, which makes it first come, first served.

The Coupon Code is FDS140876. You can order Big-Y from the FamilyTreeDNA website and follow the pink banner.

↧

Minor Thesis - An evidence-based Android cache forensics model

August 26, 2014, 10:24 pm

Please have a look at the minor thesis I submitted for my master's degree.

Thesis Abstract

Android is the most popular and widely used mobile operating system. Although Android is one of the most actively researched areas in the field of mobile forensics, analysis of Android caches is an understudied research topic – the focus of this thesis. Due to the diversity of caches and the developer’s heavy reliance on third-party libraries, this thesis proposes a cache taxonomy based on its usage, as the key to investigating Android caches is to first classify and identify them. This helps to ensure the choice of appropriate tool(s) to extract potential evidential data. A systematic process to forensically extract, analyse and investigate Android caches is proposed, which is based on the widely accepted McKemmish (1995) forensic model. The proposed Android Cache Forensic Process, the primary contribution of this thesis, is validated using nearly 100 popular apps. Previously unknown cache formats are decoded and several undocumented cache formats used commonly by Android apps are documented. Based on the findings, an Android Cache Viewer prototype is developed which is the secondary contribution of this thesis. This working prototype, as demonstrated in this thesis, is able to successfully decode Android caches and display the contents in a user-friendly manner.

Source Code at GitHub.
License: MIT
Link: https://wiki.cis.unisa.edu.au/wiki/2014FelixChandrakumar-minorthesis

↧

Clovis-Anzick-1 ancient DNA have matches with living people!

September 21, 2014, 6:36 am

I converted the genome sequence of a male infant (Anzick-1), recovered from the Anzick burial site in western Montana, into formats familiar to genetic genealogists and made it available here.

I also uploaded it to GEDMatch as kit# F999912 (with FTDNA SNPs), and you might be surprised to hear that this 12,500 year old infant has some 3rd cousins living today. Interestingly, most of the first 15 matches to kit F999912 have Y-haplogroup Q1a3a*, and the haplogroup of the infant is also Q-L54 (which is Q1a3a).

GEDMatch #F999912: One-to-Many matches.

Please leave your comments and possible questions for this mystery of 12,500 year old ancient DNA matching living people today ...

Update: 23-Sep-2014
Follow-up blog: Ancient Amerindian DNA: How valid are the matches?

Update: 28-Sep-2014
I uploaded a new kit filtered with SNPs used by most DNA companies: Uploaded New GEDMatch Kit for Clovis-Anzick-1. Please use GEDMatch kit# F999913, as I have deleted F999912 to avoid redundant kits. In spite of this change, there is not much difference in matches. After some conversation with GEDMatch, it seems there are no-calls, and the kit contains SNPs not tested by DNA companies and vice versa. Hence, I am trying to process the BAM files provided by the authors to see if I can get more SNPs and get to the bottom of this mystery.

Update: 8-Oct-2014
Uploaded a new kit F999919 processed from Clovis Anzick-1 BAM file. Refer: New Clovis Anzick-1 kit in GEDMatch: F999919

↧

Ancient Amerindian DNA: How valid are the matches?

September 23, 2014, 7:25 am

Update: 8-Oct-2014
Uploaded a new kit F999919 processed from Clovis Anzick-1 BAM file. Refer: New Clovis Anzick-1 kit in GEDMatch: F999919

This blog is a follow-up of my previous post, Clovis-Anzick-1 ancient DNA have matches with living people! If you haven't read my earlier post, I suggest reading that first.

Quick Recap

Just a quick recap: I processed the raw data for Clovis-Anzick-1 and uploaded it to GEDMatch, and to my surprise, there are matches as near as 3rd to 4th cousins. Now, that's a real problem, because the matches are to a DNA sample older than 12500 years. This is practically impossible and very mysterious. I will investigate step by step and look at all the possibilities and failure points that could explain the problem. But before that, we need to be absolutely sure that these matches are indeed valid. From the matches, I requested phased kits and I indeed got them - thanks to Mario Diaz and Veronica.

Phased Matches

Both Mario Diaz (F349738) and his daughter (F338998) match Clovis-Anzick-1. For his daughter, though, the threshold had to be lowered a little, to 5 cM/500 SNPs. This is acceptable because his daughter is the next generation. The phased kits are PF338998M1 and PF338998P1, since phasing is done on the daughter's kit.

Phased segment matching Clovis-Anzick-1

In the above diagram, the 2 segments from the daughter that match the Clovis-Anzick-1 sample also match the daughter's phased paternal kit (PF338998P1). This confirms that the 2 matching segments from the daughter, on chromosomes 1 and 6, are indeed IBD (Identity By Descent). This does not mean the rest are IBS (Identity By State) or random matches. Instead, the rest are compound segments caused by pedigree collapse through endogamous marriages.

One might object that the IBD is confirmed for just 1 generation and the father's segment could very well be a compound segment. Yes, it may be, because the father's kit is not phased. However, we can still compare the matches with each other. If we take the segments matching Clovis Anzick-1 and look at how those segments match each other, it is quite evident that these segments were passed down to the matching people from a recent common ancestor.

This confirms that the matches are indeed valid and IBD.

Veronica's (FN111284) phased kit (PFN111284P1) matches perfectly at the 5 cM/500 SNPs threshold, with a largest segment of 8 cM.

This proves 3 segments matching the Clovis-Anzick-1 sample are IBD (Identity by Descent) through her father (paternal kit).

Triangulation

Some matches with Clovis-Anzick-1 can also be triangulated.

Triangulation with 3 people with each other and Clovis Anzick-1 at 7 cM/700+ threshold.

Triangulation between Clovis Anzick-1, M193252 and M174237

Triangulation between Clovis Anzick-1, FN111284 and A778817
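As a rough sketch of what triangulation checks: three kits triangulate when the three pairwise matching segments all cover a common stretch of the same chromosome. The snippet below only illustrates that overlap test. The `Segment` type and the positions are hypothetical, and this is not GEDMatch's implementation.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    chrom: str
    start: int  # base-pair position where the match starts
    end: int    # base-pair position where the match ends


def common_overlap(a: Segment, b: Segment, c: Segment) -> Optional[Segment]:
    """Return the region covered by all three pairwise segments, or None."""
    if not (a.chrom == b.chrom == c.chrom):
        return None
    start = max(a.start, b.start, c.start)
    end = min(a.end, b.end, c.end)
    return Segment(a.chrom, start, end) if start < end else None


# Made-up pairwise matches: ancient vs kit 1, ancient vs kit 2, kit 1 vs kit 2.
ancient_vs_kit1 = Segment("1", 100, 500)
ancient_vs_kit2 = Segment("1", 200, 600)
kit1_vs_kit2 = Segment("1", 150, 450)
print(common_overlap(ancient_vs_kit1, ancient_vs_kit2, kit1_vs_kit2))
# Segment(chrom='1', start=200, end=450)
```

A shared region on its own does not establish the length in cM or the SNP count, so the cM/SNP thresholds quoted above still have to be applied to the overlapping part.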

If your kit is phased, and you and the phased kit have a matching segment with Clovis-Anzick-1, please let me know and I will be happy to post it here which helps to confirm IBD segments.

As mentioned earlier, most of the first 15 matches to kit ~~F999912~~ F999913 have Y-haplogroup Q1a3a*, and the haplogroup of Clovis-Anzick-1 is also Q-L54 (which is Q1a3a), which confirms the paternal lineage.

If the matches are valid with IBD segments, how valid are the generations? Are they truly recent cousins? If we take only the IBD segments and based on IBD data of known relationships, the relation can only be very recent cousins.

If we know the matching segments are IBD and the relation is at recent cousin level, which could just be last century, how could it be 12500 years old and considered ancient? Well, that's an interesting question the authors of the paper need to answer. Clearly something is wrong here. Irrespective of what went wrong, the conclusion is quite evident.

Conclusion

The unexpected result behind these mystery matches could very well be due to some mistake in, or contamination of, the Clovis-Anzick-1 DNA sample itself. Roberta blogged about this in Analyzing the Native American Clovis Anzick Ancient Results, where she shares her views. While I mostly agree regarding the matches having more Amerindian in admixture, the validity of the age as 12500 years old is never questioned. Even if we assume all the matches are due to endogamous marriages, 12500 years is impossible.

Clearly, an IBD segment of 5 cM with more than 500 SNPs, with total IBD segments around 10+ cM, cannot be 12500 years old. This is a fact that can be verified using known relationships in families, and DNA companies have been using these benchmarks all along for showing genetic matches. This fact is more than enough to conclude that the Clovis-Anzick-1 sample is not actually ancient. My best guess is that the infant boy's sample is just from the last century and was wrongly labeled as 12500 years old, or the sample got contaminated.

This is purely my opinion on what I can see but I could very well be wrong.

↧

Mal’ta MA-1 Ancient DNA Analysis

September 25, 2014, 9:13 pm

After processing the Mal’ta MA-1 Ancient DNA sample, I proceeded with the output data to see what I can find.

Haplogroups

The Y-Haplogroup is R, but the sample is also positive for R1b1a2a1a2c1b4 (or R-CTS3087). The mtDNA is U, but it also has defining mutations for K2a5, U5a1, U1a'c, K1b1 and U6a3a1. Hence, the Y-Haplogroup is R and the Mt-Haplogroup is U.

Y-STR

LobSTR reports the following Y-STR values.

DYS458 = 16
DYS425 = 10
DYS462 = 12

Based on my experience, the actual Y-STR may be off by 1.

Telomere

The kit also has a telomere length of 4.7174 which may suggest the boy's age was roughly around 10-15 at the time of his death.

Plotting Telomere of 4.7171 kb
(Image adapted from http://learn.genetics.utah.edu/content/chromosomes/telomeres/)

Autosomal

I uploaded to GEDMatch as #F999914.

Based on the MDLP K23b calculator,

MDLP K23b admixture for Mal'ta Boy

I was surprised by the 4.5% South Indian admixture, which interests me. So, I proceeded with HarappaWorld.

HarappaWorld admixture for Mal'ta Boy

HarappaWorld puts South Indian admixture at 10%.

I am waiting for batch processing to complete, but you can proceed with one-to-one comparison. I did so with my own kit and found nothing significant. I don't expect anyone to match because it is an ancient sample, but there may also be surprises. Let me know what you find.

↧

Uploaded New GEDMatch Kit for Clovis-Anzick-1

September 26, 2014, 6:04 pm

Update: 8-Oct-2014
Uploaded a new kit F999919 processed from Clovis Anzick-1 BAM file. Refer: New Clovis Anzick-1 kit in GEDMatch: F999919

Clovis-Anzick DNA was uploaded to GEDMatch with FTDNA SNPs alone as kit# F999912, since I was unable to upload the complete file, which is larger, and I didn't expect anyone to match because it is ancient DNA. Interestingly, it had significant matches with living people, which was unexpected. I did my own analysis of the matches, which include triangulated and phased matches, and decided to investigate further. So, I extracted the SNPs used by all DNA companies (FTDNA, 23andMe v3 and v4, Ancestry) and created a new kit, F999913, which has significantly more SNPs and replaces F999912.

Kits F999912 vs F999913

Looking at the GEDmatch DNA file diagnostic utility for F999912, Chromosome 21 and X are still below the GEDMatch threshold.

GEDMatch DNA Diagnostic Utility for F999912

However, for the new kit F999913, all chromosomes are above the GEDMatch threshold.

GEDMatch DNA Diagnostic Utility for F999913

~~Once the batch processing for F999913 is complete, I will remove the kit F999912.~~ I have removed the kit F999912, and one-to-many comparison for F999913 is available in GEDMatch.

↧

Analyzing La Braña-Arintero Ancient DNA

September 27, 2014, 4:12 pm

An approximately 7,000-year-old Mesolithic skeleton discovered at the La Braña-Arintero site in León, Spain, was sequenced to retrieve a complete pre-agricultural European human genome. I converted the raw sequence reads supplied in the scientific paper to formats familiar to genetic genealogists and uploaded them here and also to GEDMatch as kit# F999915.

Y-DNA

Y-Haplogroup: C-V183

Y-STRs:

DYS638 = 11
DYS461 = 12

Based on my experience, the values could be offset by 1.

mt-DNA

The mt-DNA Haplogroup for the kit is U5b2c1

Telomere

The kit has a telomere length of 5.83782 which may suggest the boy's age was roughly around 10 to 12 years at the time of his death.

Plotting telomere length of 5.878 kb
(Image adapted from http://learn.genetics.utah.edu/content/chromosomes/telomeres/)

Autosomal

Based on the MDLP-K23b calculator, below is the admixture.

MDLP-K23b admixture for La Braña-Arintero Ancient DNA

Runs of Homozygosity

RoH reveals the parents of La Braña-Arintero are not related to each other in his genealogical time frame.

I did a one-to-one comparison with my DNA and found nothing in common. Maybe you have something in common. Let me know what you think.

↧

Mal'ta MA-1 ancient DNA have matches with a few living people on X-Chromosome!

September 27, 2014, 8:09 pm

I posted the processed sequence of an ancient genome of an individual (MA-1) from Mal’ta in south-central Siberia, in formats familiar to genetic genealogists, which I made available here and uploaded to GEDMatch as kit# F999914. I also posted a blog analyzing what I found initially. I wasn't expecting any surprise matches, but I was wrong again. Unlike the Clovis-Anzick-1 sample, which has matches on autosomal DNA, the Mal'ta MA-1 sample has matches on the X chromosome. My immediate thought was wow! These matches, although small, deserve further investigation.

X-Matches

GEDMatch 1-to-many matches for Mal'ta Sample

So, I decided to triangulate. Guess what? I was able to successfully triangulate at 5 cM/500 SNPs (considering the fact that the sample is ancient).

Triangulation

Triangulation between M175228, M133519 with MA-1 sample

Triangulation between M175228, M184156 with MA-1 sample

Triangulation between M061014, M000556 with MA-1 sample

Let me know your thoughts and comments!

↧

Mezmaiskaya neanderthal kit from GEDMatch removed

October 1, 2014, 7:53 pm

I am revisiting some of the kits uploaded to GEDMatch that have few SNPs in common with DNA testing companies. I found one such kit, the Mezmaiskaya Neanderthal, uploaded as kit F999909.

You can see how few SNPs are present using the GEDmatch DNA file diagnostic utility. Since the kit doesn't add any value, I am removing it from GEDMatch. But the processed files are still available to download from y-str.org.

Mezmaiskaya Neanderthal Kit

This is to improve the quality of Ancient DNA kits uploaded to GEDMatch.

↧

Linearbandkeramik (LBK) ancient DNA matches with living people!

October 2, 2014, 4:53 pm

The LBK sample is a ~7,500 year old early farmer from the Linearbandkeramik (LBK) culture from Stuttgart in Germany. I converted the raw data provided by the authors in their publication Ancient human genomes suggest three ancestral populations for present-day Europeans into formats familiar to genetic genealogists, which can be downloaded here. I uploaded it to GEDMatch as kit# F999916 to see if there are any matches. It has been very interesting to see how ancient DNA matches with us, which SNPs are common/uncommon and what the admixture is. I had been getting mixed results - matches with living people for the Clovis Anzick-1 sample, no matches with La Braña-Arintero and X-chromosome matches with Mal'ta (MA-1) at 5 cM/500 thresholds.

Today, I noticed that batch processing for LBK (Linearbandkeramik ancient DNA) had completed. I was eager to see the results and was amazed to find matches with living people again. LBK has 10 cM as its largest segment (at 5 cM/500 SNPs threshold), but unlike the Clovis Anzick-1 sample, the total cM is not so high, which suggests LBK is not from a highly endogamous population.

Admixture

MDLP K23b Admixture for LBK

GEDMatch Matches

One-to-Many matches for LBK

Unlike the Clovis Anzick-1 sample, LBK has a lot of SNPs in common with DNA testing companies. Hence, it is easy to confirm or reject a matching segment. So, I proceeded to compare segments one-to-one with no errors at a higher threshold (7 cM/700 SNPs) to see if anyone matches, and yes, someone does.

This segment is a solid match with 9.4 cM/1267 SNPs.

When I tried to triangulate, I noticed the segment is broken into two or three pieces, which suggests there were a few occasions of cousin marriages or endogamy.

Runs of Homozygosity reveals LBK's parents are not related.

Conclusion

Ancient DNA is not supposed to match living people as recent cousins. Yet, just like the Clovis-Anzick-1 ancient DNA, there are matches. I am not sure how to interpret these matches, but my reasoning inclines me towards the sample not being ancient. Let me know what you think.

↧

My Analysis of Motala-12 ancient DNA

October 4, 2014, 5:52 pm

The Motala samples come from the site of Kanaljorden in the town of Motala, Östergötland, Sweden. The site was excavated between 2009 and 2013. The authors state that these samples are between 7,013 and 6,701 years old. I converted the raw data supplied in this scientific paper to formats familiar to genetic genealogists. I also filtered it with SNPs tested by DNA testing companies like FTDNA, 23andMe and Ancestry in order to upload to GEDMatch, but found that these ancient samples, except Motala-12, have few SNPs in common with them. Hence, I am not uploading the rest to GEDMatch. Motala-1, Motala-2, Motala-3, Motala-4, Motala-6, Motala-9 and Motala-12 are available for download. Motala-12 is uploaded to GEDMatch as kit# F999917.

Admixture

MDLP K23b Calculator for F999917

Because this sample is European, I went ahead to use the Eurogenes calculator.

Eurogenes Calculator for F999917

There is 0.75% Amerindian in the Motala-12 sample.

Y-DNA

The Y-Haplogroup corresponds to I-L460 (I2a). However, some SNPs are positive and some are negative for I2a1b, which may suggest a new lineage from I2a1b.

ISOGG Y-Tree for I2

Y-STR

DYS617 = 12
DYS385a/b = 14
DYS460 = 10
DYS464a/b/c/d = 15

LobSTR reports the above Y-STR values. Based on my experience, it may be off by 1.

Mt-DNA

Mt-Haplogroup is U2e1, which is same as Motala-2 and Motala-3 samples.

Parents Related?

Runs of Homozygosity reveals the parents of Motala-12 are not related to each other.

Comparison

When I compared the Motala samples with each other using the entire unfiltered SNPs, I wasn't able to get any good segment matches, probably due to lack of SNPs. But I did get one significant match between Motala-12 and Motala-3.

Chr     Start Position  End Position    Len(Mb)        SNPs
4       10259           14393451        14.3832        873

Largest Segment: 14.3832 Mb
Total Shared: 14.3832 Mb

The above match is significant because the threshold used is 7 Mb / 700 SNPs, allowing no error SNPs. So, Motala-3 is related to Motala-12 within its genealogical timeframe.
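To make the threshold concrete, here is a minimal sketch of how a "no error SNPs" segment search could work: scan the SNPs in position order, keep a run going while the two genotypes share at least one allele, and report the run only if it meets both the SNP-count and length thresholds. This is an assumption about the general approach, not the actual tool used for the comparison above. The function names and toy data are hypothetical, and no-calls are not handled.

```python
def shares_allele(genotype_a, genotype_b):
    """True if two unphased genotypes (e.g. "AG" and "GG") share an allele."""
    return bool(set(genotype_a) & set(genotype_b))


def matching_segments(positions, kit_a, kit_b, min_snps=700, min_mb=7.0):
    """Find runs of consecutive half-identical SNPs with zero mismatches.

    positions: base-pair positions on one chromosome, sorted ascending.
    kit_a, kit_b: genotype strings aligned with positions.
    Returns (start, end, length_mb, snp_count) for runs meeting both thresholds.
    """
    segments = []
    run_start = None  # index of the first SNP in the current run
    for i in range(len(positions) + 1):
        in_run = i < len(positions) and shares_allele(kit_a[i], kit_b[i])
        if in_run:
            if run_start is None:
                run_start = i
            continue
        if run_start is not None:
            # The run ended at index i - 1 (mismatch or end of chromosome).
            count = i - run_start
            length_mb = (positions[i - 1] - positions[run_start]) / 1e6
            if count >= min_snps and length_mb >= min_mb:
                segments.append((positions[run_start], positions[i - 1], length_mb, count))
            run_start = None
    return segments


# Toy data with tiny thresholds so the example is readable.
positions = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
kit_a = ["AA", "AG", "GG", "CC", "TT"]
kit_b = ["AG", "GG", "AG", "TT", "TT"]
print(matching_segments(positions, kit_a, kit_b, min_snps=3, min_mb=2.0))
# [(1000000, 3000000, 2.0, 3)]
```

With the defaults (700 SNPs, 7 Mb), a single mismatching SNP ends the run, which is why a 14 Mb / 873 SNP segment with no errors stands out.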

The sample is tokenized in GEDMatch and available for 1-to-1 comparison. Please wait for 2 days before 1-to-many comparison results are available. Let me know what you find.

↧

Loschbour ancient DNA matches living people

October 6, 2014, 5:53 am

To investigate European population history around the time of the agricultural transition, the authors sequenced complete genomes from a ~8,000 year old skeleton from the Loschbour rock shelter in Heffingen, Luxembourg. I converted the raw data into formats familiar to genetic genealogists, which can be found here, and uploaded it as kit# F999918 in GEDMatch.

After uploading, I first checked the admixture.

MDLP K23b Calculator for F999918

The calculator says the admixture is predominantly European Hunter Gatherer and the rest European early Farmers.

I waited until the batch processing was complete, and there was a surprise. Similar to Clovis-Anzick, Mal'ta and LBK, Loschbour also has matches with living people.

GEDMatch 1-to-many for F999918

I picked a few to see if I can triangulate, and yes I am able to at 5 cM / 500 SNPs threshold with no error in-between.

Triangulation

I am not sure what to make of these people who match an 8000 year old ancient DNA sample. Could the sample be contaminated? Is the sample really ancient? I don't know for sure. These matches raise more questions than answers. Let me know what you think.

↧

Clovis Anzick DNA Match: SNP by SNP Analysis

October 6, 2014, 7:53 am

I initially uploaded F999912 with FTDNA SNPs only. Surprised by the number of matches, I then uploaded F999913 with SNPs tested by all DNA companies and removed the previous kit. Both kits were made from the VCF files provided by the authors and used in their experiment, where the Clovis Anzick sample is one of the genotypes (the second) among other samples used for comparison.

The bad news is, I am not sure if those SNPs were the complete set or some sort of filtered version used for their experiment. The good news is, we have the BAM files and the sequence reads available. However, their size is enormous and they take ages to process. I began processing last week and it still looks like it will take another week, but chromosome 3 is complete.

I am grateful to the owner of kit# F334678 (Robin Frisella), who sent me her autosomal file for SNP by SNP comparison and allowed me to post the results.

GEDMatch 1-to-1 comparison

There are several segments matching with no errors at 5 cM / 500 SNPs. However, I will take the first segment match, on chromosome 3, as I now have the SNPs of Clovis Anzick not only from the VCF (F999913), but also from the BAM file.

A detailed SNP-by-SNP comparison: Clovis-Anzick-Segment-Analysis.xlsx

Preview of SNP by SNP Analysis

Based on the SNP by SNP analysis of a matching segment from kit# F334678, not only is the segment match real with no errors (sites where F999913 had no-calls), but the Clovis Anzick sample processed from the BAM file also has significantly more SNPs in common with DNA testing companies. Hence, the VCFs contained only some filtered SNPs used by the authors in their experiment, while the BAM files contain all SNPs. The BAM processing for Clovis Anzick is not yet complete and will take at least another week. Once it is processed, I will replace kit# F999913, which should give more accurate results/matches and address the no-calls issue for Clovis Anzick. I will try to retain the same kit number if possible.

↧

New Clovis Anzick-1 kit in GEDMatch: F999919

October 8, 2014, 5:33 am

As described in the post Clovis Anzick DNA Match: SNP by SNP Analysis, the processing of the Clovis Anzick ancient DNA BAM file, which is 40 gigabytes and ran for 2 weeks, is now complete, and it did yield a lot of SNPs in common with DNA testing companies. To give a perspective on the number of SNPs it has: out of 1001421 SNPs, which is the complete collection of SNPs from FTDNA, 23andMe and Ancestry, the new kit has 926009 SNPs in common, which means the matches from the new kit will be very accurate. The raw data can be downloaded from here.

I uploaded it as kit# F999919. I am not removing the older kit# F999913 which helps to compare segment matches in detail and investigate further.

Admixture differences:

MDLP K23b Admixture - F999913 vs F999919

One-to-One Comparison:

1-to-1 Comparison: F999913 vs F999919

A quick comparison of a few matches at the 7 cM/700 threshold with no errors reveals some segments dropping off, while a few remain but with more SNPs, indicating a true matching segment. New segments also seem to appear, indicating an increased number of SNPs in the sample at those sites, which now match the other person.

Largest Segment:

Largest segment at 10 cM / 1000 SNPs threshold with no error

Above is the largest segment so far I can identify, which is really awesome.

The kit is tokenized and available for 1-to-1 comparison but not yet available for 1-to-many. Batch processing will take at least 2 days. So check 1-to-many for the F999919 kit after 2 days.

Let me know what you think. Can this Clovis Anzick-1 sample be 12500 years old and still match living people like 5th cousins?

↧

Downloading files from Google Drive

October 9, 2014, 4:02 pm

I believe most people who are not familiar with Google Drive face difficulty in downloading files, especially files that show previews. Because I have been using Google Drive extensively for sharing project downloads, I thought it would be helpful to write a small tutorial.

Let's say we want to download the Neanderthal complete SNPs, which is the file 999902-autosomal-o37-results-full-snps.zip.

Clicking the file actually shows a preview of what is inside it.

This is where many get confused about how to download the file. All you need to do is click the download button just at the top.

Depending on the size, the file may get downloaded directly, or Google may ask for another confirmation if it cannot scan the file. Generally, Google scans all files below 30 Mb for viruses. For all files above 30 Mb, a confirmation is asked. I always recommend scanning files with a proper anti-virus after downloading them from the internet.

Just click 'Download Anyway', and the file gets downloaded as shown below.

The above simple steps will help you download the file. I am writing this blog because I had requests from several people having difficulty downloading files from Google Drive.

↧

The true IBS noise range

October 10, 2014, 6:35 pm

Last year, I wrote a blog, Noise threshold on atDNA Matches, where I mathematically calculated that IBS noise cannot exceed 150 consecutive SNPs. However, this assumes that the population of the world has all possible genotypes. In reality, the population does not have such diversity. One of the reasons I want to investigate this is that I am eager to find IBS compound segments between myself and ancient DNA but am unsure of the thresholds to use to eliminate noise.

For example, I said:

Every genotype say, AG will match (A will match AA,AG,AT,AC -or- G will match AG,GG,GC,GT) - taking the union, AG will match AA,AG,AT,AC,GG,GC,GT (7 genotypes out of possible 10).

But in reality, what if only 2 genotypes, say AA and AG, are found in populations and always universally match? This means the probability is not 0.7 but always 1. So, the probability of a matching segment of consecutive SNPs varies drastically and depends purely on the genotypes found in populations.
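The 7-out-of-10 figure, and the jump to a probability of 1, can be checked with a short enumeration. This is only a sketch: it treats every candidate genotype as equally likely, which is a simplifying assumption and not how real populations behave, and the helper names are made up for the example.

```python
from itertools import combinations_with_replacement

ALLELES = "ACGT"
# The 10 unordered genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
ALL_GENOTYPES = ["".join(pair) for pair in combinations_with_replacement(ALLELES, 2)]


def half_matches(genotype, candidates):
    """Candidates that share at least one allele with the given genotype."""
    return [g for g in candidates if set(g) & set(genotype)]


def match_probability(genotype, candidates):
    """Chance of a half-match if every candidate genotype is equally likely."""
    return len(half_matches(genotype, candidates)) / len(candidates)


print(len(ALL_GENOTYPES))                        # 10
print(half_matches("AG", ALL_GENOTYPES))         # ['AA', 'AC', 'AG', 'AT', 'CG', 'GG', 'GT']
print(match_probability("AG", ALL_GENOTYPES))    # 0.7
print(match_probability("AG", ["AA", "AG"]))     # 1.0
```

Genotypes are unordered here, so GC in the quoted list appears as CG.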

Solution

To solve this problem, I took the genotype frequencies from OpenSNP and created a random file which has exactly the same SNPs as my autosomal file, except that the genotype is randomized based on "what is found in populations". I will create multiple random files and compare them with my autosomal file to see how they match. This will help us figure out the actual noise threshold.

I started off with a 150 SNPs / 1 Mb threshold and I didn't get any matches. So, I reduced the threshold to 100 SNPs, and below are the matching segments with a random file.

Autosomal match with random file #1

Chr     Start Position  End Position    Len(Mb) SNPs
4       92845437        93847904        1.00247 108
8       78394149        79510525        1.11638 145
17      22034501        25874456        3.83995 103

Largest Segment: 3.83995 Mb
Total Shared: 5.9588 Mb

Autosomal match with random file #2

Chr     Start Position  End Position    Len(Mb) SNPs
4       48960863        53399616        4.43875 106
18      14550375        19351344        4.80097 132

Largest Segment: 4.80097 Mb
Total Shared: 9.23972 Mb

Autosomal match with random file #3

Chr     Start Position  End Position    Len(Mb) SNPs
3       162511139       163554331       1.04319 108
6       58047654        62510310        4.46266 105

Largest Segment: 4.46266 Mb
Total Shared: 5.50585 Mb

The source code used to generate random autosomal files using genotypes found among populations and OpenSNP genotype frequencies can be downloaded from GitHub.

Conclusion

A true noise IBS segment cannot occur above the 150 consecutive SNPs for 1 Mb threshold. Anything above the 150 SNPs / 1 Mb threshold must be an IBS compound segment among populations. Please note that the 1 Mb unit varies a bit from cM. This result, however, is based on OpenSNP genotype frequencies for each SNP. It is basically an IBS-noise vs IBS-compound segments test. While it seems 150 consecutive SNPs cannot just occur randomly to match someone, I am still not sure how far back in time such a compound segment, say 200 SNPs/2 cM, would go to a common ancestor. Can it go to the very founder population? I don't know.

↧

Hinxton-2 Analysis

October 11, 2014, 8:22 am

Hinxton-2 refers to an ancient sample, ERS389796, which was provided by the authors of a yet-to-be-published paper. I was able to upload it to GEDMatch as kit# F999921. Below is what I found based on my initial analysis.

Admixture

MDLP K23b Admixture Calculator for Hinxton-2 sample

Eurogenes K13 Calculator for Hinxton-2 sample

Parents

Runs of Homozygosity reveals the Hinxton-2 sample's parents are first cousins.

Mt-DNA

Mt-Haplogroup is H2a2b1

Telomere Length

The average telomere length from all sequence read runs gives 1.42. This means that the lady the Hinxton-2 sample belongs to, who lived 2500-1800 years ago, died at the age of 65.

Telomere length for 1.42
(Image adapted from http://learn.genetics.utah.edu/content/chromosomes/telomeres/)

Eye Color

Eye color from GEDMatch

Comparison

The kit# F999921 is available for 1-to-1 comparison in GEDMatch. For 1-to-many, please wait a couple of days for batch processing.

GEDMatch Diagnostic Utility

Please be cautious with matching segments on chromosome 21, since it has fewer SNPs.

Let me know what you find.

↧

IBS Noise Kit at GEDMatch

October 12, 2014, 4:37 am

After writing the post "The true IBS noise range", I got a request to upload the random file to GEDMatch to see if it matches anyone. I initially hesitated, but later I thought, why not? Maybe this helps to understand and refine the IBS noise thresholds. So, I used the same code that I uploaded to GitHub to generate a random file with FTDNA SNPs and uploaded it to GEDMatch as kit# F999901.

Randomness

You can look into the source, but to give a quick overview, the code uses the SecureRandom function to pick, for each SNP, one of the genotypes found in populations. Hence, we get randomness over the genotypes actually found within populations. This is not mathematical randomness over all possible combinations. For example, if a random genotype were chosen mathematically for a SNP, there would be a set of 10 possibilities {AA,TT,GG,CC,AT,AG,AC,GT,CT,GC}. But what if only AA and AC are found in populations? For that reason, the code uses genotype frequency data from OpenSNP and selects only from the set of genotypes found within populations. In this case, the code will select either AA or AC for the SNP and none of the genotypes that are not found in populations.
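To make the idea concrete, here is a minimal Python sketch of picking a genotype only from those observed for a SNP. It is not the code from GitHub: Python's `secrets.SystemRandom` stands in for SecureRandom, the rsIDs and counts are made up, and weighting the pick by how often each genotype was observed is my assumption about how frequency data could be used.

```python
import secrets

# Cryptographically strong RNG, standing in for Java's SecureRandom.
_rng = secrets.SystemRandom()

# Hypothetical per-SNP genotype counts (made-up rsIDs and numbers),
# in the spirit of OpenSNP genotype frequency data.
OBSERVED_GENOTYPES = {
    "rs0000001": {"AA": 120, "AC": 30},
    "rs0000002": {"CC": 80, "CT": 60, "TT": 10},
}


def random_genotype(counts):
    """Pick one observed genotype, weighted by its observed count (assumption)."""
    genotypes = list(counts)
    weights = [counts[g] for g in genotypes]
    return _rng.choices(genotypes, weights=weights, k=1)[0]


def random_kit(observed):
    """Build a random 'kit': one observed genotype per SNP."""
    return {rsid: random_genotype(counts) for rsid, counts in observed.items()}


kit = random_kit(OBSERVED_GENOTYPES)
print(kit)
```

A genotype that never appears in the counts, such as GG for the first SNP, can never be drawn, which is the point made above.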

Admixture

MDLP K23b Calculator for the random genotype kit

Dodecad V3 calculator for the random genotype kit

The above admixture looks like that of a real person, but it is not! It will always give the same result for all random files because of entropy. No matter how hard you try to randomize, the admixture will always be the same or similar because of entropy. What you see as admixture results is the percentage of SNPs assigned to each population used to create the calculator itself. So, basically, you can use these random files to find what percentage of SNPs each calculator uses for each population when calculating admixture results.

Take a look at 2 entirely different random files in Dodecad V3 showing the same/similar result.

2 different random files having same results

One to One Comparison

I did not get a single matching segment, even with thresholds as low as 100 SNPs / 1 cM. Please note that my earlier post used Mb and not cM. Mb and cM differ a bit.

This is an experiment intended to see how much someone can accidentally match through IBS noise. The random genotype files uploaded to GEDMatch can be downloaded from here if you want to investigate further.
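If you want to check for accidental matches yourself, the comparison can be sketched roughly as below. This is only an illustration under my own assumptions, not GEDMatch's algorithm: two genotypes "match" when they share at least one allele (half-identical), no-calls written as "--" never match, and segment length is measured in Mb from base-pair positions because cM would need a genetic map. The function names and the sample data are hypothetical.

```python
def shares_allele(g1, g2):
    """True when two unphased genotypes (e.g. 'AG' and 'GG') share an allele.

    Simplification: the no-call symbol '-' is never counted as a shared allele.
    """
    return bool((set(g1) & set(g2)) - {"-"})


def matching_runs(positions, kit_a, kit_b):
    """Yield (start_bp, end_bp, snp_count) for each run of consecutive matching SNPs."""
    start, last, count = None, None, 0
    for pos, g1, g2 in zip(positions, kit_a, kit_b):
        if shares_allele(g1, g2):
            if start is None:
                start = pos
            last = pos
            count += 1
        elif start is not None:
            yield (start, last, count)
            start, last, count = None, None, 0
    if start is not None:
        yield (start, last, count)


def is_above_threshold(run, min_snps, min_mb):
    """Check a run against a 'SNP count / length in Mb' threshold."""
    start, end, count = run
    return count >= min_snps and (end - start) / 1_000_000 >= min_mb


# Tiny made-up example on one chromosome.
positions = [100_000, 400_000, 900_000, 1_300_000, 1_800_000]
kit_a = ["AA", "AG", "CC", "TT", "GG"]
kit_b = ["AG", "GG", "TT", "CT", "AA"]
runs = list(matching_runs(positions, kit_a, kit_b))
print(runs)
print([r for r in runs if is_above_threshold(r, min_snps=100, min_mb=1.0)])
```

With real kits you would run this per chromosome and pass whichever threshold you want to test as `min_snps` and `min_mb`.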

Conclusion

You can use the code to generate as many random kits as you like, using genotypes found among populations based on OpenSNP data, and compare them with as many kits as you want. The fact remains that a matching segment above 200 SNPs / 2 cM is impossible to get by chance. Hence, any matching segment above the threshold of 200 SNPs / 2 cM is a compound segment, and any matching segment below this threshold may be noise.

Let me know if you find any matching segments that match accidentally.

↧

Convert Ancient DNA Sequence Reads to VCF format

October 13, 2014, 2:46 am

I accidentally stumbled on a post by genetiker at anthrogenica.com calling the files I uploaded garbage and showing, using samtools, that there are no Y-calls in certain regions. Because many people use the files I uploaded and the Ancient DNA GEDMatch kits, I want to reassure everyone that the uploads are perfectly valid and contain SNPs called with high confidence. To do that, I will explain the entire process of converting ancient DNA to the formats familiar to genetic genealogists.

To start with, scientists and researchers upload sequence reads, which usually consist of several SRA files. Only a few upload BAM files. These are massive files, and processing them requires patience, high-end computing hardware and lots of disk space. If you have those, you can process them yourself. Let's say you have downloaded ERR405813.sra. If you follow the steps below, you will be able to convert it to VCF or to the format familiar to you, which is usually the raw data provided by DNA testing companies.

Step #1. Convert SRA to FASTQ

This step converts the sequence reads to FASTQ files. fastq-dump is part of the SRA Toolkit.

$ fastq-dump ERR405813.sra

FASTQ format stores sequences and Phred qualities in a single file.
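If you are curious what a FASTQ record looks like, the following Python sketch reads records and decodes the quality string. It assumes the common layout of four lines per record and the Phred+33 encoding (ASCII code minus 33); the read shown is made up, and older Illumina files that use Phred+64 would need a different offset.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Phred+33 (assumption): each character encodes one quality score.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# A made-up single-read example instead of a real ERR405813.fastq.
example = StringIO("@read1\nGATTACA\n+\nIIIII##\n")
records = list(read_fastq(example))
print(records)
```

For a real file, replace the `StringIO` object with `open("ERR405813.fastq")`.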

Step #2. Convert FASTQ to SAM

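In this step we will use the Burrows-Wheeler Aligner (BWA) to align the FASTQ reads against a reference and produce a SAM file. This step assumes that the correct reference file ref.fa has already been indexed (for example with bwa index and samtools faidx) and that its sequence dictionary has been generated (for example with Picard's CreateSequenceDictionary).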

$ bwa mem ref.fa ERR405813.fastq > ERR405813.sam

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of an optional header section and an alignment section. For more information on SAM specification, please refer here.
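To show what the alignment section contains, here is a small Python sketch that splits one SAM line into its 11 mandatory fields and checks two of the FLAG bits (0x4 for unmapped, 0x10 for reverse strand). The `SamRecord` name, the helper functions and the example line are hypothetical and only meant for illustration.

```python
from collections import namedtuple

# The 11 mandatory SAM columns, plus any optional TAG:TYPE:VALUE fields.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one SAM line; return None for header lines starting with '@'."""
    if line.startswith("@"):
        return None
    f = line.rstrip("\n").split("\t")
    return SamRecord(
        f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
        f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:],
    )


def is_mapped(rec):
    """FLAG bit 0x4 set means the read is unmapped."""
    return not rec.flag & 0x4


def is_reverse(rec):
    """FLAG bit 0x10 set means the read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)


# Made-up alignment line; the contig name depends on your ref.fa.
line = "read1\t16\tchrY\t2655180\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, is_mapped(rec), is_reverse(rec))
```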

Step #3. Add Read Group

Some of the tools we will use later, like GATK, require read groups, and the SAM file we generated above does not have an @RG tag. So we will add it using Picard. Because the sequence reads come from a single sample, we can simply add it using the command below.

$ java -Xmx2g -jar AddOrReplaceReadGroups.jar INPUT=ERR405813.sam OUTPUT=ERR405813.bam SORT_ORDER=coordinate RGID=rgid RGLB=rglib RGPL=illumina RGPU=rgpu RGSM=sample

Step #4. Index the BAM

When we added the read group, we also sorted by coordinate. So, it is enough to just index the BAM file. This step produces a new file, ERR405813.bam.bai.

$ samtools index ERR405813.bam

Step #5. Realigning the BAM file

In this step we will use GATK to realign the BAM file.

$ java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa -I ERR405813.bam -o bam.intervals
$ java -Xmx2g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa -I ERR405813.bam -targetIntervals bam.intervals -o ERR405813_realigned.bam

Realigning the BAM file is very important for getting accurate results. The first command above emits the target intervals, and the second performs local realignment of reads to correct misalignments caused by the presence of indels.

Step #6. Index the realigned BAM file

This step will produce ERR405813_realigned.bam.bai file.

$ samtools index ERR405813_realigned.bam

Step #7. Invoke the variant caller

Now invoke GATK's variant caller (UnifiedGenotyper) to emit all confident sites.

$ java -Xmx2g -jar GenomeAnalysisTK.jar -l INFO -R ref.fa -T UnifiedGenotyper -I ERR405813_realigned.bam -rf BadCigar -o bam_out.vcf --output_mode EMIT_ALL_CONFIDENT_SITES
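The resulting VCF is a tab-delimited text file, so you can inspect the calls with a few lines of code. The Python sketch below is only an illustration and not part of the BAM Analysis Kit: it assumes a single-sample VCF with a FORMAT column, looks only at single-base sites, and uses a QUAL cut-off of 30 that I picked arbitrarily. The example records are made up.

```python
def genotype_calls(vcf_lines, min_qual=30.0):
    """Yield (chrom, pos, genotype) for single-sample VCF records with QUAL >= min_qual."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information and column header lines
        f = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, qual = f[0], int(f[1]), f[3], f[4], f[5]
        if qual == "." or float(qual) < min_qual:
            continue
        # ALT is '.' at sites that match the reference.
        alleles = [ref] + ([] if alt == "." else alt.split(","))
        sample = dict(zip(f[8].split(":"), f[9].split(":")))
        indexes = sample.get("GT", ".").replace("|", "/").split("/")
        if "." in indexes:
            continue  # no-call
        yield chrom, pos, "".join(alleles[int(i)] for i in indexes)


# Made-up records standing in for bam_out.vcf.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "1\t1000\t.\tA\t.\t45.2\t.\tDP=12\tGT:DP\t0/0:12",
    "1\t2000\trs1\tC\tT\t88.0\t.\tDP=9\tGT:DP\t0/1:9",
    "1\t3000\t.\tG\tA\t12.0\tLowQual\tDP=2\tGT:DP\t1/1:2",
]
calls = list(genotype_calls(example_vcf))
print(calls)
```

For a real run, pass `open("bam_out.vcf")` instead of the example list.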

Step #8. You have the VCF, what next?

Since you now have the VCF, you can verify everything you want. You can also use my open source program BAMAnalysisKitVCFParser to extract Y-SNPs and mt-markers and to generate all the files familiar to genetic genealogists. If you intend to use my code, the VCF must be passed as a parameter, all required files must be present at the same relative paths as in GitHub, and the gzipped files within the snp138 folder must be extracted. The classpath separator below is the Unix one (:); on Windows, use ; instead.

$ java -Xmx2g -classpath bamkit.jar:. fc.id.au.BAMAnalysisKitVCFParser bam_out.vcf

So, now you know how to convert an SRA file to VCF and other formats.

My Tips / Comments

You don't have to follow the exact steps above. Except for the SRA to SAM/BAM conversion, you can basically modify the BAM Analysis Kit to convert a BAM or SAM file to VCF and to the formats familiar to genetic genealogists.

E.g.,
After step 2, you can convert SAM to BAM using the below command and modify the BAM Analysis kit to process it.

$ samtools view -bS ERR405813.sam > ERR405813.bam

Then, replace the below line

bin\samtools\samtools.exe reheader header reads.bam > bam_wh.bam

with

bin\samtools\samtools.exe reheader header reads.bam > bam_wh_tmp.bam

echo Adding Read Group Header ...

bin\jre\bin\java.exe -Xmx2g -jar bin\picard\AddOrReplaceReadGroups.jar INPUT=bam_wh_tmp.bam OUTPUT=bam_wh.bam SORT_ORDER=coordinate RGID=rgid RGLB=rglib RGPL=illumina RGPU=rgpu RGSM=sample

in the BAM Analysis Kit, and use it to process the BAM file, which should be easy.

Looking at what genetiker posted (screenshot), I can say for sure that he had neither realigned his BAM file nor used a variant caller to emit confident sites. Realignment is important because, without it, bases that mismatch the reference near a misaligned indel can easily be mistaken for SNPs.

(screenshot from anthrogenica.com)

I suggest that genetiker follow all the steps correctly and completely before criticizing someone's work as garbage. This post is only intended to reassure you that the processed ancient DNA files I uploaded can be used with great confidence. By posting this, I am not only making the process transparent but also letting you verify the output yourself.

↧