I recently stumbled on a post by genetiker on anthrogenica.com claiming that the files I uploaded are garbage, and using samtools to show that there are no Y calls in certain regions. Because many people use the files I uploaded and the Ancient DNA GEDMatch kits, I want to reassure you that the uploads are perfectly valid and contain SNPs called with high confidence. To do that, I will explain the entire process of converting ancient DNA into formats familiar to genetic genealogists.
To start with, scientists and researchers upload sequence reads, which usually consist of several SRA files. Only a few upload BAM files. These are massive files, and processing them requires patience, high-end computing hardware and lots of disk space. If you have those, you can process them yourself. Let's say you have downloaded ERR405813.sra. If you follow the steps below, you will be able to convert it to VCF, or to the raw data format you are familiar with from DNA testing companies.
Step #1. Convert SRA to FASTQ
This step converts the sequence reads to FASTQ files. fastq-dump is part of the SRA Toolkit.
$ fastq-dump ERR405813.sra
FASTQ format stores sequences and Phred qualities in a single file.
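Before moving on, it can be worth checking that the dump is complete. A FASTQ record normally takes four lines (header, sequence, a "+" separator, qualities), so the read count is the line count divided by four. The function below is a minimal sketch under that assumption (uncompressed file, no wrapped sequence lines); the name count_fastq_reads is mine, not part of the SRA Toolkit.

```shell
# Count reads in an uncompressed FASTQ file, assuming the usual
# four-lines-per-record layout (no wrapped sequence lines).
# Fails with status 1 if the line count is not a multiple of 4,
# which usually means a truncated file.
count_fastq_reads() {
    awk 'END {
        if (NR % 4 != 0) {
            print "line count " NR " is not a multiple of 4" > "/dev/stderr"
            exit 1
        }
        printf "%d\n", NR / 4
    }' "$1"
}
```

Usage would be something like count_fastq_reads ERR405813.fastq, which prints a single number.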
Step #2. Convert FASTQ to SAM
In this step we use the Burrows-Wheeler Aligner to convert the FASTQ file to a SAM file. This step assumes that the correct reference file ref.fa is indexed and that its sequence dictionary has been generated.
$ bwa mem ref.fa ERR405813.fastq > ERR405813.sam
SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of an optional header section and an alignment section. For more information on the SAM specification, please refer here.
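Since SAM is plain text, the two sections are easy to inspect: header lines start with "@" and every other line is an alignment record. The sketch below counts both and also reports how many @RG header lines are present. The function name sam_summary is hypothetical, and it assumes an uncompressed SAM file (not BAM).

```shell
# Summarise a SAM file: header lines start with '@', every other line
# is an alignment record (mapped or unmapped). Also count @RG headers.
sam_summary() {
    awk -F '\t' '
        /^@/ { header++; if ($1 == "@RG") rg++; next }
             { records++ }
        END {
            printf "header_lines=%d records=%d read_groups=%d\n",
                   header, records, rg
        }' "$1"
}
```

Running sam_summary ERR405813.sam on the output of bwa mem as invoked above should report read_groups=0, because no read group was passed to the aligner.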
Step #3. Add Read Group
Some tools that we will use later, such as GATK, require read groups, and the SAM file we generated above does not have an @RG tag. So we add one using picard. Because the sequence reads come from a single sample, we just add it with the command below.
$ java -Xmx2g -jar AddOrReplaceReadGroups.jar INPUT=ERR405813.sam OUTPUT=ERR405813.bam SORT_ORDER=coordinate RGID=rgid RGLB=rglib RGPL=illumina RGPU=rgpu RGSM=sample
Step #4. Index the BAM
When we added the read group, we also sorted by coordinate, so it is enough to just index the BAM file. This step produces a new file, ERR405813.bam.bai.
$ samtools index ERR405813.bam
Step #5. Realigning the BAM file
In this step we use GATK to realign the BAM file.
$ java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa -I ERR405813.bam -o bam.intervals
$ java -Xmx2g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa -I ERR405813.bam -targetIntervals bam.intervals -o ERR405813_realigned.bam
Realigning the BAM file is very important for getting accurate results. The first command emits the target intervals, and the second performs local realignment of reads to correct misalignments caused by the presence of indels.
Step #6. Index the realigned BAM file
This step produces the ERR405813_realigned.bam.bai file.
$ samtools index ERR405813_realigned.bam
Step #7. Invoke the variant caller
Now call the variant caller, using GATK to emit all confident sites.
$ java -Xmx2g -jar GenomeAnalysisTK.jar -l INFO -R ref.fa -T UnifiedGenotyper -I ERR405813_realigned.bam -rf BadCigar -o bam_out.vcf --output_mode EMIT_ALL_CONFIDENT_SITES
Step #8. You have the VCF, what next?
Now that you have the VCF, you can verify everything you want. You can also use my open source program BAMAnalysisKitVCFParser to extract Y-SNPs and mt-markers and generate all the files familiar to genetic genealogists. If you intend to use my code, the VCF must be passed as a parameter, all required files must be present at the same relative paths as on GitHub, and the gzipped files within the snp138 folder must be extracted.
$ java -Xmx2g -classpath bamkit.jar;. fc.id.au.BAMAnalysisKitVCFParser bam_out.vcf
The ; classpath separator shown here is the Windows form. In a Linux or macOS shell, use bamkit.jar:. instead, because an unquoted ; ends the command.
So, now you know how to convert an SRA file to VCF and other formats.
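Steps 1 to 7 can be chained into one script. What follows is only a sketch, not the script I used: the function names run and sra_to_vcf and the variables DRY_RUN, PICARD_DIR and GATK_JAR are hypothetical, and it assumes fastq-dump, bwa, samtools and java are on the PATH and that ref.fa is already indexed. Setting DRY_RUN=1 prints the commands without executing them, which is a cheap way to review the pipeline before committing hours of compute.

```shell
# Print a command, then execute it unless DRY_RUN is set.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# Chain steps 1-7 for one accession. Stops at the first failing step.
# PICARD_DIR and GATK_JAR are hypothetical variables locating the jars.
sra_to_vcf() {
    acc=$1                                   # e.g. ERR405813
    ref=$2                                   # e.g. ref.fa (already indexed)
    picard=${PICARD_DIR:-.}
    gatk=${GATK_JAR:-GenomeAnalysisTK.jar}

    run fastq-dump "$acc.sra" || return 1

    # Step 2 needs an output redirect, so it is handled outside run().
    echo "+ bwa mem $ref $acc.fastq > $acc.sam"
    if [ -z "${DRY_RUN:-}" ]; then
        bwa mem "$ref" "$acc.fastq" > "$acc.sam" || return 1
    fi

    run java -Xmx2g -jar "$picard/AddOrReplaceReadGroups.jar" \
        INPUT="$acc.sam" OUTPUT="$acc.bam" SORT_ORDER=coordinate \
        RGID=rgid RGLB=rglib RGPL=illumina RGPU=rgpu RGSM=sample || return 1
    run samtools index "$acc.bam" || return 1
    run java -Xmx2g -jar "$gatk" -T RealignerTargetCreator \
        -R "$ref" -I "$acc.bam" -o bam.intervals || return 1
    run java -Xmx2g -jar "$gatk" -T IndelRealigner \
        -R "$ref" -I "$acc.bam" -targetIntervals bam.intervals \
        -o "${acc}_realigned.bam" || return 1
    run samtools index "${acc}_realigned.bam" || return 1
    run java -Xmx2g -jar "$gatk" -l INFO -R "$ref" -T UnifiedGenotyper \
        -I "${acc}_realigned.bam" -rf BadCigar -o bam_out.vcf \
        --output_mode EMIT_ALL_CONFIDENT_SITES
}
```

With the functions loaded, DRY_RUN=1 sra_to_vcf ERR405813 ref.fa lists the eight commands, and sra_to_vcf ERR405813 ref.fa runs them.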
My Tips / Comments
You don't have to follow the exact steps above. Except for the SRA to SAM/BAM conversion, you can modify the BAM Analysis Kit to convert a BAM or SAM file to VCF and to formats familiar to genetic genealogists.
E.g., after step 2, you can convert SAM to BAM using the command below.
$ samtools view -bS ERR405813.sam > ERR405813.bam
Then, in the BAM Analysis Kit, replace the line
bin\samtools\samtools.exe reheader header reads.bam > bam_wh.bam
with
bin\samtools\samtools.exe reheader header reads.bam > bam_wh_tmp.bam
echo Adding Read Group Header ...
bin\jre\bin\java.exe -Xmx2g -jar bin\picard\AddOrReplaceReadGroups.jar INPUT=bam_wh_tmp.bam OUTPUT=bam_wh.bam SORT_ORDER=coordinate RGID=rgid RGLB=rglib RGPL=illumina RGPU=rgpu RGSM=sample
and use the modified kit to process the BAM file, which should be easy.
Looking at what genetiker posted (screenshot), I can say for sure that he neither realigned his BAM file nor used a variant caller to emit confident sites. Realignment matters because, without it, bases that mismatch the reference near a misaligned indel can easily be mistaken for SNPs.
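Anyone who wants to check a particular position can query the VCF from step 7 directly instead of eyeballing raw pileups. The helper below is a minimal sketch: vcf_lookup is a hypothetical name, it assumes an uncompressed single-sample VCF (sample genotype in column 10), and the chromosome label must match the reference you aligned against (for example chrY versus Y).

```shell
# Print CHROM, POS, REF, ALT, QUAL and the sample column for one position.
# Exit status is 1 when the VCF has no record at that position.
vcf_lookup() {
    vcf=$1; chrom=$2; pos=$3
    awk -F '\t' -v c="$chrom" -v p="$pos" '
        /^#/ { next }                        # skip meta and header lines
        $1 == c && $2 == p {
            print $1, $2, $4, $5, $6, $10
            found = 1
        }
        END { exit (found ? 0 : 1) }' "$vcf"
}
```

For example, vcf_lookup bam_out.vcf chrY 2655180 (the position is arbitrary) prints the call at that site if one was emitted, and returns a non-zero status if the site did not pass as confident.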
I suggest that genetiker run all the steps correctly and completely before criticizing someone's work as garbage. This post is only intended to reassure you that the processed ancient DNA files I uploaded can be used with great confidence. By publishing this post, I am not only making the process transparent; I am also letting you verify the output yourself.