Channel: Felix's Thought Logs

Convert Ancient DNA Sequence Reads to VCF format

I accidentally stumbled on a post by genetiker on anthrogenica.com that calls the files I uploaded garbage and uses samtools to show that there are no Y calls at certain regions. Because many people use the files I uploaded and the Ancient DNA GEDMatch kits, I want to reassure everyone that the uploads are perfectly valid and contain high-confidence SNPs. To do that, I will explain the entire process of converting ancient DNA sequence reads into formats familiar to genetic genealogists.

To start with, scientists and researchers upload sequence reads, usually as several SRA files; only a few upload BAM files. These files are massive, and processing them requires patience, high-end computing hardware and lots of disk space. If you have those, you can process them yourself. Let's say you have downloaded ERR405813.sra. If you follow the steps below, you will be able to convert it to VCF, or to the familiar raw-data format that DNA testing companies usually provide.

Step #1. Convert SRA to FASTQ

This step converts the sequence reads to FASTQ files. fastq-dump is part of the SRA Toolkit.

$ fastq-dump ERR405813.sra

FASTQ format stores sequences and Phred qualities in a single file.
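To make that concrete, here is a rough Python sketch that reads four-line FASTQ records and decodes the quality string into Phred scores. It assumes the common Phred+33 (Sanger) encoding; the function names and the sample read are made up for illustration and are not part of the SRA Toolkit.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) ends the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


# Hypothetical single-read example, not taken from ERR405813.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 20 means roughly a 1 in 100 chance that the base is wrong.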

Step #2. Convert FASTQ to SAM

In this step we use the Burrows-Wheeler Aligner (BWA) to align the FASTQ reads to the reference and write a SAM file. This step assumes that the correct reference file ref.fa has been indexed and that its sequence dictionary has been generated.

$ bwa mem ref.fa ERR405813.fastq > ERR405813.sam
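If you are not sure whether the reference has been prepared, a small check along these lines can help. It is only a sketch: it assumes the default file names written by bwa index (ref.fa.amb, .ann, .bwt, .pac, .sa), samtools faidx (ref.fa.fai) and a Picard sequence dictionary named ref.dict, and missing_reference_files is a hypothetical helper name.

```python
from pathlib import Path
from typing import List

# Suffixes appended to the full reference name by bwa index and samtools faidx.
APPENDED_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa", ".fai")


def missing_reference_files(reference: str) -> List[str]:
    """Return the reference-related files that do not exist yet."""
    ref = Path(reference)
    expected = [ref]
    expected += [Path(str(ref) + suffix) for suffix in APPENDED_SUFFIXES]
    # The sequence dictionary replaces the extension: ref.fa -> ref.dict
    expected.append(ref.with_suffix(".dict"))
    return [str(path) for path in expected if not path.exists()]
```

An empty list from missing_reference_files("ref.fa") suggests that the alignment and the later GATK steps should find the files they need.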

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of an optional header section and an alignment section. For more information, please refer to the SAM specification.
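As an illustration of that layout, the minimal Python sketch below splits one alignment line into the eleven mandatory SAM fields. The field names follow the SAM specification; the example line and the helper names are invented for illustration.

```python
from typing import Dict, List, Union

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def is_header(line: str) -> bool:
    """Header lines start with '@'; alignment lines never do."""
    return line.startswith("@")


def parse_sam_line(line: str) -> Dict[str, Union[int, str, List[str]]]:
    """Split one alignment line into its mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 TAB-separated fields")
    record: Dict[str, Union[int, str, List[str]]] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


# Hypothetical alignment line, not real output for ERR405813.
example_line = "read1\t16\tchrY\t2655180\t37\t4M\t*\t0\t0\tACGT\tII5!\tNM:i:0"
record = parse_sam_line(example_line)
```

Here the FLAG value 16 marks a read aligned to the reverse strand, and POS is the 1-based leftmost position on the reference.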

Step #3. Add Read Group

Some tools that we will use later, such as GATK, require read groups, and the SAM file generated above has no @RG header line. So we add one using Picard. Because the sequence reads come from a single sample, a single read group is enough, and we add it with the command below.

$ java -Xmx2g -jar AddOrReplaceReadGroups.jar INPUT=ERR405813.sam OUTPUT=ERR405813.bam SORT_ORDER=coordinate RGID=rgid RGLB=rglib RGPL=illumina RGPU=rgpu RGSM=sample
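Whether a header already carries a read group can be checked with a few lines of Python. This sketch works on header text (for example, the output of samtools view -H); read_groups is a hypothetical helper, and the sample header simply mirrors the placeholder values from the command above.

```python
from typing import Dict, List


def read_groups(header_text: str) -> List[Dict[str, str]]:
    """Return one dict of tag -> value for every @RG line in a SAM header."""
    groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG\t"):
            continue
        tags = {}
        for field in line.split("\t")[1:]:
            key, _, value = field.partition(":")  # e.g. "SM:sample"
            tags[key] = value
        groups.append(tags)
    return groups


# Hypothetical headers before and after adding the read group.
header_before = "@SQ\tSN:chrY\tLN:59373566\n"
header_after = header_before + "@RG\tID:rgid\tLB:rglib\tPL:illumina\tPU:rgpu\tSM:sample\n"
```

An empty result for a header means the read group still has to be added before the file is handed to GATK.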

Step #4. Index the BAM

When we added the read group, we also sorted the reads by coordinate, so it is enough to just index the BAM file. This step produces a new file, ERR405813.bam.bai.

$ samtools index ERR405813.bam

Step #5. Realigning the BAM file

In this step we will use GATK to realign the BAM file.

$ java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa -I ERR405813.bam -o bam.intervals
$ java -Xmx2g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa -I ERR405813.bam -targetIntervals bam.intervals -o ERR405813_realigned.bam

Realigning the BAM file is very important for getting accurate results. The first command emits the target intervals, and the second performs local realignment of reads within those intervals to correct misalignments caused by the presence of indels.

Step #6. Index the realigned BAM file

This step produces the file ERR405813_realigned.bam.bai.

$ samtools index ERR405813_realigned.bam

Step #7. Invoke the variant caller

Now call the GATK variant caller (UnifiedGenotyper) and have it emit all confident sites.

$ java -Xmx2g -jar GenomeAnalysisTK.jar -l INFO -R ref.fa -T UnifiedGenotyper -I ERR405813_realigned.bam -rf BadCigar -o bam_out.vcf --output_mode EMIT_ALL_CONFIDENT_SITES
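Checking whether a given Y position received a call does not need special tooling once the VCF exists. The Python sketch below scans VCF text and returns the records for one chromosome. It assumes the chromosome is named chrY in ref.fa (some references use Y instead), and both the helper name and the sample records are illustrative, not real output for ERR405813.

```python
from typing import Dict, Iterable, List, Union


def calls_on_chromosome(vcf_lines: Iterable[str],
                        chromosome: str = "chrY") -> List[Dict[str, Union[int, str]]]:
    """Collect position, alleles, quality and genotype for one chromosome."""
    calls = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[0] != chromosome:
            continue
        # FORMAT (column 9) names the per-sample values in column 10.
        keys = cols[8].split(":") if len(cols) > 9 else []
        values = cols[9].split(":") if len(cols) > 9 else []
        genotype = dict(zip(keys, values)).get("GT", ".")
        calls.append({"pos": int(cols[1]), "ref": cols[3], "alt": cols[4],
                      "qual": cols[5], "genotype": genotype})
    return calls


# Hypothetical VCF content for illustration only.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "chrY\t2655180\t.\tG\tA\t45.2\t.\tDP=3\tGT:DP\t1/1:3",
    "chrY\t2655200\t.\tC\t.\t30.1\t.\tDP=2\tGT:DP\t0/0:2",
    "chr1\t1000\t.\tT\t.\t30.0\t.\tDP=2\tGT:DP\t0/0:2",
]
calls = calls_on_chromosome(example_vcf)
```

In this made-up example the second chrY record has "." in the ALT column, which is how a confident site that matches the reference typically appears when all confident sites are emitted.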

Step #8. You have the VCF, what next?

Now that you have the VCF, you can verify anything you want. You can also use my open source program BAMAnalysisKitVCFParser to extract Y-SNPs and mt markers and to generate all the files familiar to genetic genealogists. If you intend to use my code, the VCF must be passed as a parameter, all required files must be present in the same relative paths as on GitHub, and the gzipped files within the snp138 folder must be extracted.

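$ java -Xmx2g -classpath "bamkit.jar;." fc.id.au.BAMAnalysisKitVCFParser bam_out.vcf

The ; separator in the classpath is the Windows form; on Linux or macOS, use "bamkit.jar:." instead.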

So, now you know how to convert an SRA file to VCF and other formats.

My Tips / Comments

You don't have to follow the exact steps above. Except for the SRA to SAM/BAM conversion, you can basically modify the BAM Analysis Kit to convert a BAM or SAM file to VCF and to formats familiar to genetic genealogists.

For example, after step 2 you can convert the SAM file to BAM using the command below and modify the BAM Analysis Kit to process it.

$ samtools view -bS ERR405813.sam > ERR405813.bam

Then, replace the line below

bin\samtools\samtools.exe reheader header reads.bam > bam_wh.bam

with

bin\samtools\samtools.exe reheader header reads.bam > bam_wh_tmp.bam

echo Adding Read Group Header ...
bin\jre\bin\java.exe -Xmx2g -jar bin\picard\AddOrReplaceReadGroups.jar INPUT=bam_wh_tmp.bam OUTPUT=bam_wh.bam SORT_ORDER=coordinate RGID=rgid RGLB=rglib RGPL=illumina RGPU=rgpu RGSM=sample

in the BAM Analysis Kit, and use the modified kit to process the BAM file, which should be easy.

Looking at what genetiker posted (screenshot), I can say for sure that he had neither realigned his BAM file nor used a variant caller to emit confident sites. Realignment is important because, without it, bases that mismatch the reference near a misaligned indel can easily be mistaken for SNPs.

(screenshot from anthrogenica.com)

I suggest that genetiker run all the steps correctly and completely before criticizing someone's work as garbage. This post is only intended to reassure readers that the processed ancient DNA files I uploaded can be used with great confidence. By posting this, I am not only making the process transparent; I am also making it possible for you to verify the output results yourself.
