GATK (Genome Analysis Toolkit) is a suite of programs for SNP calling developed by the Broad Institute in the United States. Physical and chemical reactions during sequencing, together with defects in the sequencing instrument, cause the reported base quality scores to deviate from the true values, and the BaseRecalibrator program was developed to correct them. A standard database of known SNPs, such as the human dbSNP database, is a very important input file for base quality score recalibration. But if the genome being studied belongs to a relatively new species with no standard SNP database, is it still possible to recalibrate base qualities? The answer is yes, but you need to use your existing data to build a stand-in for a standard SNP database. The relevant description from the GATK website is reproduced below (original URL: https://software.broadinstitute.org/gatk/documentation/article?id=44).
I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.
The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.
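A rough back-of-the-envelope calculation shows the direction of this effect. The sketch below uses the standard Phred definition Q = -10 log10(p) with made-up counts. The numbers and the function name `empirical_quality` are illustrative assumptions, and GATK's actual recalibration model is more elaborate than a single mismatch rate.

```python
import math


def empirical_quality(mismatches: int, observations: int) -> float:
    """Phred-scaled quality implied by an observed mismatch rate."""
    return -10 * math.log10(mismatches / observations)


# Hypothetical counts for one group of bases (not taken from real data).
observations = 1_000_000
machine_errors = 1_000   # genuine sequencing errors, a rate of 1e-3
snp_mismatches = 1_000   # real variants that also mismatch the reference

# With known sites skipped, only machine errors count as mismatches.
masked = empirical_quality(machine_errors, observations)
# Without known sites, true SNPs are wrongly counted as errors too.
unmasked = empirical_quality(machine_errors + snp_mismatches, observations)

print(f"known sites skipped: Q{masked:.1f}")
print(f"no known sites:      Q{unmasked:.1f}")
```

With these assumed counts, the same bases are rated about Q30 when polymorphic sites are skipped and about Q27 when they are not, so every true SNP that is not masked pushes the inferred quality downward.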
However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:
- First do an initial round of SNP calling on your original, unrecalibrated data.
- Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
- Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
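The second step leaves open how "highest confidence" is decided. One simple possibility, shown here only as a sketch, is to keep SNP records whose QUAL column exceeds a cutoff. The function name `high_confidence_snps`, the cutoff of 100, and the example records are all assumptions for illustration; the GATK text does not prescribe a particular criterion.

```python
from typing import Iterable, List


def high_confidence_snps(vcf_lines: Iterable[str], min_qual: float) -> List[str]:
    """Keep header lines plus SNP records whose QUAL is at least min_qual."""
    kept = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)  # preserve the VCF header unchanged
            continue
        fields = line.split("\t")
        ref, alts, qual = fields[3], fields[4].split(","), fields[5]
        # A SNP here means a single-base REF and single-base ALT alleles.
        is_snp = len(ref) == 1 and all(len(a) == 1 and a in "ACGT" for a in alts)
        if is_snp and qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept


# Tiny hypothetical VCF: one strong SNP, one weak SNP, one deletion.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t812.5\t.\t.",
    "chr1\t250\t.\tC\tT\t31.0\t.\t.",
    "chr1\t400\t.\tGA\tG\t950.0\t.\t.",
]
for record in high_confidence_snps(example, min_qual=100.0):
    print(record)
```

Only the header lines and the record at position 100 survive. A stricter cutoff yields a smaller but cleaner set of known sites; the right trade-off depends on the data and has to be found by experiment.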
Question: The genome I'm currently working on does not yet have a good SNP database. Does it still make sense to run base quality score recalibration without a database of known SNPs?
Answer: The base quality score recalibrator treats every base that mismatches the reference genome as a machine error. True polymorphic sites are legitimate mismatches and therefore should not be counted against base quality. We use a database of known polymorphic sites to skip most polymorphic sites. Unfortunately, without this information the data becomes almost completely unusable, because the base quality scores would be inferred to be much lower than they actually are as a result of the SNP sites that mismatch the reference genome.
However, if you are willing to experiment a bit, base qualities can still be recalibrated. You can build a database of known SNPs on your own. The steps are as follows:
1. First, perform an initial round of SNP calling on your original, unrecalibrated data.
2. Then select the SNP sites you are most confident in as the known SNP database and pass them as a VCF file to the base quality score recalibrator.
3. Finally, perform a real round of SNP calling using the recalibrated data. These steps can be repeated several times until the results converge.
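The whole procedure can be laid out as a sequence of commands. The sketch below only builds and prints the command lines and does not run anything. It assumes GATK4-style tool names and arguments (`HaplotypeCaller`, `BaseRecalibrator` with `--known-sites`, `ApplyBQSR` with `--bqsr-recal-file`); the file names, the fixed number of rounds in place of a real convergence check, and the choice to always recalibrate the original BAM are illustrative assumptions, not part of the GATK description.

```python
import shlex
from typing import List


def bootstrap_plan(ref: str, bam: str, rounds: int) -> List[str]:
    """Return shell-style lines for `rounds` bootstrap iterations plus a final call."""
    lines: List[str] = []
    calling_bam = bam  # round 1 calls SNPs on the unrecalibrated reads
    for n in range(1, rounds + 1):
        raw_vcf = f"round{n}.raw.vcf.gz"          # hypothetical file names
        known_vcf = f"round{n}.highconf.vcf.gz"
        table = f"round{n}.recal.table"
        recal_bam = f"round{n}.recal.bam"
        # Step 1: SNP calling on the current best version of the reads.
        lines.append(shlex.join(
            ["gatk", "HaplotypeCaller", "-R", ref, "-I", calling_bam, "-O", raw_vcf]))
        # Step 2: placeholder for your own high-confidence filtering.
        lines.append(f"# filter {raw_vcf} to high-confidence SNPs -> {known_vcf}")
        # Recalibrate the original reads against the bootstrapped known sites.
        lines.append(shlex.join(
            ["gatk", "BaseRecalibrator", "-R", ref, "-I", bam,
             "--known-sites", known_vcf, "-O", table]))
        lines.append(shlex.join(
            ["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
             "--bqsr-recal-file", table, "-O", recal_bam]))
        calling_bam = recal_bam
    # Step 3: the "real" round of SNP calling on the recalibrated reads.
    lines.append(shlex.join(
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", calling_bam, "-O", "final.vcf.gz"]))
    return lines


for line in bootstrap_plan("ref.fasta", "sample.bam", rounds=2):
    print(line)
```

In practice the number of rounds would be decided by comparing successive recalibration tables or call sets rather than fixed in advance, and older GATK releases use different tool and argument names.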
Can the GATK BaseRecalibrator program recalibrate base qualities without a standard SNP database?