GATK Usage (Reprint)

Source: Internet
Author: User
Tags: gz file

http://blog.sciencenet.cn/blog-1469385-819498.html

Article Directory

    • I. Preparation
    • II. Pipeline overview
    • III. Pipeline

First, what can GATK do? It is mainly used for variant calling from sequencing data, including SNPs and indels. For example, in the now-popular use of exome sequencing to look for variants, the data are generally analyzed through a bwa + GATK pipeline.

To run GATK, you first need to know its website (http://www.broadinstitute.org/gatk/).

I. Preparation

(1) Program download

http://www.broadinstitute.org/gatk/download

Be sure to download the latest version, and note that you want GATK 2 proper, not GATK-Lite. (It is currently version 2; I do not know when it will update to 3. It updates quickly, doesn't it?) So be sure to use the new version. Extract the archive to a directory; no installation is needed. In the extracted directory you will find the file GenomeAnalysisTK.jar. All the tools we need are in this one jar.

(2) Reference Sequence and annotation download

Be sure to run GATK with the various sequence and annotation files from the GATK resource bundle.

GATK Resource Bundle Introduction:

http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it

GATK Resource Bundle FTP address:

http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server

For example, take hg19 (human genome build 19; b37 is also often used) as the reference sequence.

Enter the following command: lftp ftp.broadinstitute.org -u gsapubftp-anonymous

When prompted for a password, type a space and press Enter to log in. Enter the directory named bundle and find the hg19 directory under the latest version number.
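The session can be sketched roughly like this (a sketch only: the bundle's version directory and file layout change over time, so the paths inside the function are illustrative, not exact):

```shell
# Illustrative lftp session for browsing the resource bundle.
# The "2.3/hg19" path is a guess at the layout and may differ on the server.
fetch_hg19_bundle() {
  lftp -u gsapubftp-anonymous, ftp.broadinstitute.org <<'EOF'
cd bundle
ls
cd 2.3/hg19
mget -c ucsc.hg19.fasta.gz
bye
EOF
}
```

Calling fetch_hg19_bundle starts the transfer; the bundle files are typically stored gzipped on the server and need gunzip before use.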

The files to be downloaded include:

ucsc.hg19.fasta
ucsc.hg19.fasta.fai
1000G_omni2.5.hg19.vcf
1000G_omni2.5.hg19.vcf.idx
1000G_phase1.indels.hg19.vcf
1000G_phase1.indels.hg19.vcf.idx
dbsnp_137.hg19.vcf
dbsnp_137.hg19.vcf.idx
hapmap_3.3.hg19.vcf
hapmap_3.3.hg19.vcf.idx
Mills_and_1000G_gold_standard.indels.hg19.vcf
Mills_and_1000G_gold_standard.indels.hg19.vcf.idx

(3) R setup

Some GATK steps use R to draw plots, calling several R packages in the process. If your R lacks these packages, the PDF files cannot be generated. Therefore some necessary R packages must be installed, including (but not limited to) the following:

  • bitops
  • caTools
  • colorspace
  • gdata
  • ggplot2
  • gplots
  • gtools
  • RColorBrewer
  • reshape
  • gsalib

How do you check whether these packages are installed? Taking ggplot2 as an example: start R and enter library(ggplot2). If no error message appears, ggplot2 is installed. If it is not installed, enter the command install.packages("ggplot2") and follow the prompts to install it. If a package installation fails, it is most likely a permissions issue: by default, R was installed by the system administrator, and you have no write permission under directories such as /usr. One option is to install R itself in your own directory and then install the various packages; alternatively, you can tell install.packages to install packages into your own directory.

To install R packages into your own directory:

install.packages("ggplot2", lib="/your/own/R-packages/")

You can then add that path to the R_LIBS environment variable.
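As a minimal shell sketch of that setup (the directory is just an example):

```shell
# Keep a per-user R library so package installs need no root access.
R_LIBS="$HOME/r-packages"   # example location; any writable directory works
export R_LIBS
mkdir -p "$R_LIBS"
# R will now consider $R_LIBS when searching for and installing packages.
```

Inside R, install.packages("ggplot2", lib=Sys.getenv("R_LIBS")) will then install into that directory.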

The installation of gsalib is a bit different.

Download gsalib from this website: https://github.com/broadgsa/gatk

After unpacking, enter the directory containing the build.xml file and run the command ant gsalib. This installs gsalib. If gsalib still fails to load, you may need to tell R the path to gsalib.
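A sketch of that build, assuming the source was unpacked to ~/src/gatk (a hypothetical path) and that the built package lands under an R/ subdirectory (also an assumption; check where ant actually puts it):

```shell
# Hypothetical paths: adjust ~/src/gatk to wherever build.xml lives.
build_gsalib() {
  cd "$HOME/src/gatk" || return 1   # directory containing build.xml
  ant gsalib                        # builds and installs the gsalib R package
  # If R still cannot load gsalib, point R at the library location:
  echo '.libPaths(c(.libPaths(), "~/src/gatk/R"))' >> "$HOME/.Rprofile"
}
```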

http://gatkforums.broadinstitute.org/discussion/1244/what-is-a-gatkreport

That page gives more detailed steps. (Strangely, I have no ~/.Rprofile file and ant did not add any path after completing, yet it still works.)

In addition, the gsalib installation requires Java version 1.6.

To set the paths:

export R_HOME=/your/R/install/path
export R_LIBS=/your/R-packages/path

How do you see the path of the R currently in use? Enter which R.

II. Pipeline overview

Images come from http://www.broadinstitute.org/gatk/guide/topic?name=best-practices. About this Best Practices page: everyone recommended it to me when I first started with GATK, but I simply could not understand it... Many things only make sense after you have done them yourself and look back. Still, since the veterans all say to read it, if you are a novice, do have a look.

The good news is that the page now also links a Broad workshop; some of the materials are very useful. http://www.broadinstitute.org/gatk/guide/events?id=2038#materials

It is best to browse through these workshop materials before starting work.

This wiki on SEQanswers is also a good entry point for seeing what each step is doing.

http://seqanswers.com/wiki/How-to/exome_analysis

However, it is rather outdated: the GATK version it uses must predate 2.0, the current pipeline has some adjustments, and some parameter settings have changed. So that pipeline is for reference only; do not copy its code. For example, in the reference-sequence part it manually generates its own hg19 genome sequence. But the subsequent steps use various dbSNP and indel annotation files, and the indexes of those files are unlikely to match a self-generated hg19 index (the resource-bundle hg19 contains, besides chr1-22, chrX, chrY and chrM, a number of unassembled sequences), which will cause the program to fail and force you to start over. So be sure to use the files from the GATK resource bundle!

The same goes for my code below: do not copy it blindly. Who knows what GATK will change in a few months; it keeps changing all the time... Always keeping up with the updates on the GATK website is the only reliable way. (Though their website is... not really good. Much of the time I cannot find what I want.)

III. Pipeline

(1) Mapping

So far I have only processed paired-end data from MiSeq and HiSeq. Mapping uses BWA, because it allows indels. The other commonly used mapper, Bowtie, cannot generate indels and is more suited to perfect-match settings such as sRNA mapping.

First, build the index of the reference.

bwa index -a bwtsw -p hg19 hg19.fasta

This process takes a long time... more than 5 hours.

-p is the name given to the reference. The following files are generated when it finishes:

hg19.amb
hg19.ann
hg19.bwt
hg19.dict
hg19.pac
hg19.sa

Then comes mapping; let the code-pasting begin... First, map pair 1 and pair 2 of the paired-end data separately, generating the corresponding .sai files, then combine them to generate the SAM file. The SAM file is then sorted with Picard, producing a sorted BAM file.

sample=$1
data_dir=***
work_dir=***
ref_dir=~/database/hg19
sai_dir=$work_dir/saifiles
sam_dir=$work_dir/samfiles
bam_dir=$work_dir/bamfiles
log_dir=$work_dir/logs
met_dir=$work_dir/metrics
other_dir=$work_dir/otherfiles
gatk=/software/sequencing/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar
picard=/software/sequencing/picard_latest

bwa aln $ref_dir/hg19 -t 4 $data_dir/${sample}_1.fastq.gz > $sai_dir/${sample}_1.sai
bwa aln $ref_dir/hg19 -t 4 $data_dir/${sample}_2.fastq.gz > $sai_dir/${sample}_2.sai
bwa sampe -r "@RG\tID:$sample\tLB:$sample\tSM:$sample\tPL:ILLUMINA" $ref_dir/hg19 $sai_dir/${sample}_1.sai $sai_dir/${sample}_2.sai $data_dir/${sample}_1.fastq.gz $data_dir/${sample}_2.fastq.gz | gzip > $sam_dir/$sample.sam.gz
java -Xmx10g -Djava.io.tmpdir=/tmp -jar $picard/SortSam.jar INPUT=$sam_dir/$sample.sam.gz OUTPUT=$bam_dir/$sample.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

Tips:

A. BWA and the tools used later can read gzip-compressed files directly. To save disk space and the read/write time of compressing and decompressing, it is best to use the *.gz files directly as input, and to pipe output through gzip before writing it out.

B. The -r option in the bwa sampe step must not be omitted.

C. Picard SortSam needs a TMP directory for intermediate files, and those files can be large, more than 10 GB. Mind the free space of the directory you specify.

D. For fastq.gz files of roughly 3 GB compressed, bwa aln needs 30-60 min per file. Merging the two generated .sai files into a SAM file needs 60-90 min. Picard SortSam needs another 30-60 min.

E. samtools also provides tools for sorting SAM files and generating BAM files, and it does not produce large intermediate files.
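Tip E can be sketched as follows (a sketch assuming a samtools version ≥ 1.0; the function name is made up):

```shell
# Stream SAM(.gz) straight into a coordinate-sorted, indexed BAM,
# avoiding Picard's large temporary files.
sam_to_sorted_bam() {
  local sam_gz=$1 out_bam=$2
  samtools view -bS "$sam_gz" | samtools sort -o "$out_bam" -
  samtools index "$out_bam"
}
```

Usage would be, e.g., sam_to_sorted_bam sample.sam.gz sample.sorted.bam.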

(2) Duplicate marking.

The aim of this step is to mark PCR duplicates. Because PCR has biases, some sequences are over-represented; they have identical sequence and identical mapping positions, which would distort GATK's work. This step sets a flag on such reads so that GATK can recognize them. You can also set REMOVE_DUPLICATES=true to discard the duplicated reads outright; I have not explored how marking versus discarding affects downstream analysis, and the official pipeline just marks them.

java -Xmx15g -Djava.io.tmpdir=/tmp -jar $picard/MarkDuplicates.jar INPUT=$bam_dir/$sample.bam OUTPUT=$bam_dir/$sample.dedupped.bam METRICS_FILE=$met_dir/$sample.dedupped.metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT
(3) Local realignment

Alignment near indels is often imperfect, so known indel information is used to realign. This takes two steps: the first, RealignerTargetCreator, outputs a file of intervals containing possible indels; the second, IndelRealigner, uses this file to do the realignment.

The first step uses some known indel information. So which files should be used?

Refer to the following link: http://gatkforums.broadinstitute.org/discussion/1247/what-should-i-use-as-known-variantssites-for-running-tool-x

The chart there is especially useful; you can see that the RealignerTargetCreator and IndelRealigner steps recommend using two indel annotation files, passed with -known in the code.

Tool                               dbSNP 129   dbSNP >132   Mills indels   1KG indels   HapMap   Omni
RealignerTargetCreator                                      X              X
IndelRealigner                                              X              X
BaseRecalibrator                               X            X              X
UnifiedGenotyper/HaplotypeCaller               X
VariantRecalibrator                            X            X                             X        X
VariantEval                        X

java -Xmx15g -jar $gatk -T RealignerTargetCreator -R $ref_dir/hg19.fasta -o $other_dir/$sample.realigner.intervals -I $bam_dir/$sample.dedupped.bam -known $ref_dir/1000G_phase1.indels.hg19.vcf -known $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf
java -Xmx15g -jar $gatk -T IndelRealigner -R $ref_dir/hg19.fasta -I $bam_dir/$sample.dedupped.bam -targetIntervals $other_dir/$sample.realigner.intervals -o $bam_dir/$sample.dedupped.realigned.bam -known $ref_dir/1000G_phase1.indels.hg19.vcf -known $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf

In some older versions of the pipeline (like the one on SEQanswers), the mate information of the paired-end data is fixed after this step. The new version does not need that step: IndelRealigner fixes mate information itself. Reference: http://gatkforums.broadinstitute.org/discussion/1562/need-to-run-a-step-with-fixmateinformation-after-realignment-step

(4) Base quality score recalibration

This step, abbreviated BQSR, recalibrates the base quality scores. It also uses some known SNP and indel files (see the table above and the -knownSites options in the code). The first step, BaseRecalibrator, generates a recal.grp file for the correction plus a pre-recalibration plot; the second step, BaseRecalibrator again (with -BQSR), generates a post-recalibration plot; the third step, PrintReads, uses recal.grp to apply the correction and outputs the recalibrated BAM file.

java -Xmx15g -jar $gatk -T BaseRecalibrator -R $ref_dir/hg19.fasta -I $bam_dir/$sample.dedupped.realigned.bam -knownSites $ref_dir/dbsnp_137.hg19.vcf -knownSites $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites $ref_dir/1000G_phase1.indels.hg19.vcf -o $other_dir/$sample.recal.grp -plots $other_dir/$sample.recal.grp.pdf
java -Xmx15g -jar $gatk -T BaseRecalibrator -R $ref_dir/hg19.fasta -I $bam_dir/$sample.dedupped.realigned.bam -BQSR $other_dir/$sample.recal.grp -o $other_dir/$sample.post_recal.grp -plots $other_dir/$sample.post_recal.grp.pdf -knownSites $ref_dir/dbsnp_137.hg19.vcf -knownSites $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites $ref_dir/1000G_phase1.indels.hg19.vcf
java -Xmx15g -jar $gatk -T PrintReads -R $ref_dir/hg19.fasta -I $bam_dir/$sample.dedupped.realigned.bam -BQSR $other_dir/$sample.recal.grp -o $bam_dir/$sample.dedupped.realigned.recal.bam

TIPS:

If the PDF files are not produced in this step, the Rscript step most likely failed. Check whether the required R packages are installed. If you still cannot find the reason, rerun the two BaseRecalibrator steps with -l DEBUG added; the log will then detail why Rscript failed.

(5) Reduce BAM file

This step essentially compresses the BAM file, reducing its size and making it easier for UnifiedGenotyper to read and process the data.

java -Xmx15g -jar $gatk -T ReduceReads -R $ref_dir/hg19.fasta -I $bam_dir/$sample.dedupped.realigned.recal.bam -o $bam_dir/$sample.dedupped.realigned.recal.reduced.bam

The BAM files needed for variant calling are now ready. The next part covers how to use UnifiedGenotyper for variant calling and downstream analysis.

(6) Variant Calling

GATK offers two different tools for variant calling: HaplotypeCaller and UnifiedGenotyper.

According to the GATK developers, HaplotypeCaller uses local de-novo assembly and an HMM likelihood function and performs better than UnifiedGenotyper. However, HaplotypeCaller is still in an experimental phase and may be problematic at runtime. GATK's recommendation is: if you can use HaplotypeCaller, use it. Note that HaplotypeCaller currently cannot take reduced BAM files as input, nor does it support multithreading.

The GATK tool used here is UnifiedGenotyper.

Better than calling each sample separately is multi-sample calling. Such pooled calling makes fuller use of the information across samples, and for a given site the call is more believable.

data_dir=*******
work_dir=******
ref_dir=~/database/hg19
bam_dir=$work_dir/bamfiles
log_dir=$work_dir/logs
met_dir=$work_dir/metrics
other_dir=$work_dir/otherfiles
vcf_dir=$work_dir/vcf
gatk=/software/sequencing/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar

java -Xmx15g -jar $gatk -glm BOTH -l INFO -R $ref_dir/hg19.fasta -T UnifiedGenotyper -I $bam_dir/sample1.dedupped.realigned.recal.reduced.bam -I ... -D $ref_dir/dbsnp_137.hg19.vcf -o $vcf_dir/AllSamples.vcf -metrics $met_dir/snpcall.metrics -L $work_dir/truseq.bed

-glm BOTH means calling SNPs and indels simultaneously.

The -L parameter specifies the BED file of the capture region.

For deep-coverage exome sequencing data this step takes quite long, >12 h. Since calling on each chromosome is relatively independent, if you are in a hurry you can call each chromosome separately and then merge the resulting VCF files with CombineVariants. UnifiedGenotyper can also run multithreaded; for multithreading, please refer to the following two articles:

http://gatkforums.broadinstitute.org/discussion/1988/a-primer-on-parallelism-with-the-gatk#latest

http://www.broadinstitute.org/gatk/guide/article?id=1975
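The per-chromosome split-and-merge idea mentioned above can be sketched like this (variable names follow the script above; the CombineVariants -V syntax is what I believe GATK 2.x accepts, so treat it as an assumption):

```shell
# Sketch: call each chromosome separately, then merge the per-chromosome VCFs.
call_by_chromosome() {
  local merge_args=""
  for n in $(seq 1 22) X Y; do
    local chrom="chr$n"
    java -Xmx15g -jar "$gatk" -T UnifiedGenotyper -glm BOTH \
      -R "$ref_dir/hg19.fasta" \
      -I "$bam_dir/sample1.dedupped.realigned.recal.reduced.bam" \
      -D "$ref_dir/dbsnp_137.hg19.vcf" \
      -L "$chrom" -o "$vcf_dir/AllSamples.$chrom.vcf"
    merge_args="$merge_args -V $vcf_dir/AllSamples.$chrom.vcf"
  done
  java -Xmx15g -jar "$gatk" -T CombineVariants -R "$ref_dir/hg19.fasta" \
    $merge_args -o "$vcf_dir/AllSamples.vcf"
}
```

The loop here is sequential; you could background the per-chromosome jobs if your machine has the memory for several JVMs at once.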

(7) Variant Quality Score Recalibration

Before doing this step, it is advisable to read this article carefully:

http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr

to understand the principle of VQSR: based on known variants, it builds a model to evaluate and filter new variants.

Note that VQSR only applies to exome sequencing or whole-genome sequencing with more than 30 samples. If the number of samples is insufficient or the capture region too small, refer to the workaround recommendations for this kind of problem in http://www.broadinstitute.org/gatk/guide/topic?name=best-practices; hard filtering is usually done with VariantFiltration.

For data that can undergo VQSR, proceed in two steps:

A. VariantRecalibrator. This step builds the model and generates the files used for evaluation and recalibration.

java -Xmx15g -jar $gatk -T VariantRecalibrator -R $ref_dir/hg19.fasta -mode SNP -input $vcf_dir/AllSamples.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $ref_dir/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 $ref_dir/1000G_omni2.5.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 $ref_dir/dbsnp_137.hg19.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -recalFile $other_dir/AllSamples_snp.recal -tranchesFile $other_dir/AllSamples_snp.tranches -rscriptFile $other_dir/AllSamples_snp.plots.R --TStranche 90.0 --TStranche 93.0 --TStranche 95.0 --TStranche 97.0 --TStranche 99.0 --TStranche 99.9 --TStranche 100.0

First run the code above. The -an flags should include as many of the variant annotator parameters as possible, to make it easier to evaluate them later and select the optimal combination. This step produces plots via AllSamples_snp.plots.R, similar to the following: [figure omitted]

Each plot is a scatter plot of two annotator parameters; refer to the link above for their specific meanings. What we have to do is choose the parameters that divide the variants well into known/novel and retained/filtered. In the example shown, MQRankSum and HaplotypeScore separate the known-variant clusters very clearly, so these two parameters are kept, in order to filter out the clusters that differ most from the known variants.

Choosing the annotator parameters is honestly a bit subjective and takes experience... anyway, good luck.

When you have chosen all the parameters that separate the variants well, run the code above once more; this time, remove the unchosen parameters from the -an list, keeping only the ones selected.

If you hesitate over which parameters to include, look at the .tranches and .tranches.pdf files to see which parameter combination gives the better result. How can you tell which result is good? (1) The Ti/Tv ratio. (2) The number of variants retained. (3) The proportions of true positives and false positives. Generally speaking, the closer Ti/Tv is to 3, the better; the smaller it is, the more false positives.

[Figure omitted: example contents of a .tranches file.] It lists the numbers of novel and known variants, and the corresponding Ti/Tv ratios, at each given target truth sensitivity. The higher the target truth sensitivity, the more variants are included and the more likely false positives become; the more specificity you want in order to reduce false positives, the lower the target truth sensitivity should be. Target truth sensitivities are set with --TStranche. VariantRecalibrator by itself only gives the --TStranche 99 option, but we can set several more thresholds in the command, so that we can later choose our own trade-off between sensitivity and specificity.

The same applies to indels; only the training resource differs from the SNP case. In addition, some other parameters are set according to GATK's recommendations.

java -Xmx15g -jar $gatk -T VariantRecalibrator -R $ref_dir/hg19.fasta -mode INDEL --maxGaussians 4 -std 10.0 -percentBad 0.12 -input $vcf_dir/AllSamples.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf -an MQ -an FS -an InbreedingCoeff -recalFile $other_dir/AllSamples_indel.recal -tranchesFile $other_dir/AllSamples_indel.tranches -rscriptFile $other_dir/AllSamples_indel.plots.R

Likewise, choose the -an flags and run it once more.

B. ApplyRecalibration. Apply it to SNPs first, then to indels.

java -Xmx15g -jar $gatk -T ApplyRecalibration -R $ref_dir/hg19.fasta -mode SNP -input $vcf_dir/AllSamples.vcf -tranchesFile $other_dir/AllSamples_snp.tranches -recalFile $other_dir/AllSamples_snp.recal -o $vcf_dir/AllSamples_t90.snprecal.vcf --ts_filter_level 90
java -Xmx15g -jar $gatk -T ApplyRecalibration -R $ref_dir/hg19.fasta -mode INDEL -input $vcf_dir/AllSamples_t90.snprecal.vcf -tranchesFile $other_dir/AllSamples_indel.tranches -recalFile $other_dir/AllSamples_indel.recal -o $vcf_dir/AllSamples_t90.bothrecal.vcf

--ts_filter_level 90 marks variants within target truth sensitivity 90 as PASS in the FILTER field; the rest are flagged according to the --TStranche thresholds we set, e.g. VQSRTrancheSNP90.00to93.00, VQSRTrancheSNP93.00to95.00, and so on.

(8) Filtering

After finishing VQSR, you can filter by the FILTER column of the VCF file. There is no fixed standard for filtering; it mainly depends on your own expectations, a choice between specificity and sensitivity, with the data in the .tranches file as the main reference. If you want to include as many novel variants as possible and do not care too much about false positives, you can choose a higher target truth sensitivity, for example 97. Our filter strategy is then: keep the variants whose FILTER value is PASS, VQSRTrancheSNP90.00to93.00, VQSRTrancheSNP93.00to95.00, or VQSRTrancheSNP95.00to97.00, and filter out the rest. You can use awk.
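As a sketch of that awk step (the function name is made up; it assumes the FILTER value sits in column 7 of the VCF, which is standard):

```shell
# Keep VCF header lines plus records whose FILTER column is one of the
# accepted values; everything else is dropped.
filter_t97() {
  awk -F'\t' 'BEGIN {
      keep["PASS"]; keep["VQSRTrancheSNP90.00to93.00"];
      keep["VQSRTrancheSNP93.00to95.00"]; keep["VQSRTrancheSNP95.00to97.00"];
    }
    /^#/ { print; next }
    $7 in keep { print }' "$1"
}
```

Usage: filter_t97 AllSamples_t90.bothrecal.vcf > AllSamples_t90-97.bothrecal.vcf (output file name illustrative).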

(9) Variant evaluation

The tool used is VariantEval. It can give a lot of summary information, such as Ti/Tv, the number of variants per sample, and so on.

java -Xmx70g -jar $gatk -T VariantEval -R $ref_dir/genome.fa --eval:AllSamples $vcf_dir/AllSamples_t90-97.bothrecal.vcf -o $other_dir/AllSamples_t90.eval.gatkreport -D $ref_dir/dbsnp_137.hg19.vcf -ST Sample -ST Filter -noEV -EV TiTvVariantEvaluator
java -Xmx70g -jar $gatk -T VariantEval -R $ref_dir/genome.fa --eval:AllSamples $vcf_dir/AllSamples_t90-97.bothrecal.vcf -o $other_dir/AllSamples_t90.eval_1.gatkreport -D $ref_dir/dbsnp_137.hg19.vcf -ST Sample -ST Filter -noEV -EV CompOverlap -EV CountVariants -EV IndelSummary -EV MultiallelicSummary -EV ValidationReport

This step is very memory-intensive, so sometimes you have to run it multiple times, producing a portion of the results each time.

-ST Sample counts each sample separately.

-noEV: because -EV includes too many modules, running them all is a bit overwhelming, so disable them all and then use -EV flags to select just the metrics we are interested in summarizing.

This article comes from:

http://blog.sina.com.cn/s/blog_681aaa550101bhdc.html

http://blog.sina.com.cn/s/blog_681aaa550101cdmk.html

