Description of the VCF (Variant Call Format) format

Source: Internet
Author: User

VCF File Sample (VCFv4.2)

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G   A      29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T   A      3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A   G,T    67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T   .      47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTC G,GTCT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

CHROM: The contig on which the variant was called. For a human whole genome this is chr1 ... chr22, chrX, chrY, or chrM.

POS: The position of the variant site on the reference genome. For an indel, it is the position of the first base.

ID: If the called SNP exists in the dbSNP database, the corresponding rs number from dbSNP is shown here; otherwise the field is ".".

REF and ALT: The base(s) at this site in the reference genome (REF) and the corresponding base(s) in the genome of the sample under study (ALT).

QUAL: The quality score of the called variant site. Q = -10 log10(p), where Q is the quality value and p is the probability that the call at this site is wrong. So to keep the accuracy above 90%, the threshold for p is 1/10: log10(1/10) = -1, so Q = (-10) * (-1) = 10. Similarly, Q = 20 corresponds to an error rate of 0.01.
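The conversion between Q and p can be checked with a few lines of Python. This is only a minimal sketch; the function names phred_to_error_prob and error_prob_to_phred are illustrative and do not come from any VCF library.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q into an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p into a Phred-scaled quality Q = -10 * log10(p)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q=10 corresponds to p of about 0.1, Q=20 to about 0.01, Q=30 to about 0.001.
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```

The same functions apply to any Phred-scaled field, so they can be reused for GQ and for the individual PL values.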

FILTER: Ideally, QUAL would be computed from a complete error model and would by itself identify the true variant sites, but in practice it does not. The raw calls therefore need further filtering. Whatever filtering method is used, it leaves a record in the FILTER column: sites that meet the filtering criteria are marked PASS, and sites that fail carry a value other than PASS that describes why. A "." in this column means that no filtering has been applied.
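As a rough illustration of how the FILTER column might be read in a script, the sketch below classifies a FILTER value. The function name describe_filter is hypothetical; the only assumption beyond the text above is the VCF convention that several failed filters are written as a semicolon-separated list.

```python
def describe_filter(filter_field: str) -> str:
    """Describe the FILTER column of one VCF record in plain words."""
    if filter_field == ".":
        # A missing value means no filtering has been applied to this site.
        return "no filtering applied"
    if filter_field == "PASS":
        return "passed all filters"
    # Anything else is a semicolon-separated list of the filters that failed.
    failed = filter_field.split(";")
    return "failed: " + ", ".join(failed)


if __name__ == "__main__":
    for value in ("PASS", ".", "q10", "q10;s50"):
        print(value, "->", describe_filter(value))
```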

Example:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD=0.1185 GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255

We can now interpret the example above:

chr1:873762 is a newly discovered T/G variant with high confidence (QUAL=5231.78).

chr1:877664 is a known A/G SNP, named rs3828047, with high confidence (QUAL=3931.66).

chr1:899282 is a known C/T SNP, named rs28548431, but with lower confidence (QUAL=71.77).

chr1:974165 is a known T/C SNP, named rs9442391, but the quality of the site is very low; it is labeled "LowQual" and can be filtered out in subsequent analysis.
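The same reading of the records can be sketched in Python. The helper summarize is hypothetical; it assumes a whitespace-separated data line (real VCF files are tab-delimited) and treats an ID of "." as a variant with no dbSNP entry.

```python
def summarize(line: str) -> str:
    """Summarize the first seven fixed columns of one VCF data line."""
    chrom, pos, vid, ref, alt, qual, flt = line.split()[:7]
    # An ID of "." means the variant has no identifier, i.e. it is not a known site.
    status = "novel" if vid == "." else f"known ({vid})"
    return f"{chrom}:{pos} {ref}/{alt} {status}, QUAL={qual}, FILTER={flt}"


if __name__ == "__main__":
    # INFO, FORMAT and sample columns are clipped; only the first seven columns are used.
    records = [
        "chr1 873762 . T G 5231.78 PASS",
        "chr1 877664 rs3828047 A G 3931.66 PASS",
        "chr1 899282 rs28548431 C T 71.77 PASS",
        "chr1 974165 rs9442391 T C 29.84 LowQual",
    ]
    for record in records:
        print(summarize(record))
```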

A VCF file looks complicated and intimidating, but most of it consists of tags, and most of these tags are used by the VQSR filter. It is best to understand what each tag means, but if you really cannot make sense of one, you can leave it alone. The most important information is in just a few columns:

chr1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255

chr1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0

chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26

The last two columns correspond to each other: each tag in the FORMAT column maps to one value or one set of values in the sample column. For example:

At chr1:873762, GT corresponds to 0/1, AD to 173,141, DP to 282, GQ to 99, and PL to 255,0,255.
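This pairing of FORMAT tags with sample values is easy to express in code. The sketch below uses a hypothetical helper, parse_sample, which splits both columns on ":" and pairs them by position; values are kept as strings and no type conversion is attempted.

```python
from typing import Dict


def parse_sample(format_field: str, sample_field: str) -> Dict[str, str]:
    """Pair each FORMAT tag with the value at the same position in the sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so a sample column with fewer values
    # than FORMAT tags simply yields no entry for the missing trailing tags.
    return dict(zip(keys, values))


if __name__ == "__main__":
    sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255")
    print(sample["GT"], sample["AD"], sample["DP"], sample["GQ"], sample["PL"])
```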

GT: The genotype of this sample. For a diploid organism, GT gives the two alleles that the sample carries at the site: 0 means the REF allele, 1 the first ALT allele, 2 the second ALT allele, and so on. When there is only one ALT allele, 0/0 is homozygous and identical to REF; 0/1 is heterozygous, with one REF allele and one ALT allele; 1/1 is homozygous for ALT. GT is the most common FORMAT subfield, and if it is present it must be the first subfield. The allele separator is '/' for unphased genotypes and '|' for phased genotypes.

0 - reference call

1 - alternative call 1

2 - alternative call 2
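The numeric allele codes can be translated back into bases with a short helper. This is a sketch only: decode_genotype and zygosity are hypothetical names, and a "." allele is treated as a missing call.

```python
from typing import List, Optional, Tuple


def decode_genotype(gt: str, ref: str, alts: List[str]) -> Tuple[List[Optional[str]], bool]:
    """Translate a GT string such as '0/1' or '1|2' into bases, plus a phased flag."""
    phased = "|" in gt
    separator = "|" if phased else "/"
    alleles = [ref] + alts  # index 0 is REF, 1 is the first ALT, 2 the second, ...
    called = [None if i == "." else alleles[int(i)] for i in gt.split(separator)]
    return called, phased


def zygosity(gt: str) -> str:
    """Classify a genotype as homozygous reference, heterozygous, or homozygous alternate."""
    indices = gt.replace("|", "/").split("/")
    if "." in indices:
        return "missing"
    if len(set(indices)) > 1:
        return "heterozygous"
    return "homozygous reference" if indices[0] == "0" else "homozygous alternate"


if __name__ == "__main__":
    print(decode_genotype("0/1", "C", ["T"]), zygosity("0/1"))
    print(decode_genotype("1|2", "A", ["G", "T"]), zygosity("1|2"))
```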

AD: Two comma-separated values giving the number of reads that cover the REF and ALT bases, i.e. the sequencing depth supporting REF and ALT respectively.

DP: The number of reads covering the site, i.e. the depth at this position (not all reads, but only the reads that pass a certain quality threshold).
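One common use of AD is to compute the fraction of reads supporting each allele. The sketch below assumes a hypothetical helper name, allele_fractions, and divides by the sum of the AD counts rather than by DP, because in the example records the AD values do not always add up to the sample DP.

```python
from typing import List


def allele_fractions(ad_field: str) -> List[float]:
    """Return the fraction of AD reads supporting each allele (REF first, then ALT)."""
    counts = [int(c) for c in ad_field.split(",")]
    total = sum(counts)
    if total == 0:
        # No supporting reads at all; avoid dividing by zero.
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    # AD=1,3 from the chr1:899282 record: one read supports REF, three support ALT.
    print(allele_fractions("1,3"))  # [0.25, 0.75]
```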

PL: Three comma-separated values giving the normalized, Phred-scaled likelihoods (L), without priors, of the genotypes 0/0, 0/1 and 1/1. To convert a value to a genotype likelihood (P): since L = -10 log10(P), P = 10^(-L/10), so when L is 0, P = 10^0 = 1. Hence the smaller the value, the better supported, and the more likely, the genotype.
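The conversion P = 10^(-L/10) can be applied to a whole PL field at once. The following is a minimal sketch with a hypothetical function name, pl_to_likelihoods; the results are likelihoods relative to the best genotype (whose PL is 0), not probabilities that sum to 1.

```python
from typing import List


def pl_to_likelihoods(pl_field: str) -> List[float]:
    """Convert a comma-separated PL field into relative genotype likelihoods."""
    return [10 ** (-int(value) / 10) for value in pl_field.split(",")]


if __name__ == "__main__":
    # PL=103,0,26 for genotypes 0/0, 0/1, 1/1 at chr1:899282.
    for genotype, likelihood in zip(("0/0", "0/1", "1/1"), pl_to_likelihoods("103,0,26")):
        print(genotype, likelihood)
```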

GQ: The quality of the most probable genotype. It is interpreted in the same way as QUAL.

For example:

chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26

Here GT=0/1, so the genotype at this site is C/T. GQ=25.92 is not very high, probably because too few reads cover the site: DP=4, i.e. only 4 reads cover this position. AD=1,3 means that 1 read supports REF and 3 reads support ALT. The PL values make the uncertainty of the genotype clearer: the PL of 0/1 is 0, so 0/1 is by far the best-supported genotype, but the PL of 1/1 is only 26, which means 1/1 still has a relative likelihood of 10^(-2.6) = 0.25%. The genotype is almost certainly not 0/0, whose relative likelihood is only 10^(-10.3) = 5*10^-11.

VCF (Variant Call Format) version 4.1

The VCF specification is no longer maintained by the 1000 Genomes Project. The group leading the management and expansion of the format is the Global Alliance for Genomics and Health Data Working Group file format team, http://ga4gh.org/#/fileformats-team

The main version of the specification can be found at https://github.com/samtools/hts-specs

This is under continued development; please check the hts-specs page for the most recent specification.

A PDF of the v4.1 spec is http://samtools.github.io/hts-specs/VCFv4.1.pdf
A PDF of the v4.2 spec is http://samtools.github.io/hts-specs/VCFv4.2.pdf

VCFtools hosts a discussion list about the specification called vcf-spec: http://sourceforge.net/p/vcftools/mailman/

References:

http://blog.sina.com.cn/s/blog_12d5e3d3c0101qv1u.html

http://samtools.github.io/hts-specs/VCFv4.2.pdf
