Variant call format (VCF)

Introduction

Variant call format (VCF) is a text file format for storing marker and genotype data. This short tutorial describes how variant call format encodes data for single nucleotide variants.

Every VCF file has three parts in the following order:

  1. Meta-information lines (lines beginning with "##").
  2. One header line (the line beginning with "#CHROM").
  3. Data lines containing marker and genotype data (one variant per line). A data line is called a VCF record.

Each VCF record has the same number of tab-separated fields as the header line. The symbol "." is used to denote missing data.

 

Example VCF file

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMP001	SAMP002
20	1291018	rs11449	G	A	.	PASS	.	GT	0/0	0/1
20	2300608	rs84825	C	T	.	PASS	.	GT:GP	0/1:.	0/1:0.03,0.97,0
20	2301308	rs84823	T	G	.	PASS	.	GT:PL	./.:.	1/1:10,5,0
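The three-part layout described above can be checked mechanically. The following is a minimal Python sketch rather than a full parser; the function name split_vcf and the inlined copy of the example file are assumptions made for illustration.

```python
# Inlined copy of the example file; real VCF fields are tab-separated,
# so each column list is joined with tabs here.
EXAMPLE_VCF = "\n".join([
    '##fileformat=VCFv4.1',
    '##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">',
    '##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">',
    '##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods">',
    "\t".join("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP001 SAMP002".split()),
    "\t".join("20 1291018 rs11449 G A . PASS . GT 0/0 0/1".split()),
    "\t".join("20 2300608 rs84825 C T . PASS . GT:GP 0/1:. 0/1:0.03,0.97,0".split()),
    "\t".join("20 2301308 rs84823 T G . PASS . GT:PL ./.:. 1/1:10,5,0".split()),
])


def split_vcf(text):
    """Split VCF text into (meta_lines, header_fields, records)."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#"):
            # The header line starts with a single "#".
            header = line.lstrip("#").split("\t")
        elif line:
            fields = line.split("\t")
            # Every record must have as many fields as the header line.
            if header is None or len(fields) != len(header):
                raise ValueError("malformed record: " + line)
            records.append(fields)
    return meta, header, records


meta, header, records = split_vcf(EXAMPLE_VCF)
print(len(meta), len(header), len(records))  # prints: 4 11 3
```

The sketch reads the whole text into memory, which is fine for small files; a streaming reader would be a better fit for large call sets.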

 

Meta-information lines

Each meta-information line must have the form ##KEY=VALUE and cannot contain white space (other than inside quoted description strings, as in the example). The first meta-information line must specify the VCF version number (version 4.1 in the example). Additional meta-information lines are optional, but are often included to describe terms used in the FILTER, INFO, and FORMAT fields. In the example, the additional meta-information lines say that GT means genotype, GP means the probability of each possible genotype call, and PL means the phred-scaled likelihood of each possible genotype call.
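As an illustration of the KEY=VALUE layout, here is a small Python sketch that parses one meta-information line. The function name parse_meta_line is hypothetical, and the regular expression assumes that quoted values contain no embedded double quotes.

```python
import re


def parse_meta_line(line):
    """Parse one '##KEY=VALUE' line into (key, value).

    Structured values such as <ID=GT,...> are returned as a dict;
    plain values are returned as strings.
    """
    if not line.startswith("##") or "=" not in line:
        raise ValueError("not a meta-information line: " + line)
    key, value = line[2:].split("=", 1)
    if value.startswith("<") and value.endswith(">"):
        # Each pair is NAME=VALUE; a value is either a quoted string
        # or everything up to the next comma.
        pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', value[1:-1])
        value = {name: val.strip('"') for name, val in pairs}
    return key, value


print(parse_meta_line("##fileformat=VCFv4.1"))
key, info = parse_meta_line(
    '##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">'
)
print(key, info["ID"], info["Description"])
```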

 

Marker information

The first nine columns of the header line and data lines describe the variants:

CHROM   The chromosome.
POS     The genome coordinate of the first base in the variant. Within a chromosome, VCF records are sorted in order of increasing position.
ID      A semicolon-separated list of marker identifiers.
REF     The reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC").
ALT     The alternate allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC"). If there is more than one alternate allele, the field should be a comma-separated list of alternate alleles.
QUAL    Probability that the ALT allele is incorrectly specified, expressed on the phred scale: -10 log10(probability).
FILTER  Either "PASS" or a semicolon-separated list of failed quality control filters.
INFO    Additional information, given as a semicolon-separated list of entries (no white space or tabs permitted).
FORMAT  Colon-separated list of data subfields reported for each sample. The FORMAT fields in the example are explained below.
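The column descriptions above translate fairly directly into code. The sketch below is one possible reading, not a reference implementation: the names FIXED_COLUMNS, parse_fixed_fields, and qual_to_error_probability are hypothetical, and the choice to represent missing values (".") as None or an empty list is an assumption.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
                 "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_fixed_fields(line):
    """Return the nine fixed columns of a tab-separated VCF record as a dict."""
    values = line.rstrip("\n").split("\t")[:9]
    rec = dict(zip(FIXED_COLUMNS, values))
    rec["POS"] = int(rec["POS"])
    # "." denotes missing data.
    rec["ID"] = [] if rec["ID"] == "." else rec["ID"].split(";")
    rec["ALT"] = [] if rec["ALT"] == "." else rec["ALT"].split(",")
    rec["QUAL"] = None if rec["QUAL"] == "." else float(rec["QUAL"])
    # FILTER becomes the list of failed filters: empty for PASS, None if missing.
    if rec["FILTER"] == ".":
        rec["FILTER"] = None
    elif rec["FILTER"] == "PASS":
        rec["FILTER"] = []
    else:
        rec["FILTER"] = rec["FILTER"].split(";")
    rec["FORMAT"] = rec["FORMAT"].split(":")
    return rec


def qual_to_error_probability(qual):
    """Invert the phred scale: QUAL = -10 * log10(probability)."""
    return 10 ** (-qual / 10)


# Second record of the example file.
line = "\t".join("20 2300608 rs84825 C T . PASS . GT:GP 0/1:. 0/1:0.03,0.97,0".split())
rec = parse_fixed_fields(line)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["FORMAT"])
```

Under this reading, a QUAL of 30 would correspond to an error probability of about 0.001.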

 

Sample data

After the nine fixed columns, each remaining column holds the data for one individual: the sample identifier in the header line and the colon-separated data subfields in each record. The data subfields in a record must match that record's FORMAT subfields.
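Pairing each sample's data subfields with the record's FORMAT subfields can be sketched in Python as follows. The function name parse_samples is hypothetical, and the strict length check simply follows the sentence above; it is an assumption about how mismatches should be handled.

```python
def parse_samples(header_fields, record_fields):
    """Map each sample identifier to a dict of its data subfields."""
    format_keys = record_fields[8].split(":")
    samples = {}
    # Columns 10 onward: identifiers come from the header, data from the record.
    for name, data in zip(header_fields[9:], record_fields[9:]):
        values = data.split(":")
        if len(values) != len(format_keys):
            raise ValueError(
                f"{name}: expected {len(format_keys)} subfields, got {len(values)}"
            )
        samples[name] = dict(zip(format_keys, values))
    return samples


# Header and second record of the example, split on white space for brevity;
# a real file would be split on tabs.
header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP001 SAMP002".lstrip("#").split()
record = "20 2300608 rs84825 C T . PASS . GT:GP 0/1:. 0/1:0.03,0.97,0".split()
samples = parse_samples(header, record)
print(samples["SAMP002"]["GP"])  # prints: 0.03,0.97,0
```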

The most common FORMAT subfield is GT (genotype). If the GT subfield is present, it must be the first subfield. In the sample data, genotype alleles are numeric: the REF allele is 0, the first ALT allele is 1, and so on. The allele separator is "/" for unphased genotypes and "|" for phased genotypes. In the example, all genotypes are unphased, and the genotypes for SAMP001 are homozygous reference, heterozygous, and missing in the first, second, and third records, respectively.
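The numeric allele coding can be decoded with a few lines of Python. This is a sketch under stated assumptions: decode_genotype is a hypothetical helper, missing alleles are returned as None, and the phased genotype "1|0" in the usage lines is invented for illustration because the example file contains only unphased genotypes.

```python
import re


def decode_genotype(gt, ref, alts):
    """Decode a GT string such as '0/1' into (alleles, phased).

    Index 0 is the REF allele, index 1 the first ALT allele, and so on.
    """
    phased = "|" in gt
    alleles_by_index = [ref] + list(alts)
    alleles = []
    for index in re.split(r"[/|]", gt):
        # "." marks a missing allele call.
        alleles.append(None if index == "." else alleles_by_index[int(index)])
    return alleles, phased


print(decode_genotype("0/1", "C", ["T"]))   # second record, heterozygous
print(decode_genotype("./.", "T", ["G"]))   # third record, missing genotype
print(decode_genotype("1|0", "C", ["T"]))   # hypothetical phased genotype
```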

The second record contains a GP (genotype probability) FORMAT subfield, and the third record contains a PL (phred-scaled genotype likelihood) FORMAT subfield. GP and PL data subfields each hold three comma-separated values corresponding to the REF/REF, REF/ALT, and ALT/ALT genotypes, in that order. To convert a phred-scaled likelihood P to a raw likelihood L, use the formula L = 10^(-P/10).

In the second record of the example, the GP data subfield is missing for SAMP001, and the GP subfield for SAMP002 has probabilities of 0.03, 0.97, and 0 for the REF/REF, REF/ALT, and ALT/ALT genotypes.

In the third record of the example, the PL data subfield is missing for SAMP001. The PL subfield for SAMP002 has phred-scaled likelihoods of 10, 5, and 0, which correspond to raw likelihoods of 0.1, 0.316, and 1 for the REF/REF, REF/ALT, and ALT/ALT genotypes. It is not necessary for the genotype likelihoods to sum to 1.0.
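The conversion from phred-scaled to raw likelihoods (raw = 10 to the power -P/10) can be reproduced with a short Python sketch. The function name phred_to_likelihoods is hypothetical, and returning None for a missing subfield is an assumption.

```python
def phred_to_likelihoods(field):
    """Convert comma-separated phred-scaled likelihoods to raw likelihoods.

    Returns None when the subfield is missing (".").
    """
    if field == ".":
        return None
    return [10 ** (-float(p) / 10) for p in field.split(",")]


# Phred-scaled values 10, 5, 0 from the third record of the example.
likelihoods = phred_to_likelihoods("10,5,0")
print([round(x, 3) for x in likelihoods])  # prints: [0.1, 0.316, 1.0]
print(phred_to_likelihoods("."))           # prints: None
```

The three raw values sum to about 1.416, which illustrates that likelihoods, unlike probabilities, need not sum to 1.0.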

 

Resources

Here are some tools for manipulating VCF files:

  • The Beagle utilities
  • VCFtools
  • vcflib
  • PLINK/SEQ

To learn more about VCF, see:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

http://vcftools.sourceforge.net/VCF-poster.pdf
