Common software for calling indels and SVs

Source: Internet
Author: User

Indel calling is harder than SNP calling because insertions and deletions easily disrupt read alignment, which produces many false-positive SNPs around an indel and also lowers the accuracy of the indel call itself. In theory, the best way to detect indels is to do a de novo assembly and then compare the assembled genome with the reference genome, but de novo assembly is in practice even harder.

Paired-end sequencing provides very useful information for finding long indels, but using that information accurately is still a hard problem.

The following is a brief introduction to some existing indel-calling software and how to use it. Note that the tools introduced here are basically all aimed at Illumina paired-end data.

(1) Samtools mpileup

The mpileup command in samtools can call SNPs and can also call indels:

samtools mpileup -DSugf ref.fasta sample.bam |
    bcftools view -Ncvg -

The -N option here skips sites where the reference base is N.

Note how useful the pipe is here. The direct output of mpileup is not the result we see in the final VCF file but an intermediate result, so there is no need to write it to disk: the Linux pipe "|" passes it straight to the next command, here bcftools (as we will see later, other software can also process mpileup output). That is why the last argument of bcftools is "-", which stands for standard input (stdin), that is, whatever is piped to it. The name is quite apt: a pipe does not require you to save the intermediate result and read it back; it connects two commands directly, feeding the output of one into the input of the other. If you like, you can chain more than two commands with pipes ...
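The same pipe idea can be reproduced from a script. Below is a minimal Python sketch, assuming two arbitrary commands given as argument lists; the helper name run_piped and the stand-in commands are made up for illustration. It connects the first command's standard output to the second command's standard input without writing an intermediate file.

```python
import subprocess
import sys


def run_piped(first_cmd, second_cmd):
    """Run `first_cmd | second_cmd` and return the second command's stdout as text."""
    first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)
    second = subprocess.Popen(second_cmd, stdin=first.stdout,
                              stdout=subprocess.PIPE, text=True)
    # Close our copy of the pipe so the first command is notified
    # if the second one exits early.
    first.stdout.close()
    output, _ = second.communicate()
    first.wait()
    return output


# Stand-in commands so the sketch runs anywhere; in practice these would be
# the samtools and bcftools argument lists.
producer = [sys.executable, "-c", "print('chr1 100 a'); print('chr1 200 t')"]
consumer = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]

if __name__ == "__main__":
    print(run_piped(producer, consumer), end="")
```

Passing argument lists rather than one shell string avoids quoting problems with file names.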

Details of mpileup can be found in the samtools documentation.
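Since mpileup can call both SNPs and indels, it is sometimes handy to keep only the indel records of a VCF file. The following Python sketch is one simple way to do that, under the assumption that alleles are written as plain sequences: a record is treated as an indel when any ALT allele differs in length from REF. The function names are made up for illustration.

```python
def is_indel(ref, alt):
    """Return True if any ALT allele differs in length from REF."""
    alleles = [a for a in alt.split(",") if a not in (".", "*")]
    return any(len(a) != len(ref) for a in alleles)


def indel_records(vcf_lines):
    """Yield header lines and indel records from an iterable of VCF text lines."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep the header unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 4 is REF and column 5 is ALT in the VCF format.
        if len(fields) >= 5 and is_indel(fields[3], fields[4]):
            yield line


example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\t.\t.\n",
    "chr1\t200\t.\tAT\tA\t60\t.\t.\n",
    "chr1\t300\t.\tC\tCGG,T\t70\t.\t.\n",
]

if __name__ == "__main__":
    for record in indel_records(example):
        print(record, end="")
```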


(2) GATK UnifiedGenotyper

You can refer to someone else's workflow:

GATK has been discussed at length in previous posts, and calling indels differs little from calling SNPs. For example:

java -jar GenomeAnalysisTKLite.jar \
    -R ref.fasta \
    -T UnifiedGenotyper \
    -I sample.bam \
    -glm INDEL \
    -nt 4 \
    -stand_call_conf 50.0 \
    -stand_emit_conf 0 \
    -rf BadCigar

The key here is the -glm parameter: it is set to INDEL, so the output contains only indels.

Also, I have now switched to the Lite version, so I am finally no longer held back by not having a GATK key.

-nt sets the number of threads (I never knew UnifiedGenotyper supported multithreading; now that this can be set, there is no need to worry about it running too slowly ...). Besides -nt, -nct also controls threading, and the difference is subtle. The concept of data threads is not easy to grasp: -nt is the number of data threads, and -nct is the number of CPU threads allocated to each data thread. Set them according to your machine's resources. I generally use -nt, which consumes more memory but seems to be somewhat faster ...

From personal experience, version 2.2 inexplicably calls far fewer indels: many indels that 2.1 called do not appear with 2.2 no matter how I adjust the parameters (quite possibly I am using the software the wrong way). The AD values in 2.2 also seem to be buggy, as many are obviously wrong. On the other hand, the default of -maxAltAlleles has risen to 6 without slowing things down, which is impressive (although with that many alt alleles I sometimes wonder how meaningful they are ...). The bugs will presumably be fixed, so let's wait for GATK to keep improving.

Finally, -rf (full name --read_filter) filters the reads in the input BAM file. GATK checks the CIGAR string of each read, and BAM files produced by some mapping software do not meet its standard, so GATK may stop with an error such as a malformed read. Adding -rf BadCigar removes these non-conforming reads so that GATK can run normally. A classmate ran into this problem a while ago, and I only remembered afterwards that adding this parameter should solve most such problems (if it still fails, it may be a bug in that version; similar cases seem to have been reported on the GATK forum, so try a few other versions ...).

(3) SHORE

Ossowski, S. et al. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res. 18, 2024–2033 (2008).

SHORE is a complete pipeline and software suite developed by a group doing Arabidopsis mapping and data analysis.

As the citation shows, it comes from the group of Detlef Weigel, a leading figure in the field, and the later 1001 Genomes project basically used this pipeline (they are of course the same group of people).

Compared with other software, the SHORE workflow is slightly more complicated: it has its own data structures, so the data must be converted first, and there is a correction step for paired-end reads. By default it only calls indels of 1 to 3 bp.

SHORE was actually the first software I used for mapping (because I mainly worked on A. thaliana), but I later felt it was not that pleasant to use. It does look like a very good piece of software (there is a publication to prove it), although it seems to be used mostly by its own authors, who wrote most of the papers that use it. So I will be lazy and skip it here; if you are interested you can study it yourself, as it is functionally very powerful ...

(4) VarScan

Koboldt, D. C. et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25, 2283–2285 (2009).

Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).

Judging by its citation count, this is a fairly popular variant detection program, and naturally it can call indels.

Earlier we saw that the output of samtools mpileup can be passed to bcftools to call indels; it can of course be processed by other software too. VarScan likewise calls indels from mpileup output. For example:

samtools mpileup -f ref.fasta sample.bam |
    java -jar VarScan.v2.3.3.jar mpileup2indel \
    --output-vcf 1 \
    --vcf-sample-list sample_names.list

VarScan is a Java program. --output-vcf 1 sets the output format to VCF; otherwise the output is in VarScan's own format. --vcf-sample-list is optional, but without it the samples in the generated VCF file are named 1, 2, 3 ... and have to be renamed afterwards. With this parameter you supply a list of sample names, in the same order as the BAM files you pass in, so that the VCF file carries the right sample names. The other parameters can generally be left at their defaults. Note that mpileup2indel is used here, matching samtools mpileup; back when samtools still used pileup, the corresponding command was pileup2indel (for as long as I can remember pileup has been obsolete and unused; how the world changes ...). The program may also warn that many lines of the mpileup output could not be parsed; this can usually be ignored ...
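The sample name list can be written by hand, but with many BAM files a small script is less error-prone. The following Python sketch assumes the list holds one sample name per line and that each sample name is simply the BAM file name without its .bam extension; both function names are made up for illustration.

```python
import os


def sample_name(bam_path):
    """Derive a sample name from a BAM file name, e.g. 'dir/s1.bam' -> 's1'."""
    base = os.path.basename(bam_path)
    return base[:-4] if base.endswith(".bam") else base


def write_sample_list(bam_paths, list_path):
    """Write one sample name per line, in the same order as the BAM files."""
    names = [sample_name(path) for path in bam_paths]
    with open(list_path, "w") as handle:
        for name in names:
            handle.write(name + "\n")
    return names
```

Pass the BAM files to write_sample_list in exactly the order they are given to samtools mpileup, since the names are matched to samples by position.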

Overall, usage is very simple: it amounts to replacing bcftools with VarScan, and I believe you can get started easily.

Next come a few programs written specifically for calling indels. They all have "indel" in their names, so they look very professional at a glance ;-)

(5) Dindel

Albers, C. A. et al. Dindel: accurate indel calls from short-read data. Genome Res. 21, 961–973 (2011).

Another product of the Sanger centre. A search turns up another link for it, but that link seems to be dead (or blocked).

As for why Dindel is called Dindel, my guess is that it abbreviates "detect indel" or "discover indel"; whether that is right is not something we need to worry about ...

Dindel also takes a BAM file as input. The whole process has four steps (yes, another complicated one); the code for each step is roughly as follows.

## Stage 1: extract all candidate indels from the BAM file. While doing so,
## the software also automatically estimates the insert size of the
## paired-end reads.

dindel --ref ref.fasta \
    --analysis getCIGARindels \
    --bamFile sample.bam \
    --outputFile sample.dindel_output

Two files are generated at the end of the first step. sample.dindel_output.variants.txt contains all the candidate indels, and sample.dindel_output.libraries.txt contains the insert size distribution. The next step mainly uses the first file.

## Stage 2: group the candidate indels above into windows (each about
## 120 bp). This step is run with the Python script provided with Dindel;
## the input file is the sample.dindel_output.variants.txt generated in
## the first step.

--inputVarFile sample.dindel_output.variants.txt
--numWindowsPerFile 100000

The --numWindowsPerFile parameter sets how many windows each file contains, and depending on its value this step generates several files prefixed with, such as .....

## Stage 3: this step essentially realigns the reads within each window.
## It uses the original BAM file, the sample.dindel_output.libraries.txt
## file generated in the first step, and the window files generated in
## the second step, such as

dindel --ref ref.fasta \
    --analysis indels \
    --bamFile sample.bam \
    --libFile sample.dindel_output.libraries.txt \
    --outputFile sample.dindel_output.stage2

This again generates several files, with the prefix sample.dindel_output.stage2 and the suffix .glt.txt (this software not only has many steps, it also produces a lot of files). Readers who know a little about the topic may have noticed that these steps are essentially the same as the GATK realignment we discussed before. This tells us that realigning reads around indels is very important for the accuracy of both SNP and indel calls ...

## Stage 4: generate the final result

--ref ref.fasta
--inputFiles sample.dindel_output.stage2.list.txt

The input file sample.dindel_output.stage2.list.txt lists all the glt.txt files generated in the previous step, that is, the sample.dindel_output.stage2.*.glt.txt files. If they are not in the current directory, each file name must be preceded by its relative path; to avoid trouble, simply use absolute paths throughout. This step then produces the results as a VCF file.
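Writing that list by hand is tedious when there are many window files. The following Python sketch collects the glt.txt files for a given prefix and writes their absolute paths to the list file. It assumes the list holds one path per line, and the function name write_glt_list is made up for illustration.

```python
import glob
import os


def write_glt_list(prefix, list_path):
    """Write the absolute paths of all <prefix>.*.glt.txt files to list_path."""
    # glob.escape protects any special characters in the directory name.
    pattern = glob.escape(prefix) + ".*.glt.txt"
    paths = sorted(os.path.abspath(p) for p in glob.glob(pattern))
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths


if __name__ == "__main__":
    found = write_glt_list("sample.dindel_output.stage2",
                           "sample.dindel_output.stage2.list.txt")
    print(len(found), "glt.txt files listed")
```

Using absolute paths follows the advice above and keeps the list valid whichever directory the last step is run from.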

The whole process is not really that complicated, but Dindel's speed is not flattering. The bottleneck is the third step: after waiting a day and a night for it to finish just two or three samples, I could not stand it any longer and killed it, so I never actually ran the fourth step (perhaps my indels are just harder to detect). Your experience may differ depending on your data.

(6) Pindel

Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).

Unlike the other indel callers, Pindel uses a pattern growth algorithm to detect indels and other structural variants (which is presumably why it is called P-indel); see the citation above for the algorithm. Its citation count is also decent, which suggests the algorithm has real advantages.

Pindel accepts several types of input files. I generally prefer one of them, which also seems to be the one the software itself recommends. The workflow is very simple.

The first step calls indels and some other structural variants such as inversions and tandem duplications:

pindel -f ref.fasta -i sample.pindel.config -c ALL -T 2 -o sample

This step already gives us all the results. sample.pindel.config is a configuration file that holds all the BAM file names and their insert sizes, and Pindel reads it as its input. Its contents are formatted as follows:

sample.bam 250 sample

The first column is the BAM file name (with its path where needed), and the second column is the insert size.

* A note on the insert size: the insert size is the length of the fragments produced when the DNA is sheared during library construction. The paired-end reads you sequence come from the two ends of such a fragment:

|--75--|------------------------------|--75--|

Here the whole 250 bp is the length of a fragment after shearing, and 75 bp is sequenced from each end. This 250 bp is what we call the insert size. You can ask the sequencing company for it (I have never asked, so I do not know whether they will tell you ...), or estimate it with software: the first step of Dindel above estimates it, and Picard also has the CollectInsertSizeMetrics tool for this.
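If you only want a rough figure, the insert size can also be estimated from the alignments themselves. The following Python sketch is only an illustration and not what Dindel or Picard actually do: it takes SAM-format text lines (for example the output of samtools view) and returns the median of the positive template lengths in column 9 (TLEN). The function names are made up.

```python
import statistics


def template_lengths(sam_lines):
    """Yield positive TLEN values (SAM column 9) from SAM text lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete alignment record
        tlen = int(fields[8])
        if tlen > 0:
            # Only the leftmost read of a pair has a positive TLEN,
            # so each pair is counted once.
            yield tlen


def estimate_insert_size(sam_lines):
    """Return the median template length, or None if no pairs were seen."""
    lengths = list(template_lengths(sam_lines))
    return statistics.median(lengths) if lengths else None
```

The median is used because a few wrongly mapped pairs can have enormous template lengths that would distort a mean.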

Because this length is really a range (usually a fairly narrow one) within which almost all fragment lengths fall, an approximate value is enough here; it does not need to be very accurate. The last column is a label. Since there can be more than one BAM file, the label is used in place of the file name in the final results to tell apart reads from different sources. Columns are separated by tabs or spaces.
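The configuration format described above (BAM file, insert size, label) is easy to generate with a short script. Here is a minimal Python sketch; the function name write_pindel_config and the example values are made up for illustration.

```python
def write_pindel_config(entries, config_path):
    """Write one 'bam<TAB>insert_size<TAB>label' line per entry.

    entries is an iterable of (bam_path, insert_size, label) tuples.
    """
    with open(config_path, "w") as handle:
        for bam, insert_size, label in entries:
            handle.write(f"{bam}\t{int(insert_size)}\t{label}\n")


if __name__ == "__main__":
    write_pindel_config(
        [("sample.bam", 250, "sample")],  # hypothetical file, size and label
        "sample.pindel.config",
    )
```

Each additional BAM file becomes one more tuple in the list, with its own insert size and label.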

The -c parameter sets the region to analyse, and -c ALL means the whole genome. -T is the number of threads. There is also a -w parameter that can be used to control memory usage; with plenty of memory it can be ignored.

Finally, -o sets the output prefix. By default all indel and structural variant types are output, each in a file whose name ends with one of the following suffixes:

D = deletion

SI = short insertion

INV = inversion

TD = tandem duplication

LI = large insertion; the format of this file is very different from the others

BP = unassigned breakpoints, that is, the remaining breakpoints that do not fit any of the types above
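To check that a run produced everything, the suffix list above can be turned into a small lookup table. The following Python sketch assumes the output files are named as the prefix, an underscore, and the suffix (for example sample_D); the names PINDEL_SUFFIXES and expected_outputs are made up for illustration.

```python
PINDEL_SUFFIXES = {
    "D": "deletion",
    "SI": "short insertion",
    "INV": "inversion",
    "TD": "tandem duplication",
    "LI": "large insertion",
    "BP": "unassigned breakpoints",
}


def expected_outputs(prefix):
    """Map each expected output file name (<prefix>_<suffix>) to its variant type."""
    return {f"{prefix}_{suffix}": kind
            for suffix, kind in PINDEL_SUFFIXES.items()}


if __name__ == "__main__":
    for name, kind in expected_outputs("sample").items():
        print(name, "->", kind)
```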

We can then convert these files into the usual VCF format with the pindel2vcf program provided with Pindel, which is convenient for downstream processing:

pindel2vcf -r ref.fasta -R ref_name -d 20121208 -P sample -v sample.indel.vcf

Here -R sets a name for the reference sequence and -d sets its date (the date the reference sequence was generated; an arbitrary value should also work, as it is mainly there for the sake of a well-formed header). -P is the output prefix used in the previous step, sample in this example (so all the sample_* result files are read), and -v is the name of the VCF file to generate.

Pindel is fairly simple to use and relatively fast, but the VCF file it generates is not entirely standard, which can be inconvenient when processing it with GATK. Adding the -G option makes the output conform as far as possible to what GATK expects.

(7) SOAPindel

Li, S. et al. SOAPindel: efficient identification of indels from short paired reads. Genome Res. doi:10.1101/gr.132480.111
