Introduction to PacBio Assembly (FALCON)



Home: GitHub: PacificBiosciences/FALCON

From: https://www.cnblogs.com/leezx/p/5724590.html

Introduction

FALCON is a set of tools for quickly aligning long reads for consensus and assembly.

The FALCON toolkit is a simple collection of code used to study efficient assembly algorithms for haploid and diploid genomes.

To speed up computation, some of the backend code is implemented in C; for convenience, the simple frontend is written in Python.

FALCON is not a push-button assembly tool (except for very small genomes). To get the best results, you may need to understand the distributed computing system you use and some basic theory of genome assembly. The FAQ page collects some common questions and answers.

1. Overview of the hierarchical genome assembly process

There are several main steps:

  • raw sub-read overlapping for error correction
  • pre-assembly and error correction
  • overlapping of the error-corrected reads
  • overlap filtering
  • constructing the graph from overlaps
  • constructing contigs from the graph


Each step uses different command-line tools implementing different algorithms, and the computational requirements of each step differ greatly. It is assumed that the user already has a reasonable amount of computing resources.

For example, assembling a 100 Mb genome within a reasonable amount of time may require at least 32 CPU cores and 128 GB of RAM. The code is written for a cluster computing environment: a job queue is required for the long-running scripts and the CPU-intensive compute jobs.

Given a file listing a set of sub-read files, the fc_run.py script drives the workflow: it manages and checks the data dependencies, submits the jobs for each step, and produces a draft assembly from the given data.

fc_run.py is the workflow-driver script. It needs to run on a single machine that can stay up for the long period required to complete the entire assembly. It takes a configuration file, e.g. fc_local.cfg, as its single input; the files of raw sequence data are listed via the configuration file.

The configuration file controls both the use of computing resources and the important algorithm parameters, so that the best possible assembly can be achieved for a given input data set. However, there is no one-size-fits-all way of guessing the most appropriate options and parameters for a specific project. The best way to tune the parameters is to understand some assembly theory, and a little of the implementation, so that you can correctly predict the effect of changing an option on the assembly result. It is also important to quickly inspect the read-length distribution and the overall coverage, and adjust the options accordingly.
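For orientation, here is a minimal sketch of what such a configuration file might look like. The option names are the ones discussed throughout this guide, but the values are illustrative assumptions only, not recommendations:

[General]
# file-of-file-names listing the raw sub-read FASTA files
input_fofn = input.fofn
input_type = raw

# length cutoffs (discussed below)
length_cutoff = 12000
length_cutoff_pr = 12000

# concurrency and SGE job-queue settings (adapt to your cluster)
pa_concurrent_jobs = 32
cns_concurrent_jobs = 32
ovlp_concurrent_jobs = 32
sge_option_da = -pe smp 8 -q jobqueue
sge_option_la = -pe smp 2 -q jobqueue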

Below we describe the hierarchical genome assembly strategy and the meaning of each fc_run.py parameter in detail. For a quick start, see the wiki pages on setup (installation and environment), running, and a complete example.

2. Building overlaps from the raw sub-reads (for error correction)

The overlapping in this version of FALCON is done with a modified version of Gene Myers' daligner (refer to the GitHub version). The main modification adapts some of the I/O for the downstream bioinformatics process. You can see exactly what was changed with git diff.

git diff    # view the differences between versions

The input_fofn option points to the file that lists all the input data files. fasta2DB from the daligner suite is invoked inside fc_run.py. (This I/O-heavy operation runs on the node where fc_run.py itself is running; if that is a problem for your cluster, you will have to modify the code to wrap the relevant operations in a script that is submitted to your job management system.)

input_fofn    # points to the file listing all the input data

This version of fc_run.py supports running the assembly from error-corrected reads, i.e., you can skip the error-correction step and start assembling directly:

input_type = preads    # input is error-corrected reads; skip error correction and start directly with the final overlapping and assembly steps
input_type = raw       # input is raw reads

You need to decide on the length cutoff. In general, choose a threshold such that the reads longer than it give 15x to 20x coverage. However, if you have abundant computing resources and other uses for the corrected reads, you can set a lower length cutoff to obtain more error-corrected reads.
In the end, more reads do not necessarily mean a better assembly, because noise comes with them.
Sometimes it makes sense to try different values of length_cutoff_pr, since the later computations are much cheaper than the first overlapping step for error correction.
Our strategy is therefore to choose a shorter length_cutoff for the one-time error-correction computation, and afterwards try different length_cutoff_pr values to get the best assembly result.

length_cutoff       # controls the cutoff used during the error-correction stage
length_cutoff_pr    # controls the cutoff used in the later contig-assembly (overlapping) stage
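As an illustration of this strategy (the values are assumptions for a hypothetical dataset, not recommendations):

length_cutoff = 5000        # short cutoff for the one-time error-correction computation
length_cutoff_pr = 12000    # vary this between assembly runs to tune the final result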

The pa_concurrent_jobs option controls the number of concurrent jobs submitted by fc_run.py. sge_option_da and sge_option_la control the job queue and the number of slots for the daligner jobs. daligner uses 4 threads by default, but you may want more than 4 slots depending on the cluster settings and the memory of your compute nodes. To choose the best number, consult your local HPC experts and run some initial tests.

pa_concurrent_jobs    # number of concurrent jobs that fc_run.py submits
sge_option_da         # job queue for the daligner jobs
sge_option_la         # number of slots for the daligner jobs
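For example, on an SGE cluster the settings might look like this (illustrative values; check them with your HPC staff first):

pa_concurrent_jobs = 32                  # at most 32 daligner jobs in flight at once
sge_option_da = -pe smp 8 -q jobqueue    # 8 slots per daligner job (daligner uses 4 threads by default)
sge_option_la = -pe smp 2 -q jobqueue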

The final number of jobs depends on how you "split" the sequence database. Read Gene Myers' blog (http://dazzlerblog.wordpress.com) carefully to learn how to set the pa_DBsplit_option and pa_HPCdaligner_option options. In general, for a large genome you should use -s400 (400 Mb of sequence per block) in pa_DBsplit_option. This reduces the number of jobs, but each job runs longer. If your job scheduling system limits how long a job may run, set a smaller number for the -s option.
Another parameter affecting the total number of jobs is the -dal option in pa_HPCdaligner_option. It determines how many blocks are compared with each other in a single job. A larger -dal gives larger jobs but fewer of them; a smaller -dal gives smaller jobs but more submissions.
In this workflow, the trace points generated by daligner are not used. (To be more efficient they should be used, but one first has to work out how to pull them out correctly.) The -s1000 in pa_HPCdaligner_option makes the trace points sparser, saving disk space. We use -l1000 to ignore reads shorter than 1 kb.

pa_DBsplit_option        # -s400 for large genomes; use a smaller -s value if job run time is limited
pa_HPCdaligner_option    # options passed through to daligner:
    -dal      # how many blocks are compared with each other in a single job
    -s1000    # sparser trace points, saving disk space
    -l1000    # ignore reads shorter than 1 kb
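Putting these together, a large-genome setting might look like the following sketch (values are illustrative; see the dazzlerblog for the authoritative meaning of each flag):

pa_DBsplit_option = -x500 -s400    # 400 Mb per block; -x500 excludes reads shorter than 500 bp from the database
pa_HPCdaligner_option = -v -dal4 -t16 -e.70 -l1000 -s1000
# -dal4: compare 4 blocks per job; -e.70: ~70% correlation expected for raw reads;
# -l1000: ignore reads shorter than 1 kb; -s1000: sparser trace points to save disk space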
3. Pre-assembly and error correction

The daligner output is a set of .las files that contain the alignment information between the reads. These alignments are dumped as sequences for error correction by a binary executable, LA4Falcon, and fc_consensus.py generates the consensus (this part is written in C).

fc_consensus.py has many options, which you can control through falcon_sense_option. In most cases, --min_cov and --max_n_read are the most important. --min_cov controls when a seed read is trimmed or broken because of low coverage, and --max_n_read puts a cap on the number of reads used for error correction.
In a highly repetitive genome, set --max_n_read smaller to make sure the consensus code does not waste time aligning repeats. The longest suitable overlaps are used for correction, which reduces the probability of collapsing repeats.

You can use cns_concurrent_jobs to set the maximum number of concurrent jobs submitted to the job management system.

falcon_sense_option    # options for fc_consensus.py
    --min_cov      # when a seed read is trimmed or broken due to low coverage
    --max_n_read   # cap on the number of reads used for error correction
cns_concurrent_jobs    # maximum number of concurrent consensus jobs
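A typical setting, borrowed as an assumption from common FALCON example configurations (adjust it per dataset):

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 6
cns_concurrent_jobs = 32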
4. Overlap detection on the error-corrected reads

This step is very similar to the initial overlapping, with some modifications: daligner only takes native raw reads by default, so fc_run.py generates a FASTA file of the error-corrected reads in which the FASTA headers are parseable by daligner.
The following parameters control the computation in this step:

sge_option_pda = -pe smp 8 -q jobqueue
sge_option_pla = -pe smp 2 -q jobqueue
ovlp_concurrent_jobs = 32
ovlp_DBsplit_option = -x500 -s50
ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s1000

These settings parallel those of the first overlapping step. The main difference is the -e option of ovlp_HPCdaligner_option: the expected error rate is now much lower, so we demand a much higher correlation (96%) between the p-reads.

5. Overlap filtering

Not all overlaps are independent, so some filtering steps can be used to reduce the complexity of the computation and of the assembly graph. For example, if a read is fully contained in another read, its overlaps provide no extra information for reconstructing the genome. Also, because the overlapping relation is transitive, much of the overlap information can simply be inferred.
In fact, the first step of building contigs is to remove the transitively reducible edges. This means that we only need the best n overlaps at the 5' or 3' end of each read. The --bestn parameter of the overlap_filtering_setting option controls the maximum number of overlaps reported per read.

overlap_filtering_setting    # options for the overlap-filtering step
    --bestn    # maximum number of overlaps reported per read

Another useful heuristic is to keep only reads with average coverage at both the 5' and 3' ends. If a read ends inside a repeat region, that end will show higher-than-normal coverage; such reads do not add much value and do not resolve the repeat. We can filter them out and hope that other reads span the repeat with unique anchors on both ends. Likewise, if one end of a read has very low coverage, that may be caused by an excess of sequencing errors there; such reads produce "spurs" in the assembly graph and are usually filtered out. --max_cov and --min_cov are used to filter reads whose end coverage is too high or too low.

--max_cov    # filter reads whose end coverage is too high
--min_cov    # filter reads whose end coverage is too low

The filtering script also allows you to filter out some "split" reads. If a read has very unequal coverage at its 5' and 3' ends, that too is a signal that one end lies in a repeat. --max_diff can be used to filter out reads where the coverage of one end is much higher than that of the other.

--max_diff    # filter out reads where one end's coverage is much higher than the other's

What are the right values for these parameters? They are genuinely hard to set correctly. If the total coverage of the error-corrected reads longer than the length cutoff is known and reasonably high (e.g., greater than 20x), it is usually safe to set min_cov to 5, max_cov to three times the average coverage, and max_diff to twice the average coverage. In low-coverage cases, it may be better to set min_cov to 1 or 2. A helper script called fc_ovlp_stats.py can dump the numbers of 3' and 5' overlaps for a given length cutoff; you can plot the distribution of the overlap counts to make a better decision.
You can also set max_diff and max_cov to very high values to effectively disable the filtering on special occasions.
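For instance, applying the rule of thumb above to a hypothetical dataset with about 30x coverage of error-corrected reads:

overlap_filtering_setting = --bestn 10 --min_cov 5 --max_cov 90 --max_diff 60 --n_core 12
# max_cov  = 3 x 30 = 90 (three times the average coverage)
# max_diff = 2 x 30 = 60 (twice the average coverage)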

This filtering process will remove a significant amount of repeat information from high-copy repeats. That is, those repeats are filtered out and will not show up in the final assembly. If you are interested in them, it is more effective and practical to process the repetitive sequences separately. How to resolve repeats well is a question of great significance.

6. Constructing the graph from overlaps

Given the overlap data, fc_ovlp_to_graph.py creates the string graph using its default parameters. fc_ovlp_to_graph.py generates several files that represent the final, reduced string graph of the assembly. The final ctg_paths file contains the path information of each contig in the graph. A contig is a linear arrangement of simple paths and compound paths. Compound paths are subgraphs that are not simple but have a unique entrance and exit after graph edge reduction; they can be induced by genome polymorphism or by sequencing errors. By explicitly encoding such information in the graph output, we can examine the sequences again later and classify them.
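A minimal invocation sketch (the overlap file name preads.ovl and the option values are illustrative assumptions; the defaults are listed in the usage section further below):

fc_ovlp_to_graph.py --min_len 4000 --min_idt 96 preads.ovl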

7. Constructing contigs from the graph

The final step of creating the draft contigs is to find a single path through each contig graph and to generate the sequence accordingly. When a contig graph is not a simple path, we look for the end-to-end path with the most overlapped bases; this is called the primary contig. For each path in the graph that diverges from the primary one, we construct an associated contig. Where associated contigs are induced by sequencing errors, the identity between the associated contig and the primary contig will be high (> 99% identity most of the time). Where there is true structural polymorphism, there are typically bigger differences between the associated contigs and the primary contigs.

The fc_graph_to_contig.py script takes the sequence data and the graph output and constructs the contigs. At the moment it generates all the associated contigs; post-processing to remove some of the errors in the associated contigs is still under development.
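A sketch of how this step is typically run (the directory and output file names follow common FALCON conventions and are stated here as assumptions):

cd 2-asm-falcon
fc_graph_to_contig.py    # reads the graph output and the p-read sequences from the working directory
# conventional outputs: p_ctg.fa (primary contigs) and a_ctg.fa (associated contigs)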

8. Common daligner options

daligner is controlled by pa_HPCdaligner_option and ovlp_HPCdaligner_option.
To limit memory usage, we can use the -M option. For human genome assemblies we have tested -M 32, i.e., each daligner job uses 32 GB of RAM. Other possibilities are under investigation.
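For example, the limit can be appended to the daligner option string (a sketch, following the option format shown earlier):

pa_HPCdaligner_option = -v -dal4 -t16 -e.70 -l1000 -s1000 -M32    # cap each daligner job at about 32 GB of RAM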
For more daligner details, see the dazzlerblog.

Working directory structure, job recovery, and troubleshooting

The code is designed to work within a single directory. A typical layout of the working directory looks like this:

$ ls -l
total
drwxr-xr-x  jchin Domain Users  8192 Nov 12:30 0-rawreads
drwxr-xr-x  jchin Domain Users  4096 Nov 12:33 1-preads_ovl
drwxr-xr-x  jchin Domain Users  4096 Nov 12:44 2-asm-falcon
-rwxr-xr-x  jchin Domain Users  1041 Nov 12:13 fc_run.cfg
-rw-r--r--  jchin Domain Users   378 Nov 23:20 input.fofn
drwxr-xr-x  jchin Domain Users  4096 Nov 12:13 scripts
drwxr-xr-x  jchin Domain Users 24576 Nov 12:33 sge_log

A typical input.fofn looks like this:

/mnt/data/analysis_results/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads.fasta
/mnt/data/analysis_results/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads.fasta
/mnt/data/analysis_results/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads.fasta
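One simple way to generate such a file (a sketch; the path comes from the example above):

ls /mnt/data/analysis_results/*.subreads.fasta > input.fofn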
1. Inside the 0-rawreads directory

The 0-rawreads directory contains all the scripts and data for overlapping the raw sequences. It holds a number of job_* and m_* directories.
For example, if the E. coli data is divided into 20 blocks, the directory looks like this:

cns_done    da_done    rdb_build_done    run_jobs.sh    prepare_db.sh    raw_reads.db
job_00000   job_00001  ...  job_00059    (one directory per daligner job)
m_00001     m_00002    ...  m_00020      (merge-job directories)
las_files   preads

The job_* directories store the output of each daligner job.

The m_* directories are used for the merge jobs.

There are some empty files whose names end in done. The timestamps of these files are used to track the workflow stage. You can modify the timestamps and restart fc_run.py to trigger recomputation of part of the workflow. However, this is not recommended unless you have spent some time reading the fc_run.py source code and know how to track the dependencies inside the workflow. For example, if you touch rdb_build_done after a successful assembly and then rerun fc_run.py, then, because all intermediate files depend on it and rdb_build_done is now newer than any of them, fc_run.py will repeat the entire assembly process.
las_files stores the final alignment information.

If you are not going to rerun the jobs but want to inspect how the alignments look, you can delete all the job_* and m_* directories but keep the las_files and preads directories.
The main output of this step is stored in 0-rawreads/preads. The out.%04d.fasta files inside 0-rawreads/preads contain the error-corrected reads in FASTA format. If this step finishes successfully, the sentinel file cns_done is created.

2. The 1-preads_ovl directory

This directory stores the data of the p-read versus p-read overlapping. It is roughly parallel to 0-rawreads, but it has no consensus step. The main output is the *.las files in the 1-preads_ovl/ directory.

3. The 2-asm-falcon directory

This is the final output directory. It contains all the information on the assembly graph and the draft contigs; the details are described in the graph-output-format section below.

fc_ovlp_to_graph.py options and assembly graph output format

Here is the fc_ovlp_to_graph.py usage information:

usage: fc_ovlp_to_graph.py [-h] [--min_len MIN_LEN] [--min_idt MIN_IDT]
                           [--lfc]
                           overlap_file

an example string graph assembler designed for handling diploid genomes

positional arguments:
  overlap_file       a file that contains the overlap information.

optional arguments:
  -h, --help         show this help message and exit
  --min_len MIN_LEN  minimum length of the reads to be considered for
                     assembling
  --min_idt MIN_IDT  minimum alignment identity of the reads to be
                     considered for assembling
  --lfc              use local flow constraint method rather than best
                     overlap method to resolve knots in string graph

De-dup example and strategy
TODO

Low-coverage assembly
TODO

Assembly layout rule

CPU usage

Applications of the assembly graph
TODO

C code for sequence alignment and consensus: reproducibility and replicability

Other TODOs:
Incremental overlapping
Pre-processing repeats for overlapping

Terms

Subread:
Full-pass Subread:
Pre-assembly:
Error Correction:
Proper overlap:
String graph:
Contig:
Primary Contig:
Associated Contig:
P-read:
Compound path:
Simple path:
Quiver:

Popular science time:
What is virtualenv?
Every application may need its own "independent" Python runtime environment; virtualenv is used to create an "isolated" Python runtime environment for an application.
References:
virtualenv - Liao Xuefeng's tutorial
virtualenv - a Python virtual sandbox
virtualenv introductory tutorial and help documentation

The default branch is now "master", which contains the latest FALCON source code.
The latest integrated release is v0.3.0; for installation, refer to the v0.3+ integrated installation guide.

The previous release is v0.2.2; for details, refer to the v0.2.2 release. The GitHub repository wiki pages include the developer installation guide, the v0.3+ integrated installation guide, the usage and help documentation, the FAQs, the v0.2.2 release, and the FALCON release notes.
