Sequencing summary: high-throughput sequencing terminology

Source: Internet
Author: User
Tags repetition

Mainly from: Http://mp.weixin.qq.com/s/iTnsYajtHsbieGILGpUYgQ

The gold standard for sequencing: first-generation (Sanger) sequencing, the so-called gold-standard sequencing.

High-throughput sequencing has become more and more popular in recent years, but the world is still full of biologists who do molecular cloning, cell culture, bacterial culture, and protein work. For them Sanger sequencing remains the gold standard: its accuracy is higher than that of second- and third-generation sequencing, and its rock-bottom price keeps its position stable.

Application scope: de novo sequencing; re-sequencing, such as detection of mutations, SNPs, insertions, and deletions; validation of cloning products; comparative genomics; genotyping, such as microbial and fungal identification, HLA typing, and viral typing.

Other: methylation analysis (bisulfite sequencing) and SAGE (serial analysis of gene expression).

Clinical application: detection of tumor mutation genes and individualized treatment of tumors.

C value: the amount of DNA in a haploid nucleus of a eukaryotic cell (half the amount in a fertilized egg or a diploid somatic cell).

Sense strand: also known as the coding strand. It is the strand of the DNA double helix whose sequence matches the RNA transcript (with T in place of U). In other words, the strand with the same nucleotide sequence as the mRNA (which has U instead of T) is called the coding strand or sense strand.
Antisense strand: also known as the template strand. It is the strand of the DNA double helix that, following the base-pairing rules, directs the synthesis of a single strand of RNA during transcription.
Mechanism of action: of the two complementary strands of DNA, the one that carries the protein-coding information is called the sense strand, and the one complementary to it is called the antisense strand.
Antisense nucleic acid technology: the strand of a DNA or RNA structure that carries the coding sequence is called the sense strand, and the strand paired with it is called the antisense strand. Antisense nucleic acids (RNA and DNA) are complementary to their target genes.
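
The relationship between the two strands and the mRNA can be sketched in a few lines of Python. This is only an illustration under simple assumptions (uppercase A/C/G/T input, no ambiguity codes); the function names are made up for this example and do not come from any library.

```python
# Hypothetical helper names; a minimal sketch of the sense/antisense relationship.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def antisense_strand(sense: str) -> str:
    """Return the antisense (template) strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(sense.upper()))

def transcribe(sense: str) -> str:
    """The mRNA has the same sequence as the sense strand, with U replacing T."""
    return sense.upper().replace("T", "U")

sense = "ATGGCC"
print(antisense_strand(sense))  # GGCCAT
print(transcribe(sense))        # AUGGCC
```

Taking the antisense strand of the antisense strand returns the original sense strand, which is a quick sanity check on the pairing rules.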

III. Second-generation sequencing: related terms explained

In high-throughput sequencing, each reaction on the chip reads one relatively short sequence, called a read; reads are the raw data. Many reads that overlap one another can be assembled into a larger fragment, called a contig, and multiple contigs can be joined into a longer scaffold. When a single contig is identified as a protein-coding gene, it is called a singleton; when multiple contigs assembled into a scaffold are identified as a protein-coding gene, it is called a unigene. A unigene does not necessarily correspond to one contig; a unigene can contain multiple contigs.

Consensus sequence: also called a common or uniform sequence. For a set of sequences with the same function, it is the idealized sequence formed by taking the most frequently occurring base or amino acid at each position.
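
As a rough illustration of this definition, the sketch below builds a consensus from sequences that are assumed to be already aligned and of equal length. The function name and the example sequences are invented for illustration; when two symbols tie at a position, the one seen first wins.

```python
from collections import Counter

def consensus(seqs):
    """Most frequent symbol at each column of equal-length, pre-aligned sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) walks the alignment column by column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

aligned = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
print(consensus(aligned))  # TATAAT
```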

Ion Torrent Personal Genome Machine (PGM)

Single-molecule real-time (SMRT) DNA sequencing

Sequencing can be done by "sequencing by synthesis" (SBS), as in 454, Illumina, Ion Torrent, and other technologies, or by "sequencing by ligation" (SBL), as in SOLiD.

Enhancer: a cis-regulatory element that increases transcription efficiency. It was first discovered as a DNA segment of about 200 bp in the SV40 virus that increases transcription of flanking genes by up to 100-fold; enhancers were subsequently found in a variety of eukaryotes and even in prokaryotes. An enhancer is usually 100-200 bp long and, like a promoter, is made up of several elements. The basic core elements are often 8-12 bp and can exist as a single copy or as multiple tandem copies.

What is segmental duplication? Commonly called an SD region, it is a repeat made up of a series of DNA fragments with similar sequences. These repeats play an important role in primate genomes and in human genetic diversity. Large SD sequences are found on human chromosome Y and chromosome 22.

Number of CpG islands: analysis of the draft human genome sequence shows about 28,890 CpG islands in the human genome. Most chromosomes have 5-15 CpG islands per Mb, with an average of about 10 per Mb. The number of CpG islands corresponds well with gene density. DNA methylation is closely related to human development and tumor disease.

DNA methylation is a form of chemical modification of DNA that can alter the heritable phenotype without changing the DNA sequence. It plays an important role in maintaining normal cell function, transmitting genomic imprinting, embryonic development, and tumorigenesis, and has become a research hotspot in epigenetics.

How is the genome assembled? In general, the assembly strategy based on Illumina Genome Analyzer sequencing results is as follows:

(1) First, use short-read assembly software to assemble the paired-end data de novo into contigs. This stage generally requires high-coverage paired-end sequencing data and consumes a lot of computer memory; it is the most difficult step of genome assembly.

(2) Gradually add mate-pair data from long-insert libraries to build scaffolds. Generally the mate-pair sequencing depth does not need to be very high; the distance information between the two ends of each mate pair is used to connect contigs into larger scaffolds.

(3) Review the insert-length information of the paired-end and mate-pair libraries and fill gaps.

(4) Sometimes adding 454 data helps to fill gaps and extend contigs.

What is high-throughput sequencing?

High-throughput sequencing (HTS) is a revolutionary change from traditional Sanger sequencing (known as first-generation sequencing): it sequences hundreds of thousands to millions of nucleic acid molecules at a time, so in some literature it is called next-generation sequencing (NGS). High-throughput sequencing also makes it possible to analyze the transcriptome and genome of a species in full detail, so it is also called deep sequencing.

What is genome re-sequencing?

Genome re-sequencing sequences the genomes of individuals of a species whose genome sequence is already known, and analyzes differences at the individual or population level. As the cost of genome sequencing falls, research on disease-causing mutations in humans has extended from exon regions to the whole genome. By constructing insert libraries of different lengths and combining short reads with a paired-end sequencing strategy, it is possible to detect common, low-frequency, and even rare disease-associated mutation sites, as well as structural variations, at the whole-genome level. This has significant scientific and industrial value.

What is de novo sequencing?

De novo sequencing, also known as sequencing from scratch, can sequence a species without any existing sequence data; bioinformatics methods are used to splice and assemble the sequences and obtain the genome map of the species. Obtaining the whole genome sequence of a species is an important shortcut to understanding it faster. With the rapid development of next-generation sequencing, the cost and time of genome sequencing are far lower than with traditional technology, large-scale genome sequencing is becoming routine, and genomics research has gained new opportunities for development and breakthroughs. With high-throughput, efficient next-generation sequencing and strong bioinformatics capabilities, the genome sequences of all organisms can be determined and analyzed efficiently and at low cost.

What is exome sequencing (whole exome sequencing, WES)?

Exome sequencing is a genomic analysis method that uses sequence capture techniques to capture and enrich DNA from the exon regions of the whole genome for high-throughput sequencing. It is less costly than whole-genome sequencing and has great advantages for studying SNPs and indels in known genes, but it cannot be used to study genomic structural variations such as chromosomal breakage and rearrangement.

What is mRNA sequencing (RNA-seq)?

Transcriptomics is a discipline, developed after genomics, that studies the types and copy numbers of all RNAs (including mRNA and non-coding RNA) that a particular cell can transcribe in a given functional state. The mRNA sequencing technology provided by Illumina can be used across the whole mRNA field for a variety of studies and new discoveries. mRNA sequencing does not require designing primers or probes and provides unbiased, authoritative information about transcription. With a single experiment, researchers can quickly generate complete sequence information for all poly-A-tailed RNA and obtain comprehensive transcriptome information, such as gene expression, cSNPs, novel transcripts, novel isoforms, splice sites, allele-specific expression, and rare transcripts. Simple sample preparation and data analysis software support mRNA sequencing studies in all species.

What is small RNA sequencing?

Small RNAs (microRNAs, siRNAs, and piRNAs) are important regulators of life processes and play an important role in the regulation of gene expression, development, metabolism, and disease. Illumina sequencing can be used for deep sequencing and quantitative analysis of all small RNAs in cells or tissues. In the experiment, small RNAs in the 18-30 nt range are isolated from total RNA, specific adapters are ligated to both ends, the RNA is reverse transcribed into cDNA, and the DNA fragments are then sequenced directly on the sequencer. Large-scale Illumina sequencing of small RNAs yields a genome-wide miRNA map for the species and supports applications such as discovery of novel miRNAs, prediction and identification of their target genes, differential expression analysis between samples, miRNA clustering, and expression profiling.

What is miRNA sequencing?

A mature microRNA (miRNA) is a single-stranded non-coding RNA molecule of 17-24 nt that interacts with target mRNA to affect its stability and translation, ultimately inducing gene silencing and regulating biological processes such as gene expression and cell growth and development. Based on second-generation sequencing, microRNA sequencing can obtain millions of microRNA sequences at a time and can quickly identify known and unknown microRNAs and their expression differences across tissues, developmental stages, and disease states. It provides a powerful tool for studying the role of microRNAs in cellular processes and their biological effects.

What is ChIP-seq?

Chromatin immunoprecipitation (ChIP), also called binding-site analysis, is a powerful tool for studying protein-DNA interactions in vivo. It is commonly used to study transcription factor binding sites or histone-specific modification sites. ChIP-seq, which combines ChIP with second-generation sequencing, can efficiently detect, across the whole genome, the DNA segments that interact with histones and transcription factors.

The principle of ChIP-seq is as follows: first, DNA fragments bound to the target protein are specifically enriched by chromatin immunoprecipitation (ChIP), purified, and used to construct a library; the enriched DNA fragments are then sequenced. The millions of sequence tags obtained are mapped precisely to the genome, yielding genome-wide information on the DNA segments that interact with histones, transcription factors, and similar proteins.

What is ChIRP-seq?

ChIRP-seq (chromatin isolation by RNA purification) is a high-throughput sequencing method for detecting DNA and proteins bound to an RNA. Biotin- or streptavidin-labeled probes are designed to pull down the target RNA; the DNA fragments that interact with it are captured on the beads along with it, and these chromatin fragments are then sequenced at high throughput. This reveals which regions of the genome the RNA binds to. However, because protein sequencing technology is not yet mature enough, there is no way to identify the proteins bound to the RNA.

What is RIP-seq?

RNA immunoprecipitation (RIP) is a technique for studying the binding of RNA to proteins in cells. It is a powerful tool for understanding the dynamics of post-transcriptional regulatory networks and helps identify the regulatory targets of miRNAs. The technique uses an antibody against the target protein to precipitate the corresponding RNA-protein complex, which is then separated and purified so that the RNA bound in the complex can be sequenced and analyzed.

RIP can be regarded as the RNA counterpart of the widely used chromatin immunoprecipitation (ChIP) technique. However, because the object of study is an RNA-protein complex rather than a DNA-protein complex, the optimized conditions for RIP differ from those for ChIP (for example, the complex does not need to be fixed, the reagents and antibodies in the RIP reaction must be free of RNases, and the antibodies need to be validated in RIP experiments). Combining RIP with downstream microarray technology, known as RIP-chip, helps us better understand global changes in RNA in cancer and other diseases.

What is CLIP-seq?

CLIP-seq, also known as HITS-CLIP, combines UV crosslinking immunoprecipitation with high-throughput sequencing (crosslinking-immunoprecipitation and high-throughput sequencing). It is a revolutionary technique that reveals the interactions of RNA molecules with RNA-binding proteins at the whole-genome level. The main principle is as follows: under ultraviolet irradiation, RNA molecules become coupled to the RNA-binding proteins bound to them; the RNA-protein complexes are precipitated with an antibody specific to the RNA-binding protein; the RNA fragments are recovered; and after adapter ligation, RT-PCR, and other steps, these molecules are sequenced at high throughput. Bioinformatic analysis then uncovers the specific patterns, revealing how RNA-binding proteins regulate RNA molecules and what this means for the organism.

What is chromosome conformation capture?

3C is usually used to scan the tens to hundreds of kb of genome surrounding a promoter, a gene, or a short genomic fragment in order to find the regions that interact with it. Because the experiment requires specific primers, it is rather laborious and has a small detection range.

4C, like 3C, examines interactions from a single locus, but it extends detection to the entire genome. Its main innovation is the introduction of inverse PCR, so primers only need to be designed for that single locus.

5C detects interactions between loci across two large regions, and can reach the 10 Mb scale. It still needs primers, and primer design is one of the technical difficulties.

Hi-C achieves genome-wide detection, but obtaining high resolution requires very deep sequencing.

ChIA-PET captures the chromatin interactions mediated by a specific protein factor. The technique combines paired-end tag sequencing with ChIP: crosslinked DNA fragments bound to the protein of interest are enriched, so that the long-range chromatin interactions associated with a specific transcription factor can be determined across the whole genome, with high specificity and high resolution.

What is Hi-C-assisted genome assembly?

Hi-C-assisted genome assembly starts from a draft genome sequence (assembled with second- or third-generation sequencing data or optical maps) and a known chromosome number. Hi-C sequencing data are used to divide the draft sequences into chromosome groups and to determine the order and orientation of each sequence on its chromosome. The technique raises the assembly to chromosome level.

What is metagenomics (the macro genome)?

Metagenomics studies the whole microbial community. Compared with traditional single-organism studies it has many advantages, two of which are important: (1) microorganisms usually coexist as a community within a niche, and many of their characteristics depend on interactions between the whole community environment and the individuals, so studying the metagenome reveals those characteristics better than studying individuals; and (2) metagenomic studies do not require isolating individual bacteria, so microorganisms that cannot be isolated and cultured in the laboratory can also be studied.

Metagenomics is a new research direction within genomics. Metagenomics (also known as environmental genomics, ecological genomics, and so on) is a discipline that studies genetic material extracted directly from environmental samples. Traditional microbial research relies on laboratory culture, and the rise of metagenomics fills the gap for microorganisms that cannot be cultured in traditional laboratories. Over the past few years, advances in DNA sequencing and improvements in sequencing throughput and analytical methods have given a first glimpse of this previously unknown area of genomics.

What is SNP, SNV (single nucleotide variation)?

Single nucleotide polymorphism (SNP), or single nucleotide variation (SNV), is a polymorphism caused by variation of a single nucleotide (substitution, insertion, or deletion) at the same position in the genomic DNA sequences of different individuals. That is, different individuals of the same species differ in a single nucleotide at the same position of their genomic DNA. Loci and DNA sequences with such differences can be used as markers for genome mapping. On average there is about 1 SNP per 1000 nucleotides in the human genome; some may be associated with disease, but most are probably not. SNPs are an important basis for studying genetic variation in human families and in animal and plant strains. In cancer genome studies, a single-nucleotide variant that is specific to the tumor relative to normal tissue is a somatic mutation and is called an SNV.

What is an indel (small insertion or deletion in the genome)?

An indel is the insertion or deletion of a small fragment (<50 bp) in the genome; the concept is analogous to that of SNP/SNV.

What is copy number variation (CNV)?

Genomic copy number variation is a form of genomic variation in which large DNA fragments in the genome are present in an abnormal number of copies. For example, the normal copy number of human chromosomes is 2; if the copy number of some chromosomal region becomes 1 or 3, the region has lost or gained copies, and gene expression in that region is also affected. If a chromosome is divided into four regions A-B-C-D, then A-B-C-C-D, A-C-B-C-D, and A-C-C-B-C-D represent amplification of region C, and A-B-D represents deletion of region C. The amplified copy can be adjacent to the original, as in A-B-C-C-D, or located elsewhere, as in A-C-B-C-D.

What is structural variation (SV)?

Chromosomal structural variation refers to large-fragment changes on chromosomes. It mainly includes insertion and deletion of large chromosomal fragments (which cause CNV changes), inversion of a region within a chromosome, recombination between two chromosomes (inter-chromosomal translocation), and so on. SVs are generally displayed with the Circos software.

What is genotype and phenotype?

Genotype and phenotype usually refer here to the relationship between certain single-nucleotide loci and the traits they manifest.

What is a read?

The short sequences produced by a high-throughput sequencing platform are called reads. PE125 means paired-end sequencing with a read length of 125 bp.

What is a contig?

Assembly software joins reads based on the overlap regions between them; the sequence obtained by this stitching is called a contig (a contiguous group of overlapping reads). A contig contains no N.
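
To make the idea of overlap-based stitching concrete, here is a toy sketch that merges two reads when the end of one exactly matches the start of the other. It is not how any real assembler works internally (real tools tolerate sequencing errors and handle millions of reads); the function name and the `min_overlap` threshold are assumptions for illustration.

```python
def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Merge two reads if a suffix of `left` exactly equals a prefix of `right`.

    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):  # try the longest overlap first
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None

print(merge_reads("ACGTTGCA", "TGCAGGTC"))  # ACGTTGCAGGTC
```

The merged sequence contains only called bases, which matches the point that a contig has no N.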

What is a scaffold?

In de novo genome sequencing, after contigs are obtained by assembling reads, it is often also necessary to build a 454 paired-end library or an Illumina mate-pair library to obtain the sequences at both ends of fragments of a given size (such as 3 kb, 6 kb, 10 kb, or 20 kb). From these sequences the order of some contigs can be determined; contigs whose order is known in this way make up a scaffold (which contains N).

What is contig N50?

Assembling reads produces contigs of different lengths. Add up all the contig lengths to get the total contig length. Then sort the contigs from longest to shortest, for example Contig 1, Contig 2, Contig 3, ..., Contig 25. Add the contig lengths in this order; when the running sum reaches half of the total contig length, the length of the last contig added is the contig N50. Example: if Contig 1 + Contig 2 + Contig 3 + Contig 4 = total contig length * 1/2, then the length of Contig 4 is the contig N50. Contig N50 can be used as a criterion for judging the quality of a genome assembly.
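
The procedure described above translates almost directly into code. The sketch below is a minimal version with an invented function name and made-up lengths; the same function applies unchanged to scaffold lengths.

```python
def n50(lengths):
    """Length of the contig at which the running sum, taken from the
    longest contig downward, first reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached half of the total
            return length
    raise ValueError("no lengths given")

# Total is 300; 80 + 70 = 150 reaches half, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```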

What is scaffold N50?

Scaffold N50 is defined in the same way as contig N50. Assembling contigs produces scaffolds of different lengths. Add up all the scaffold lengths to get the total scaffold length. Then sort the scaffolds from longest to shortest, for example Scaffold 1, Scaffold 2, Scaffold 3, ..., Scaffold 25. Add the scaffold lengths in this order; when the running sum reaches half of the total scaffold length, the length of the last scaffold added is the scaffold N50. Example: if Scaffold 1 + Scaffold 2 + Scaffold 3 + Scaffold 4 + Scaffold 5 = total scaffold length * 1/2, then the length of Scaffold 5 is the scaffold N50. Scaffold N50 can be used as a criterion for judging the quality of a genome assembly.

What is sequencing depth and coverage?

Sequencing depth is the ratio of the total number of bases sequenced to the size of the genome being sequenced. For example, for a genome size of 2 Mb and a sequencing depth of 10X, the total amount of data obtained is 20 Mb. Coverage is the proportion of the whole genome that is covered by the sequence obtained. Because of complex structures in the genome such as high-GC regions and repeats, the sequence obtained after final assembly often cannot cover every region; the uncovered regions are called gaps. For example, if a bacterial genome is sequenced with 98% coverage, then 2% of the sequence was not obtained.
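
The two ratios can be written out as a small worked example. The function names below are illustrative only, and the covered-base count in the coverage example is a made-up number chosen to reproduce the 98% figure.

```python
def sequencing_depth(total_bases: int, genome_size: int) -> float:
    """Depth = total sequenced bases / genome size."""
    return total_bases / genome_size

def bases_needed(genome_size: int, depth: int) -> int:
    """Total data required to reach a target depth."""
    return genome_size * depth

def coverage(covered_bases: int, genome_size: int) -> float:
    """Fraction of the genome covered by the obtained sequence."""
    return covered_bases / genome_size

print(bases_needed(2_000_000, 10))              # 20000000 bases, i.e. 20 Mb
print(sequencing_depth(20_000_000, 2_000_000))  # 10.0
print(coverage(1_960_000, 2_000_000))           # 0.98
```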

What is RPKM, FPKM?

RPKM, reads per kilobase of exon model per million mapped reads, is defined in this way [Mortazavi et al., 2008]:

The number of reads mapped per kilobase of exon, per million mapped reads.

Suppose 1 million reads are mapped to the human genome. For each exon we ask how many reads map to it; because exons differ in length, we then ask how many reads map per kilobase. This is roughly the intuitive interpretation of RPKM.

For a particular gene, it is the number of reads mapped to that gene's exons, per kilobase of exon, per million mapped reads.

Total exon reads:

The number in the column with the header Total exon reads in the row for the gene. This is the number of reads that have been mapped to a region in which an exon is annotated for the gene, or across the boundaries of two exons or of an intron and an exon, for an annotated transcript of the gene. For eukaryotes, exons and their internal relationships are defined by annotations of type mRNA. In other words, this is the total number of reads mapped to the exons of the gene, including reads that span exon boundaries.

Exon length:

The number in the column with the header Exon length in the row for the gene, divided by 1000. This is calculated as the sum of the lengths of all exons annotated for the gene. Each exon is included only once in this sum, even if it is present in more than one annotated transcript for the gene. Partly overlapping exons count with their full length, even though they share the same region. In other words, the exon length is the sum of the lengths of all annotated exons of the gene, with each exon counted once.

Mapped reads:

The sum of the numbers in the column with the header Total gene reads. The total gene reads for a gene is the total number of reads that, after mapping, have been mapped to the region of the gene. This includes all the reads uniquely mapped to the region of the gene as well as those reads that match in more places (below the limit set in the dialog in Figure 18.110) and have been allocated to this gene's region. A gene's region comprises the flanking regions (if specified in Figure 18.110), the exons, the introns, and the exon-exon boundaries of all transcripts annotated for the gene. Thus, the sum of the total gene reads numbers is the number of mapped reads for the sample (you can find the number in the RNA-seq report). In other words, mapped reads is the total number of reads mapped to genes, including all reads uniquely mapped to each gene's region.

Example: if 1,000 reads map to the gene, the total number of mapped reads is 1 million, and the total exon length of the gene is 5 kb, then its RPKM is 10^9 * 1000 (reads) / (10^6 (total reads) * 5000 (exon length)) = 200. Equivalently: 1000 (reads) / (1 (million reads) * 5 (kb)) = 200. This value reflects the expression level of the gene.
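
The worked example can be checked with a short function. This is a sketch of the formula as stated in this section; the function and parameter names are chosen for illustration and do not come from any particular RNA-seq package.

```python
def rpkm(gene_reads: int, total_mapped_reads: int, exon_length_bp: int) -> float:
    """RPKM = 10^9 * reads mapped to the gene's exons
              / (total mapped reads * exon length in bp)."""
    return 1e9 * gene_reads / (total_mapped_reads * exon_length_bp)

# 1,000 reads on the gene, 1 million mapped reads, 5 kb of exon.
print(rpkm(1000, 1_000_000, 5000))  # 200.0
```

Doubling the exon length or the total number of mapped reads halves the value, which is the normalization that makes genes and samples comparable.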

FPKM (fragments per kilobase of exon per million fragments mapped)

FPKM and RPKM are calculated in essentially the same way. The difference is that FPKM counts fragments, while RPKM counts reads. A fragment is a broader notion than a read: it can be the pair of reads from one paired-end fragment, or a single read.

What is transcript reconstruction?

Transcripts are assembled from sequencing data. There are two kinds of assembly methods: (1) de novo construction and (2) reference-genome-based reconstruction. De novo assembly does not rely on a reference genome: overlapping reads are joined into longer sequences and extended step by step into contigs and scaffolds. Common tools include Velvet, Trans-ABySS, and Trinity. Reference-based reconstruction first maps the reads back to the genome and then reconstructs transcripts on the genome from information such as read coverage and junction sites. Common tools include Scripture and Cufflinks.

What is an expression profile?

Gene expression profile: an unbiased cDNA library is constructed from cells or tissue in a specific state, the cDNA is sequenced on a large scale, and the cDNA sequence fragments are collected and analyzed qualitatively and quantitatively for the composition of the mRNA population. This describes the types of genes expressed and their abundance in a particular cell or tissue in a particular state; the resulting data table is called a gene expression profile.

What is comparative genomics?

Comparative genomics compares known genes and genome structures, on the basis of genome mapping and sequencing, to understand gene function, expression mechanisms, and the evolution of species. It uses the homology in coding sequence and structure between model organism genomes and the human genome to clone human disease genes, reveal gene function and the molecular mechanisms of disease, and clarify the evolutionary relationships of species and the intrinsic structure of genomes.

What is genome annotation?

Genome annotation uses bioinformatics methods and tools to annotate, at high throughput, the biological functions of all genes in a genome; it is a hotspot in functional genomics research. Genome annotation covers two aspects: gene identification and gene function annotation. The core of gene identification is determining the exact location of every gene in the whole genome sequence.

IV. Main issues to pay attention to

1. Building a library

The genome is fragmented by the shotgun method, which is commonly known as building a library, and gel electrophoresis is then used to separate fragments of different lengths. For short libraries today, sizes of 180 bp, 200 bp, or 300 bp are typical. The 180 and 300 here are the lengths of the fragments to be sequenced. The read length of the sequencer, on the other hand, is fixed, for example 110, 125, or 450. The company now uses a 220 bp library with a read length of 125 bp; because sequencing is paired-end, there is a 30 bp overlap region (a necessary condition for later assembly with ALLPATHS-LG).
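
The 30 bp figure follows from simple arithmetic: two reads of 125 bp sequenced inward from both ends of a 220 bp fragment cover 250 bp in total, so they overlap by 250 - 220 = 30 bp. A minimal sketch, with an illustrative function name:

```python
def paired_end_overlap(insert_size: int, read_length: int) -> int:
    """Overlap between the two reads of a pair sequenced inward from both
    ends of a fragment; 0 when the reads do not reach each other."""
    return max(0, 2 * read_length - insert_size)

print(paired_end_overlap(220, 125))  # 30
print(paired_end_overlap(300, 125))  # 0: the reads do not overlap
```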

2. Filtering

3. Assessment

After this data processing is done, the insert size generally needs to be evaluated. The insert size is in effect the size of the library: for a 300 bp library, the insert is 300 bp. At the current level of sequencing technology, however, random error and outright mistakes are both unavoidable. Random error means that although the insert is nominally 300 bp, only the average is 300 bp, with a spread of a few tens of bp. Error of that kind is usually acceptable. Mistakes have to be found: if the insert size deviates seriously from 300 bp, the library has failed. The usual test is to assemble the genome from the data, align the reads to the assembly with SOAP, and then plot the distribution from the alignment. Someone may ask whether the insert size can only be evaluated after assembly is finished. I do not know of another way, which may simply reflect my own limited knowledge.
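
Once the reads have been aligned and an insert size is available for each pair, the check described above amounts to comparing the mean and spread with the expected library size. The sketch below assumes the insert sizes have already been extracted into a list; the function name, the example values, and the `tolerance` threshold are all hypothetical and should be chosen per project.

```python
from statistics import mean, stdev

def insert_size_summary(sizes, expected, tolerance):
    """Summarize observed insert sizes and flag a library whose mean
    deviates from the expected size by more than `tolerance` bp.

    `tolerance` is a hypothetical cutoff, not a published standard.
    """
    m = mean(sizes)
    sd = stdev(sizes)  # sample standard deviation, needs at least 2 values
    return {"mean": m, "sd": sd, "ok": abs(m - expected) <= tolerance}

observed = [290, 305, 310, 295, 300]  # made-up example values
summary = insert_size_summary(observed, expected=300, tolerance=30)
print(summary["mean"], summary["ok"])  # 300 True
```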

4. Third-generation sequencing technology

Third-generation sequencing refers to single-molecule sequencing. No PCR amplification is needed, and each DNA molecule is sequenced individually. Third-generation sequencing is also called de novo sequencing technology, i.e. single-molecule real-time DNA sequencing.

It mainly relies on single-molecule fluorescence: no amplification is needed, each molecule emits its own light signal, and that signal is monitored and read in real time. The technical difficulty is therefore how to build an environment in which the signal from a single nucleic acid molecule can be detected on its own.

At present, third-generation data are mainly used in two directions in the research market: the first is genome assembly, and the other is the full-length transcriptome.

1. Error correction

Third-generation data contain random errors, so error correction cannot be avoided. Two tools are introduced here: PacBioToCA and ECTools. The first corrects errors using second-generation data; the second corrects errors using contigs.

2. Assembly

The second kind of software assembles third-generation data; Celera Assembler is recommended. There are of course other excellent tools, but they are harder to get running. As an aside, the official website says that assembling with third-generation data alone requires a depth of 40x.

3. Hybrid assembly

The third kind of software does hybrid assembly, that is, it assembles second-generation and third-generation data together. For this software, the official website requires a depth of 20x.

4. Gap filling

The fourth kind of software is my favorite: it uses the long third-generation reads to fill gaps in assemblies built from second-generation data and to connect contigs into scaffolds. The recommended software is PBJelly. The official website requires a depth of 5x.

5. Nanopore sequencing technology
