Quality control of sequencing data

Source: Internet
Author: User

Based on sequencing by synthesis (SBS) technology, the Illumina HiSeq 2500 high-throughput sequencing platform sequences cDNA libraries and produces a large number of high-quality reads. The reads, or bases, produced by the sequencing platform are known as raw data, and most of their base quality scores reach or exceed Q30. Raw data is typically provided in FASTQ format. The raw data for each sequenced sample consists of two FASTQ files, one for each end of the cDNA fragments, each containing the reads measured from that end.

FASTQ format files are as follows:

FASTQ Format Files

Note: A FASTQ file typically uses four lines per sequence record. The first line begins with "@", followed by the sequence identifier (ID) and optional descriptive information. The second line is the base sequence, i.e. the read. The third line begins with "+", followed by optional descriptive information. The fourth line holds the quality score code for each base of the read, and its length must equal the length of the read sequence.
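The four-line layout described in the note can be read with a small parser. The following is a minimal sketch, not the pipeline's actual code: the names `FastqRecord` and `parse_fastq` and the example record are made up for illustration, and it assumes that sequences are not wrapped across several lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str      # text after "@" up to the first space
    description: str  # optional text after the ID (may be empty)
    sequence: str     # base sequence (the read)
    quality: str      # one quality character per base


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per 4-line unit; assumes no line wrapping."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"header line must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"third line must start with '+': {plus!r}")
        if len(quality) != len(sequence):
            raise ValueError("quality and sequence lengths differ")
        read_id, _, description = header[1:].partition(" ")
        yield FastqRecord(read_id, description, sequence, quality)


# Made-up record for illustration only.
EXAMPLE = "@read1 sample=demo\nGATTACA\n+\nIIIIHH#\n"
records = list(parse_fastq(io.StringIO(EXAMPLE)))
print(records[0].read_id, records[0].sequence, len(records[0].quality))
```

The length check mirrors the rule that the quality line must be as long as the read; a record that violates it raises an error instead of being silently accepted.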

    • Sequencing base quality scores

The base quality score (quality score or Q-score) is an integer mapping of the probability of a base-calling error. The commonly used Phred base quality score formula is:

$$Q = -10 \log_{10} p$$

In the formula, p is the probability of a base-calling error. The following table shows the correspondence between the base quality score and the base-calling error probability:

Phred quality score    Probability of incorrect base call    Base call accuracy
-------------------    ----------------------------------    ------------------
Q10                    1/10                                  90%
Q20                    1/100                                 99%
Q30                    1/1000                                99.9%
Q40                    1/10000                               99.99%

The higher the base quality score, the more reliable the base call and the lower the chance that the base was called incorrectly. For example, at a quality score of Q20, on average 1 in 100 base calls is wrong; at Q30, 1 in 1,000 is wrong; and at Q40, 1 in 10,000 is wrong.
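The relationship between quality score and error probability can be checked numerically. The sketch below assumes the standard Phred definition Q = -10·log10(p), which reproduces the table above. The function names are illustrative, and the ASCII offset of 33 used to decode quality characters (Phred+33) is an assumption, since the encoding is not stated here.

```python
import math


def error_probability(q: float) -> float:
    """Probability of a wrong base call for Phred quality score q."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Phred quality score for base-calling error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string; offset 33 is an assumption (Phred+33)."""
    return [ord(ch) - offset for ch in quality]


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: p = {p:g}, accuracy = {(1 - p) * 100:g}%")
```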

The distribution of base quality scores across all reads of a single sample can be viewed per sequencing cycle, to examine both the quality at each cycle and the overall sequencing quality of the sample.

Base quality score distribution map

Note: The horizontal axis is the position of the base along the read, and the vertical axis is the base quality score. Color depth represents the proportion of bases: the darker the color, the larger the proportion of bases at that position with the corresponding quality score, and vice versa.
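The data behind such a distribution map is, for each read position, the fraction of bases carrying each quality score. A minimal sketch of that tabulation follows; `quality_distribution` is a hypothetical helper, the quality strings are made up, and Phred+33 encoding is assumed.

```python
from collections import Counter


def quality_distribution(quality_strings, offset=33):
    """Return, per read position, {quality score: fraction of bases}."""
    counts = []  # counts[i] is a Counter of the scores seen at position i
    for quality in quality_strings:
        for i, ch in enumerate(quality):
            if i == len(counts):
                counts.append(Counter())
            counts[i][ord(ch) - offset] += 1
    return [
        {score: n / sum(c.values()) for score, n in sorted(c.items())}
        for c in counts
    ]


# Made-up quality strings for three reads of unequal length.
dist = quality_distribution(["II#", "I5#", "II"])
for position, fractions in enumerate(dist, start=1):
    print(position, fractions)
```

Each dictionary corresponds to one column of the map, and the fractions would drive the color depth.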

    • Sequencing quality control

The sequencing reads in the FASTQ files need to be aligned to the specified reference genome so that the cDNA fragments can be located on the genome or on genes. Before alignment, the reads must be confirmed to be of sufficiently high quality to ensure the accuracy of subsequent analysis. The sequencing quality control steps are as follows:

(1) Remove sequencing adapter and primer sequences;

(2) Filter out low-quality data to ensure data quality.

The high-quality reads or bases obtained after this series of quality control steps are known as clean data. Clean data is also provided in FASTQ format.
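The two quality control steps can be sketched as follows. This is a simplified illustration, not the filtering actually used for this project. The adapter sequence, the thresholds (`min_q=20`, `max_low_fraction=0.5`), the Phred+33 offset, and all function names are assumptions. Dedicated trimming tools also handle mismatches and partial adapter matches, which this exact-match sketch does not.

```python
def trim_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    cut = sequence.find(adapter)
    if cut == -1:
        return sequence, quality
    return sequence[:cut], quality[:cut]


def passes_quality(quality, min_q=20, max_low_fraction=0.5, offset=33):
    """True unless more than max_low_fraction of bases fall below min_q."""
    if not quality:
        return False  # nothing left after trimming
    low = sum(1 for ch in quality if ord(ch) - offset < min_q)
    return low / len(quality) <= max_low_fraction


def clean_reads(records, adapter):
    """Yield (id, sequence, quality) tuples that survive both QC steps."""
    for read_id, sequence, quality in records:
        sequence, quality = trim_adapter(sequence, quality, adapter)
        if passes_quality(quality):
            yield read_id, sequence, quality


# Assumed adapter and made-up reads, for illustration only.
ADAPTER = "AGATCGGAAGAGC"
raw = [
    ("r1", "ACGTACGT" + ADAPTER + "TT", "I" * 23),
    ("r2", "ACGTACGT", "########"),
    ("r3", "GGCCTTAA", "IIIIII##"),
]
clean = list(clean_reads(raw, ADAPTER))
print(clean)
```

In this example r1 is kept after its adapter is trimmed, r2 is discarded because every base is of low quality, and r3 is kept because only a quarter of its bases fall below the threshold.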

    • Sequencing data output statistics

The output statistics of each sample of a project are shown in the following table:

Table 2 Sample sequencing data assessment

Samples    ID     Read Number    Base Number       GC Content    %≥Q30
-------    ---    -----------    --------------    ----------    ------
P1         T01    38,244,560     9,634,612,093     56.51%        88.21%
P2         T02    35,589,383     8,965,818,243     55.97%        89.17%
M1         T03    107,654,187    27,121,886,596    56.14%        88.29%
M2         T04    105,334,106    26,537,613,616    56.48%        89.13%

Note: Samples: sample name as given in the sample information sheet; ID: sample number; Read Number: total number of paired-end reads in the clean data; Base Number: total number of bases in the clean data; GC Content: percentage of G and C bases among all bases in the clean data; %≥Q30: percentage of bases in the clean data with a quality score greater than or equal to 30.
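The columns defined in the note can be computed directly from clean reads. The sketch below is illustrative only: `sequencing_stats` is a hypothetical helper, the reads are made up, and Phred+33 encoding is assumed. It counts individual reads, so for paired-end data the pair count reported in the table would correspond to running it on one file of the pair.

```python
def sequencing_stats(reads, offset=33):
    """Summarize (sequence, quality) pairs as in the assessment table."""
    read_number = 0
    base_number = 0
    gc_bases = 0
    q30_bases = 0
    for sequence, quality in reads:
        read_number += 1
        base_number += len(sequence)
        gc_bases += sum(1 for base in sequence.upper() if base in "GC")
        q30_bases += sum(1 for ch in quality if ord(ch) - offset >= 30)
    return {
        "read_number": read_number,
        "base_number": base_number,
        "gc_content": 100 * gc_bases / base_number if base_number else 0.0,
        "pct_q30": 100 * q30_bases / base_number if base_number else 0.0,
    }


# Made-up reads: "I" encodes Q40 and "5" encodes Q20 under Phred+33.
stats = sequencing_stats([("GGCA", "III5"), ("ATGC", "II55")])
print(stats)
```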

    • Alignment of transcriptome data to the reference genome sequence

After the clean reads are obtained, they are aligned to the reference genome to obtain their positions on the reference genome or on genes, as well as sequence features specific to the sequenced sample.

TopHat2 is an efficient sequence alignment program. Built on the high-throughput read aligner Bowtie, it aligns transcriptome sequencing reads to the genome and then identifies splice junctions between exons by analyzing the alignment results. This not only provides a data foundation for alternative splicing analysis, but also allows more reads to be aligned to the reference genome, improving the utilization of the sequencing data.

In transcriptome sequencing data, only reads that align to the reference genome can be used for subsequent analysis. Reads aligned to the specified reference genome are therefore referred to as mapped reads.
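As a rough illustration of how one paired-end sample might be passed to TopHat2, the sketch below builds the command line without running it. The helper name, file names, index prefix, and thread count are all hypothetical. Only the `-p` (threads) and `-o` (output directory) options are used, and the exact options run for this project are not given here.

```python
import shlex


def tophat2_command(index_prefix, reads_1, reads_2, output_dir, threads=4):
    """Build (but do not run) a TopHat2 command line for one paired-end sample."""
    return [
        "tophat2",
        "-p", str(threads),   # number of alignment threads
        "-o", output_dir,     # directory for the alignment output
        index_prefix,         # prefix of the Bowtie index of the reference
        reads_1,              # clean reads from the first end
        reads_2,              # clean reads from the second end
    ]


# Hypothetical file names; adjust to the actual project layout.
cmd = tophat2_command("ref/genome", "T01_1.fq", "T01_2.fq", "T01_tophat")
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` once the paths are known to exist.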

    • Alignment efficiency statistics

Alignment efficiency, the percentage of clean reads that are mapped reads, is the most direct indicator of how well the transcriptome data is utilized. Besides sequencing quality, alignment efficiency also depends on the quality of the reference genome assembly and on how closely the reference genome and the sequenced sample are related taxonomically (for example, whether they belong to the same subspecies). Alignment efficiency can therefore be used to assess whether the selected reference genome assembly meets the needs of the analysis, and how reliable the later data analysis will be.

The alignment results for each sample's sequencing data against the selected reference genome are shown in the following table:

Table 3 Statistics of clean data alignment to the reference genome

BMK-ID    Total Reads    Mapped Reads    Mapped Ratio    Uniq Mapped Reads    Uniq Mapped Ratio
------    -----------    ------------    ------------    -----------------    -----------------
T01       76,489,120     58,156,112      76.03%          53,604,920           70.08%
T02       71,178,766     53,874,310      75.69%          50,672,244           71.19%
T03       215,308,374    158,709,127     73.71%          149,083,989          69.24%
T04       210,668,212    156,816,037     74.44%          147,663,070          70.09%

Note: BMK-ID: sample number; Total Reads: number of clean reads, counted per single end; Mapped Reads: number of reads aligned to the reference genome; Mapped Ratio: percentage of clean reads aligned to the reference genome; Uniq Mapped Reads: number of reads aligned to a unique position on the reference genome; Uniq Mapped Ratio: percentage of clean reads aligned to a unique position on the reference genome.
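Both ratios in the table are the corresponding read counts divided by Total Reads. A small sketch, with the hypothetical helper `mapping_ratios` and the counts for sample T01 taken from the table:

```python
def mapping_ratios(total_reads, mapped_reads, uniq_mapped_reads):
    """Return (mapped ratio, uniquely mapped ratio) as percentages."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return (
        100 * mapped_reads / total_reads,
        100 * uniq_mapped_reads / total_reads,
    )


# Counts for sample T01 from the table above.
mapped, uniq = mapping_ratios(76_489_120, 58_156_112, 53_604_920)
print(f"Mapped Ratio: {mapped:.2f}%  Uniq Mapped Ratio: {uniq:.2f}%")
```

For T01 this reproduces the tabulated 76.03% and 70.08%.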

    • Plotting the alignment results

The distribution of mapped reads on the selected reference genome is plotted from statistics on the positions of the reads on the different chromosomes.

The coverage depth distribution of the mapped reads of sample T01 on part of the reference genome is shown below:

Distribution of positions and coverage depth of mapped reads on the reference genome

Note: The horizontal axis is the chromosome position, and the vertical axis is the base-2 logarithm of the coverage depth. Each chromosome is divided into windows of 10 kb, and the number of mapped reads falling in each window is counted as that window's coverage depth.
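The windowed coverage depth described in the note can be computed as sketched below. The read start positions are made up, reads are assigned to a window by their start position only (a simplifying assumption), and the vertical axis is taken to be log2 of the read count per window. Empty windows are left out because their logarithm is undefined.

```python
import math
from collections import Counter

WINDOW = 10_000  # 10 kb window, as in the note above


def window_depth(read_starts, window=WINDOW):
    """Count mapped reads per window, keyed by window index."""
    return Counter(start // window for start in read_starts)


def log2_depth(depths):
    """Base-2 logarithm of each non-empty window's read count."""
    return {w: math.log2(n) for w, n in sorted(depths.items())}


# Made-up 0-based start positions of mapped reads on one chromosome.
starts = [120, 9_999, 10_000, 10_500, 12_345, 18_000, 31_000]
depths = window_depth(starts)
print(dict(sorted(depths.items())))
print(log2_depth(depths))
```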

In theory, reads derived from mature mRNA should align to exon regions. In practice, however, a subset of reads aligns to intron and intergenic regions, for the following reasons:

(1) During sample extraction, incompletely spliced mRNA (i.e. pre-mRNA) that carries a poly(A) tail but still retains introns is also captured, so reads from intronic fragments align to intron regions;

(2) Genome annotation errors, in which a true exon region is annotated as an intron region, or vice versa;

(3) A low level of genome annotation. When a genome has been annotated using transcriptome sequencing data, that sequencing cannot cover every time point and tissue. Genes that were unexpressed or weakly expressed in the data used for annotation may therefore happen to be detected at higher abundance in the samples of this project. Reads from such genes align to regions of the genome that lack annotation, which is one basis for discovering new genes and new transcripts;

(4) Differences between the sequenced sample and the reference genome, for example a mutation in the sample that creates a new transcription start site and hence a gene specific to the sample, or a splice-site difference that produces a new transcript. This is also one basis for discovering new transcripts.

The number of mapped reads falling in the different regions (exon, intron, and intergenic) of the specified reference genome is counted, and a histogram of the distribution of each sample's mapped reads across these regions is plotted, as follows:

Histogram of reads distribution in different regions of the genome

Note: Each bar represents one sample. The pink area is the exon region, the green area is the intergenic region, and the blue area is the intron region. The height of each area is the percentage of all mapped reads that align to that region.
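The percentages behind such a histogram can be obtained by classifying each mapped read against the gene annotation. The sketch below is a deliberately simplified illustration. The annotation format, the helper names, and the positions are hypothetical. Each read is represented by a single position, and anything outside an annotated gene is labeled "intergenic".

```python
from collections import Counter


def classify(position, genes):
    """Label a position as 'exon', 'intron', or 'intergenic'.

    genes: list of (gene_start, gene_end, [(exon_start, exon_end), ...])
    with 0-based, half-open coordinates.
    """
    for gene_start, gene_end, exons in genes:
        if gene_start <= position < gene_end:
            if any(start <= position < end for start, end in exons):
                return "exon"
            return "intron"
    return "intergenic"


def region_percentages(read_positions, genes):
    """Percentage of mapped reads falling in each region type."""
    counts = Counter(classify(pos, genes) for pos in read_positions)
    total = sum(counts.values())
    regions = ("exon", "intron", "intergenic")
    if total == 0:
        return {region: 0.0 for region in regions}
    return {region: 100 * counts[region] / total for region in regions}


# Hypothetical annotation: one gene with two exons, plus made-up read positions.
GENES = [(1_000, 5_000, [(1_000, 2_000), (4_000, 5_000)])]
reads = [1_500, 1_800, 4_200, 4_900, 4_950, 4_999, 3_000, 2_500, 7_000, 100]
print(region_percentages(reads, GENES))
```

A linear scan over genes is enough for a toy annotation; for a whole genome an interval index would be the more practical choice.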
