FastQC and fastqc instructions

Last Update:2017-08-03 Source: Internet

Author: User

1. Download fastqc

wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip

2. Decompress

unzip fastqc_v0.11.5.zip

3. Grant execute permission. Without it, the shell refuses to run fastqc and reports that permission is denied.

cd FastQC
chmod 755 fastqc

4. Add to PATH

export PATH=/home/h/FastQC/:$PATH
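The export in step 4 only lasts for the current shell session, and running it twice adds the directory twice. A minimal sketch of a guard against that, assuming a POSIX shell and the install directory used above; the helper name path_prepend is made up for this illustration and is not part of FastQC.

```shell
# path_prepend DIR: put DIR at the front of PATH unless it is already listed.
# (hypothetical helper for illustration, not part of FastQC)
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, do nothing
        *) PATH="$1:$PATH" ;;        # otherwise prepend it
    esac
    export PATH
}

path_prepend /home/h/FastQC
```

Placing these lines in a shell startup file such as ~/.bashrc (assuming bash is the login shell) would make fastqc available in new sessions as well.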

5. Test

fastqc --help

Example

fastqc -o ./tmp.result/fastQC/ -t 6 ./tmp.data/fastq/H1EScell-dnase-2014-GSE56869_20151208_SRR1248176_1.fq
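The example above assumes the output directory already exists; fastqc does not create it. The following is a rough sketch of a wrapper that creates the directory and then runs fastqc on every .fq file in a folder. The function name run_fastqc_dir and the .fq suffix are assumptions made for this illustration.

```shell
# run_fastqc_dir IN_DIR OUT_DIR [THREADS]
# Sketch only: assumes fastqc is on PATH and the reads end in .fq.
run_fastqc_dir() {
    in_dir=$1
    out_dir=$2
    threads=${3:-1}

    # fastqc does not create the output directory itself
    mkdir -p "$out_dir" || return 1

    if ! command -v fastqc >/dev/null 2>&1; then
        echo "fastqc not found on PATH" >&2
        return 127
    fi

    # -t lets fastqc work on several input files at the same time
    fastqc -o "$out_dir" -t "$threads" "$in_dir"/*.fq
}

# Example call with the paths used in this article:
# run_fastqc_dir ./tmp.data/fastq ./tmp.result/fastQC 6
```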

# -o, --outdir: directory where FastQC writes the report files. The report file names are derived from the input file names.
# -t, --threads: number of threads used to run the program. Each thread is allocated its own share of memory. More threads finish sooner when several files are processed.

FastQC report overview

The report is divided into several parts. Each part is marked with a green check mark for a pass, "!" for a warning, and a red cross for a failure.

Basic information
# Encoding gives the sequencing platform version and the matching quality encoding version. This is needed when converting a Phred score back to the error probability P.
# Total Sequences is the number of reads in the input file.
# Sequence length is the read length.
# %GC is an indicator to watch. It is the GC content of all sequences together and is generally species specific; human cells, for example, are about 42%.

Sequencing quality statistics
# The horizontal axis is the position in the read, from the 1st base to the 101st base.
# The vertical axis is the quality score, Q = -10 * log10(error P), so Q20 means an error rate of 1% and Q30 means 0.1%.
# Each boxplot summarizes the sequencing quality of all sequences at that position.
# In each boxplot the upper whisker is the 90% quantile and the lower whisker is the 10% quantile. The middle of the box is the 50% quantile, the top of the box is the 75% quantile, and the bottom is the 25% quantile.
# The blue line is the mean quality at each position.
# In general the 10% quantile at every position should be above 20. This is what is commonly called the Q20 filter.
# In the results above, the bases after 87 bp therefore need to be trimmed off so that later analysis stays correct.
# Warning: raised if the lower quartile at any position is less than 10, or the median at any position is less than 25.
# Failure: raised if the lower quartile at any position is less than 5, or the median at any position is less than 20.

Per tile sequencing quality
# The horizontal axis is the same as before: each of the 101 base positions.
# The vertical axis is the tile index number.
# This plot is used to spot tiles that were affected by uncontrollable factors during the sequencing run and produced low quality.
# Blue means high sequencing quality and warm colors mean low quality. If a tile shows warm colors, all reads from that tile can be removed in later analysis.

Sequencing quality statistics of each sequence
# For a read of 101 bp, the mean of the Q values over its 101 positions is the quality value of that read.
# The horizontal axis runs from 0 to 40 and is the Q value.
# The vertical axis is the number of reads with each Q value.
# In this data the reads are concentrated at high scores, which shows that the sequencing quality is good.
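The relation Q = -10 * log10(P) behind the quality plots can be checked on the command line. This is a small sketch using awk; the helper name phred_to_p is made up for this illustration.

```shell
# phred_to_p Q: print the error probability for Phred score Q,
# using P = 10^(-Q/10). Hypothetical helper for illustration.
phred_to_p() {
    awk -v q="$1" 'BEGIN { printf "%g\n", exp(-q / 10 * log(10)) }'
}

phred_to_p 20   # Q20
phred_to_p 30   # Q30
```

The two calls print 0.01 and 0.001, matching the 1% and 0.1% error rates quoted for Q20 and Q30.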
GC content statistics
# The horizontal axis is positions 1-101 bp; the vertical axis is the percentage.
# The four lines are the average content of A, T, C and G at each position.
# In theory A and T should be equal, and G and C should be equal. The sequencer is unstable at the start of a run, though, so a mismatch at the first positions is common. In that case the start of each read needs to be cut even if the quality scores are very high, for example the first 5 bp.

Distribution of average GC content per sequence
# The horizontal axis is 0-100% GC; the vertical axis is the number of sequences with each GC content.
# The blue line is the theoretical distribution the program derives from the empirical distribution, and the red line is the actual distribution. The closer the two are, the better.
# When the red line shows an extra peak, the sample almost certainly contains DNA from another species.
# The plot for this data looks good.

Sequencing length statistics
# In theory every read from the sequencer should have exactly the same length, but there are always some deviations.
# Here 101 bp dominates. A small number of reads have other lengths, such as 100 bp, but so few that later analysis is not affected.
# If read lengths differ badly, the data produced in that sequencing run cannot be trusted.

Sequence Adapter
# This plot measures adapter content at both ends of the sequences.
# If the -a option is not given when running fastqc, the four generic adapter sequences shown in the legend are used by default.
# In this example the adapter has been removed.
# If the adapter sequence has not been removed, use the cutadapt software to remove it before later analysis.

Repeated short sequences
# This plot counts how often certain short sequences occur in the reads.
# At positions 1-8 bp, several short sequences in the legend occur very often. This usually means either that the adapter was not removed and the -a parameter was not used, or that the sequences themselves are highly repetitive, for example because of PCR bias during library construction.
# In this case, my solution is to cut off the start of each read, trying 5-8 bp.
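Cutting a few bases off the start of every read, as suggested above, can be illustrated with plain awk. This is only a sketch to show the idea and is not the cutadapt tool named in the article. It assumes every FASTQ record is exactly four lines with unwrapped sequences, and the helper name trim_head is made up.

```shell
# trim_head N < in.fq > out.fq
# Remove the first N characters from the sequence line and the quality line
# of every 4-line FASTQ record. Hypothetical helper for illustration.
trim_head() {
    awk -v n="$1" '
        # line 2 of each record is the sequence, line 4 is the quality string
        NR % 4 == 2 || NR % 4 == 0 { print substr($0, n + 1); next }
        # header and "+" lines pass through unchanged
        { print }
    '
}

# Example call (file names are placeholders):
# trim_head 5 < reads.fq > reads.trimmed.fq
```

Sequence and quality lines are trimmed by the same amount so that each base keeps its own quality score.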

This article is an English version of an article originally published in Chinese on aliyun.com.
