FastQC installation and usage instructions

Source: Internet
Author: User

1. Download fastqc

wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
2. Decompress
unzip fastqc_v0.11.5.zip
3. Grant execute permission. Otherwise, running fastqc fails with a permission-denied error.
cd FastQC
chmod 755 fastqc
4. Add to PATH (replace /home/h/FastQC/ with the directory where you unzipped FastQC)
export PATH=/home/h/FastQC/:$PATH
5. Test
fastqc --help
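The export in step 4 only lasts for the current shell session. As a minimal sketch, the helper below prepends a directory to PATH only when it is not already listed, so it can be called repeatedly (for example from ~/.bashrc) without growing PATH. The name `prepend_path` is hypothetical and introduced here for illustration; /home/h/FastQC is the install location from step 4.

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
prepend_path() {
    dir=${1%/}                      # drop a trailing slash so comparisons match
    case ":$PATH:" in
        *":$dir:"*) ;;              # already present: leave PATH unchanged
        *) PATH="$dir:$PATH" ;;
    esac
    export PATH
}

# Install location taken from step 4; adjust to wherever FastQC was unzipped.
prepend_path /home/h/FastQC
```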
Example
fastqc -o ./tmp.result/fastQC/ -t 6 ./tmp.data/fastq/H1EScell-dnase-2014-GSE56869_20151208_SRR1248176_1.fq
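To run the same command over several files, a small wrapper can help. This is only a sketch: `run_fastqc_dir` and the `FASTQC` override variable are hypothetical names introduced here, and the `.fq` extension follows the example above. The output directory is created first because FastQC expects the directory passed to -o to exist already.

```shell
# Hypothetical wrapper: run FastQC on every .fq file in a directory.
# Usage: run_fastqc_dir INPUT_DIR OUTPUT_DIR [THREADS]
run_fastqc_dir() {
    in_dir=$1
    out_dir=$2
    threads=${3:-2}
    # FastQC expects the directory passed to -o to exist already.
    mkdir -p "$out_dir" || return 1
    # FASTQC is an assumed override for the executable name (default: fastqc).
    "${FASTQC:-fastqc}" -o "$out_dir" -t "$threads" "$in_dir"/*.fq
}
```

With the paths from the example, it would be called as: run_fastqc_dir ./tmp.data/fastq ./tmp.result/fastQC 6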
Options:
-o, --outdir: directory in which FastQC stores the generated report files. The report file names are derived from the input file name.
-t, --threads: number of threads to run. Each thread reserves its own block of memory, so do not request more threads than the machine can support. FastQC assigns one input file per thread, so extra threads speed things up only when several input files are given.

Reading the FastQC report

The report is divided into several modules. Each module is marked with a green check mark if it passes, a "!" for a warning, and a red cross if it fails.

Basic Statistics
Encoding: the sequencing platform version and the corresponding quality-encoding version. This is needed when converting Phred scores back to an error probability P.
Total Sequences: the number of reads in the input file.
Sequence length: the read length.
%GC: the GC content of the sequences overall, and the value to watch most closely. It is generally species specific; for example, human cells are about 42%.

Per base sequence quality
The horizontal axis is the base position in the read, from the 1st base to the 101st base.
The vertical axis is the quality score, Q = -10 * log10(P), where P is the error probability. Q20 therefore corresponds to an error rate of 1%, and Q30 to 0.1%.
Each boxplot in the figure summarizes the sequencing quality of all reads at that position.
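The relation between quality score and error probability, Q = -10 * log10(P), can be checked numerically. The sketch below uses awk for the arithmetic; the function names `phred_to_error` and `error_to_phred` are hypothetical and introduced only for illustration.

```shell
# Convert a Phred quality score Q to an error probability: P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Convert an error probability P back to a Phred score: Q = -10 * log10(P).
error_to_phred() {
    awk -v p="$1" 'BEGIN { printf "%.1f\n", -10 * log(p) / log(10) }'
}

phred_to_error 20    # prints 0.01  (1% error rate)
phred_to_error 30    # prints 0.001 (0.1% error rate)
error_to_phred 0.01  # prints 20.0
```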
In each boxplot, the upper whisker is the 90% quantile and the lower whisker is the 10% quantile; the line in the middle of the box is the median (50% quantile), the top of the box is the 75% quantile, and the bottom of the box is the 25% quantile.
The blue line connects the mean quality at each position.
As a rule, the 10% quantile at every position in this figure should be above 20; this is what is commonly called the Q20 filter. For the sequencing results above, the bases after position 87 therefore need to be trimmed so that downstream analysis stays reliable.
Warning: raised if the lower quartile at any position is below 10, or the median at any position is below 25.
Failure: raised if the lower quartile at any position is below 5, or the median at any position is below 20.

Per tile sequence quality
The horizontal axis is the same as before: each of the 101 base positions.
The vertical axis is the tile index.
This plot exists to catch tiles that were affected by uncontrollable factors during the sequencing run and therefore produced low-quality reads.
Blue indicates high sequencing quality, and warm colors indicate low quality. If a tile shows warm colors, all reads from that tile can be removed in downstream analysis.

Per sequence quality scores
For a read of 101 bp, the mean of the Q values over its 101 positions is taken as the quality value of that read.
The horizontal axis, 0 to 40, is the Q value.
The vertical axis is the number of reads with each mean Q value.
In our data the reads are concentrated at high scores, which shows that the sequencing quality is good.
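The Q20 rule on the 10% quantile can also be applied to FastQC's text output instead of reading it off the plot. The sketch below assumes the tab-separated fastqc_data.txt file found in the report archive, and assumes that its "Per base sequence quality" module lists the 10th percentile in the sixth column; the function name `first_low_quality_base` is hypothetical.

```shell
# Print the first base position (or position range) whose 10th percentile
# quality falls below a threshold, reading a fastqc_data.txt file.
# Assumed column layout: Base, Mean, Median, Lower Quartile, Upper Quartile,
# 10th Percentile, 90th Percentile (tab-separated).
# Usage: first_low_quality_base fastqc_data.txt [THRESHOLD]
first_low_quality_base() {
    awk -F '\t' -v min="${2:-20}" '
        /^>>Per base sequence quality/ { in_module = 1; next }
        /^>>END_MODULE/                { in_module = 0 }
        in_module && !/^#/ && $6 + 0 < min { print $1; exit }
    ' "$1"
}
```

If this prints 88, reads would be kept up to position 87, matching the trimming decision described above. Nothing is printed when every position meets the threshold.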
Per base sequence content
The horizontal axis is the base position, 1 to 101 bp; the vertical axis is the percentage.
The four lines in the figure show the average proportion of A, T, C and G at each position.
In theory A and T should be equal, and G and C should be equal, but the sequencer is unstable at the start of a run, so an imbalance is likely at the first positions. In that case the beginning of each read has to be cut even if its quality scores are very high; here, the first 5 bp are cut.

Per sequence GC content
This distribution plot shows the average GC content of each sequence.
The horizontal axis is GC content from 0 to 100%; the vertical axis is the number of sequences with that GC content.
The blue line is the theoretical distribution that the program derives from the empirical distribution, and the red line is the actual distribution. The two should be close to each other.
When the red line departs clearly from the blue one, for example with a second peak, the sample almost certainly contains DNA sequences from another species.
The plot in this example looks good.

Sequence Length Distribution
In theory every read should have exactly the same length, but there are always some deviations.
Here 101 bp dominates. A small number of reads have other lengths, such as 100 bp, but they are too few to affect downstream analysis.
If the read lengths differ severely, the data produced by the sequencer during the run cannot be trusted.

Adapter Content
This plot measures adapter contamination at the ends of the reads.
If nothing was passed to the -a option when fastqc was run, the four generic adapter sequences shown in the legend are used by default.
In this example the adapters have already been removed.
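The per-sequence GC distribution can be approximated directly from a FASTQ file. The sketch below assumes four lines per record with the sequence on the second line; the function name `gc_per_read` is hypothetical. It prints one rounded GC percentage per read.

```shell
# Print the GC percentage (rounded to a whole number) of each read in a FASTQ file.
# Assumes four lines per record, with the sequence on the second line.
gc_per_read() {
    awk '
        NR % 4 == 2 && length($0) > 0 {
            len = length($0)            # read length, taken before gsub edits $0
            gc = gsub(/[GCgc]/, "")     # gsub returns the number of G/C bases
            printf "%.0f\n", 100 * gc / len
        }
    ' "$1"
}
```

Piping the output through sort -n | uniq -c tallies how many reads fall at each GC percentage, which is the quantity plotted on the vertical axis.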
If the adapter sequences have not been removed, use the cutadapt software to trim them before downstream analysis.

Kmer Content
This plot counts how often certain short sequences occur in the reads.
In this example, several of the short sequences in the legend occur very frequently at positions 1-8 bp. This usually has one of two causes: either the adapters were not removed and the -a option was not used, or the sequences themselves are highly repetitive, for example because of PCR bias during library construction.
In this case, my solution is to cut off the start of each read, trying a cut of 5 to 8 bp.
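Cutting a fixed number of bases from the start of every read can be sketched with awk, as below. This assumes four lines per FASTQ record, and `trim_read_start` is a hypothetical name; in practice cutadapt, already mentioned above, offers a --cut option for the same purpose.

```shell
# Remove the first N bases (and the matching quality values) from every read.
# Assumes four lines per FASTQ record: header, sequence, "+", qualities.
# Usage: trim_read_start N reads.fq > trimmed.fq
trim_read_start() {
    awk -v n="$1" '
        NR % 4 == 2 || NR % 4 == 0 { print substr($0, n + 1); next }
        { print }
    ' "$2"
}
```

The sequence and quality lines are shortened by the same amount, so the two stay aligned in the output.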
