The raw output of a high-throughput sequencer is a FASTQ file in which each read occupies four lines: an identifier line, the sequence, a separator line beginning with "+", and a line of quality values that corresponds base by base to the sequence. The first step in processing high-throughput data is quality control, which includes removing adapters, filtering out low-quality reads, trimming low-quality 3' and 5' ends, and removing reads that contain too many N bases. Many quality-control programs exist for high-throughput sequencing data; this article introduces a long-established one, FASTX-Toolkit (fastx_toolkit). It is a package containing a number of quality-control commands, whose parameters and usage are explained below:
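To make the four-line layout concrete, here is a minimal Python sketch of a FASTQ reader. It is not part of FASTX-Toolkit; the names `FastqRecord` and `read_fastq` are made up for illustration, and the sketch assumes each record occupies exactly four lines with no line wrapping.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # identifier line without the leading '@'
    sequence: str  # base calls
    quality: str   # ASCII-encoded quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


# A single-read example held in memory instead of a file.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
```

Each record pairs a sequence with a quality string of the same length, which is the relationship all of the quality-control commands rely on.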
1. fastq_quality_converter [-h] [-a] [-n] [-z] [-i INFILE] [-o OUTFILE]: convert quality values so that they can be inspected visually
[-h] = print help
[-a] = output ASCII quality scores (default)
[-n] = output numeric quality scores
[-z] = compress the output with gzip
[-i INFILE] = input file in FASTA/FASTQ format
[-o OUTFILE] = output file in FASTA/FASTQ format
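The difference between the ASCII and numeric quality forms can be sketched in a few lines of Python. This is an illustration rather than the tool's own code: the function names are hypothetical, and the sketch assumes Phred+33 encoding (older Illumina files use an offset of 64 instead).

```python
from typing import List


def ascii_to_numeric(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into numeric scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def numeric_to_ascii(scores: List[int], offset: int = 33) -> str:
    """Encode numeric scores back into an ASCII quality string."""
    return "".join(chr(score + offset) for score in scores)


scores = ascii_to_numeric("II5#")
```

Here `I` decodes to 40, `5` to 20 and `#` to 2, so the numeric form makes a poor base easy to spot by eye.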
2. fastq_masker [-h] [-v] [-q N] [-r C] [-z] [-i INFILE] [-o OUTFILE]: mask low-quality bases
[-q N] = quality threshold; bases whose quality is below this value are masked (default 10)
[-r C] = replace low-quality bases with the character C (default N)
[-z] = compress the output with gzip
[-i INFILE] = input FASTQ file
[-o OUTFILE] = output file
[-v] = verbose: report the number of sequences. If -o is used, the report is printed to stdout; otherwise it is printed to stderr
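The masking rule can be sketched as follows. This is a hedged illustration, not the fastq_masker source: `mask_low_quality` is a hypothetical name, Phred+33 encoding is assumed, and the defaults mirror the -q 10 and -r N defaults listed above.

```python
def mask_low_quality(sequence: str, quality: str, threshold: int = 10,
                     replacement: str = "N", offset: int = 33) -> str:
    """Replace every base whose quality is below `threshold` with `replacement`."""
    return "".join(
        base if ord(q) - offset >= threshold else replacement
        for base, q in zip(sequence, quality)
    )


masked = mask_low_quality("ACGT", "I#I+")
```

In this example the second base has quality 2 and is masked, while the last base has quality exactly 10 and is kept because only values below the threshold are replaced. On the command line the corresponding call would look something like `fastq_masker -q 10 -r N -i in.fastq -o masked.fastq`, where the file names are placeholders.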
3. fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]: filter out low-quality sequences
[-q N] = minimum quality value to keep
[-p N] = minimum percentage of bases in each read that must have a quality of at least the -q value
[-z] = compressed output
[-v] = verbose: report the number of sequences. If -o is used, the report is printed to stdout; otherwise it is printed to stderr
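How -q and -p combine can be shown with a small Python sketch. It assumes that -p is a percentage of bases and that qualities are Phred+33 encoded; `passes_quality_filter` is a hypothetical helper, and the tool's exact rounding may differ.

```python
def passes_quality_filter(quality: str, min_quality: int, min_percent: float,
                          offset: int = 33) -> bool:
    """Return True if at least `min_percent` percent of bases reach `min_quality`."""
    if not quality:
        return False  # an empty read cannot satisfy the filter
    good = sum(1 for q in quality if ord(q) - offset >= min_quality)
    return good * 100 >= min_percent * len(quality)


keep_80 = passes_quality_filter("IIII#", min_quality=20, min_percent=80)
keep_90 = passes_quality_filter("IIII#", min_quality=20, min_percent=90)
```

Four of the five bases reach quality 20, so the read passes at 80 percent but fails at 90 percent. The matching command would be along the lines of `fastq_quality_filter -q 20 -p 80 -i in.fastq -o filtered.fastq`, with placeholder file names.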
4. fastq_quality_trimmer [-h] [-v] [-t N] [-l N] [-z] [-i INFILE] [-o OUTFILE]: trim the ends of reads
[-t N] = quality threshold; bases with a quality lower than N are trimmed from the end (3' end) of the read
[-l N] = minimum allowed read length after trimming; shorter reads are discarded
[-z] = compressed output
[-v] = verbose: report the number of sequences. If -o is used, the report is printed to stdout; otherwise it is printed to stderr
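A rough Python sketch of quality-based end trimming is given below. It rests on two assumptions that should be checked against the tool itself: that trimming proceeds inward from the 3' end until a base reaches the threshold, and that qualities are Phred+33 encoded. The name `trim_low_quality_tail` is hypothetical.

```python
from typing import Optional, Tuple


def trim_low_quality_tail(sequence: str, quality: str, threshold: int,
                          min_length: int = 0,
                          offset: int = 33) -> Optional[Tuple[str, str]]:
    """Drop trailing bases below `threshold`; return None if the read gets too short."""
    end = len(quality)
    # Walk inward from the 3' end while the current last base is below threshold.
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    if end < min_length:
        return None  # read would be discarded
    return sequence[:end], quality[:end]


trimmed = trim_low_quality_tail("ACGTAC", "IIII##", threshold=20)
discarded = trim_low_quality_tail("ACGTAC", "IIII##", threshold=20, min_length=5)
```

The two low-quality bases at the end are removed, leaving a four-base read, which is then discarded when the minimum length is set to 5. A comparable command line would be `fastq_quality_trimmer -t 20 -l 5 -i in.fastq -o trimmed.fastq`, with placeholder file names.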
5. fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]: convert FASTQ to FASTA
[-r] = rename sequence identifiers to serial numbers
[-n] = keep sequences that contain N; they are discarded by default
[-z] = compressed output
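The conversion and the effect of -r and -n can be sketched in Python. This is an illustration under stated assumptions, not the tool's implementation: `fastq_to_fasta` here is a hypothetical function that takes (name, sequence, quality) tuples, and numbering the kept reads from 1 is an assumption about how renaming works.

```python
from typing import Iterable, Tuple


def fastq_to_fasta(records: Iterable[Tuple[str, str, str]],
                   rename: bool = False, keep_n: bool = False) -> str:
    """Convert (name, sequence, quality) records to FASTA text."""
    lines = []
    kept = 0
    for name, sequence, _quality in records:
        if not keep_n and "N" in sequence.upper():
            continue  # reads containing N are dropped unless keep_n is set
        kept += 1
        header = str(kept) if rename else name
        lines.append(f">{header}\n{sequence}")
    return "\n".join(lines)


reads = [("r1", "ACGT", "IIII"), ("r2", "ANGT", "IIII"), ("r3", "GGCC", "IIII")]
default_fasta = fastq_to_fasta(reads)
renamed_fasta = fastq_to_fasta(reads, rename=True, keep_n=True)
```

With the defaults the read containing N disappears from the output; with both options set, all three reads are kept and their identifiers become 1, 2 and 3.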
6. fastx_trimmer [-h] [-f N] [-l N] [-t N] [-m MINLEN] [-z] [-v] [-i INFILE] [-o OUTFILE]: choose which part of each read, counted from the 5' end to the 3' end, is kept
[-f N] = first base to keep; the default is 1 (the first base)
[-l N] = last base to keep; by default the entire read is kept
[-t N] = trim N bases from the end of the sequence
[-m MINLEN] = discard sequences shorter than MINLEN
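Because -f and -l are 1-based positions while most programming languages index from 0, a short Python sketch may help. The helpers `keep_range` and `trim_tail` are hypothetical names, and treating the position-based and tail-based modes as separate functions is a simplification made for this example.

```python
from typing import Optional


def keep_range(sequence: str, first: int = 1, last: Optional[int] = None) -> str:
    """Keep bases from position `first` to `last` (1-based, inclusive)."""
    return sequence[first - 1:last]


def trim_tail(sequence: str, n: int, min_length: int = 0) -> Optional[str]:
    """Remove `n` bases from the end; return None if the result is too short."""
    trimmed = sequence[:max(len(sequence) - n, 0)]
    return trimmed if len(trimmed) >= min_length else None


middle = keep_range("ACGTACGT", first=3, last=6)
shortened = trim_tail("ACGTACGT", 3)
```

Keeping positions 3 to 6 of an eight-base read returns four bases, and trimming three bases from the end leaves the first five. On the command line these would correspond to something like `fastx_trimmer -f 3 -l 6 -i in.fastq -o out.fastq` and `fastx_trimmer -t 3 -i in.fastq -o out.fastq`, with placeholder file names.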
7. fastx_quality_stats [-h] [-N] [-i INFILE] [-o OUTFILE]: compute statistics on the quality values of a FASTQ file
[-i INFILE] = input FASTQ file
[-o OUTFILE] = name of the output text file
[-N] = use the new output format; the old format is used by default
Old output format: the output file contains one row per read position (column), with the following fields
column = column number (1 to 36 for 36-cycle reads)
count = number of bases found in this column
min = lowest base quality value in this column
max = highest base quality value in this column
sum = sum of the base quality values in this column
mean = mean base quality value in this column
Q1 = first quartile of the base quality values
med = median of the base quality values
Q3 = third quartile of the base quality values
IQR = interquartile range (Q3 - Q1)
lW = 'left-whisker' value (for box plotting)
rW = 'right-whisker' value (for box plotting)
A_Count = number of A bases in this column
C_Count = number of C bases in this column
G_Count = number of G bases in this column
T_Count = number of T bases in this column
N_Count = number of N bases in this column
max-count = maximum number of bases (over all columns)
New output format:
cycle = cycle number (called column in the old format)
max-count = maximum number of bases
For each base type in the cycle (ALL/A/C/G/T/N):
count = number of bases found in this column
min = lowest base quality value in this column
max = highest base quality value in this column
sum = sum of the base quality values in this column
mean = mean base quality value in this column
Q1 = first quartile of the base quality values
med = median of the base quality values
Q3 = third quartile of the base quality values
IQR = interquartile range (Q3 - Q1)
lW = 'left-whisker' value (for box plotting)
rW = 'right-whisker' value (for box plotting)
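The per-column fields listed above can be reproduced approximately with a short Python sketch. It is not the tool's algorithm: the quartile method (`statistics.quantiles` with the inclusive method) and the whisker rule (1.5 times the IQR, clamped to the observed minimum and maximum) are assumptions chosen for illustration, `column_stats` is a hypothetical name, and Phred+33 encoding is assumed.

```python
from collections import Counter
from statistics import quantiles
from typing import Dict, List, Tuple


def column_stats(reads: List[Tuple[str, str]], offset: int = 33) -> List[Dict[str, float]]:
    """Return one dict of quality statistics per read position (column)."""
    n_columns = max((len(seq) for seq, _ in reads), default=0)
    rows = []
    for col in range(n_columns):
        # Only reads long enough to cover this column contribute to it.
        covering = [(seq[col], ord(qual[col]) - offset)
                    for seq, qual in reads if col < len(seq)]
        quals = sorted(q for _, q in covering)
        bases = Counter(base.upper() for base, _ in covering)
        if len(quals) >= 2:
            q1, med, q3 = quantiles(quals, n=4, method="inclusive")
        else:
            q1 = med = q3 = float(quals[0])
        iqr = q3 - q1
        rows.append({
            "column": col + 1,
            "count": len(quals),
            "min": quals[0], "max": quals[-1],
            "sum": sum(quals), "mean": sum(quals) / len(quals),
            "Q1": q1, "med": med, "Q3": q3, "IQR": iqr,
            "lW": max(quals[0], q1 - 1.5 * iqr),   # assumed whisker rule
            "rW": min(quals[-1], q3 + 1.5 * iqr),  # assumed whisker rule
            **{f"{b}_count": bases.get(b, 0) for b in "ACGTN"},
        })
    return rows


reads = [("ACGT", "IIII"), ("ACGA", "I5I#"), ("AC", "II")]
rows = column_stats(reads)
```

Each element of `rows` corresponds to one line of the old-format table; the count drops in later columns when some reads are shorter than others.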
8. fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]: draw a box plot of the base quality distribution
[-p] = generate a PostScript (.ps) file; a PNG image is produced by default
[-i INPUT.TXT] = input file, which should be the output file of fastx_quality_stats
[-o OUTPUT] = name of the output file
[-t TITLE] = title of the output image
9. fastx_nucleotide_distribution_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]: plot the base (nucleotide) distribution
[-p] = generate a PostScript (.ps) file; a PNG image is produced by default
[-i INPUT.TXT] = input file, which should be the output file of fastx_quality_stats
[-o OUTPUT] = name of the output file
[-t TITLE] = title of the output image
10. fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-v] [-z] [-i INFILE] [-o OUTFILE]: remove adapter sequences
[-a ADAPTER] = adapter sequence (default CCTTAAGG)
[-l N] = discard reads shorter than N bases (default 5)
[-d N] = keep the adapter and the N bases after it (default -d 0)
[-c] = discard sequences that do not contain the adapter
[-C] = keep only sequences that do not contain the adapter
[-k] = report adapter-only sequences
[-n] = keep sequences that contain N; they are discarded by default
[-v] = verbose: report the number of sequences
[-z] = compressed output
[-D] = output debug information
[-M N] = require at least N bases to match the adapter; if fewer than N bases match the adapter, the read is not clipped
[-i INFILE] = input file
[-o OUTFILE] = output file
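The idea behind adapter clipping can be sketched in Python. This is a deliberately simplified stand-in: it uses exact string matching, whereas the real tool aligns the adapter and may tolerate mismatches. The name `clip_adapter` and the `min_match` default of 4 are illustrative choices, not the tool's; `min_length=5` mirrors the -l default listed above.

```python
from typing import Optional


def clip_adapter(sequence: str, adapter: str = "CCTTAAGG",
                 min_length: int = 5, min_match: int = 4) -> Optional[str]:
    """Clip an exact adapter match; return None if the clipped read is too short."""
    pos = sequence.find(adapter)
    if pos == -1:
        # No full adapter: look for an adapter prefix overlapping the 3' end,
        # trying the longest overlap first and stopping at `min_match` bases.
        longest = min(len(adapter) - 1, len(sequence))
        for k in range(longest, max(min_match, 1) - 1, -1):
            if sequence.endswith(adapter[:k]):
                pos = len(sequence) - k
                break
    clipped = sequence if pos == -1 else sequence[:pos]
    return clipped if len(clipped) >= min_length else None


full = clip_adapter("ACGTACGTCCTTAAGGTTTT")
partial = clip_adapter("ACGTACGTACCTTA")
too_short = clip_adapter("CCTTAAGGACGT")
```

The first read contains the whole adapter and is cut just before it, the second ends in a five-base adapter prefix that is removed, and the third starts with the adapter so nothing usable remains. A comparable command line would be `fastx_clipper -a CCTTAAGG -l 5 -i in.fastq -o clipped.fastq`, with placeholder file names.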
To reprint this article, please contact the original author for authorization and indicate that it comes from Chengchao's ScienceNet blog.
Link Address:http://blog.sciencenet.cn/blog-1509670-848270.html
Fastx_toolkit Software usage Instructions