Preparation before sequence alignment: read preprocessing
After running FastQC, what tools can we use to fix the problems it reveals, such as low sequence quality? The FASTX-Toolkit is a collection of command-line tools for processing FASTQ/FASTA files. It is written mainly in C/C++ and is widely used for high-throughput sequencing data, most often to trim reads.

FASTQ-to-FASTA
Description: this command converts a FASTQ file to FASTA format.
fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
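The conversion itself is simple, and a minimal Python sketch may make the default behaviour clearer. This is not the toolkit's own code: the function name `fastq_to_fasta` and the `rename` and `keep_n` arguments are hypothetical stand-ins for the -r and -n switches, and numbering only the reads that are kept is an assumption.

```python
def fastq_to_fasta(fastq_text, rename=False, keep_n=False):
    """Convert FASTQ text to FASTA text (illustrative sketch only).

    rename : stand-in for -r, replace read names with serial numbers
    keep_n : stand-in for -n, keep reads that contain N
    """
    lines = fastq_text.strip().splitlines()
    out = []
    kept = 0
    # A FASTQ record is four lines: @name, sequence, '+', qualities.
    for i in range(0, len(lines) - 3, 4):
        name = lines[i][1:]          # drop the leading '@'
        seq = lines[i + 1]
        if "N" in seq.upper() and not keep_n:
            continue                 # default: reads with unknown bases are dropped
        kept += 1
        out.append(">" + (str(kept) if rename else name))
        out.append(seq)
    return "".join(line + "\n" for line in out)


demo = "@r1\nACGT\n+\nIIII\n@r2\nANGT\n+\nIIII\n"
print(fastq_to_fasta(demo), end="")
```

With the demo input, only `r1` survives unless `keep_n=True` is passed, which mirrors the default of dropping reads that contain N.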
[-h] = print help information
[-r] = rename the reads with serial numbers instead of their original names
[-n] = keep reads that contain N (by default, reads containing N are discarded)
[-v] = verbose: report the total number of reads
[-z] = compress the output with gzip
[-i INFILE] = input file (FASTA/Q)
[-o OUTFILE] = output file; if this parameter is not set, the output is written to the screen (standard output)

FASTX Statistics
Description: this command computes basic statistics for the sequences, such as per-position quality scores and base composition (from which the GC content follows). It is rarely used now and has largely been replaced by FastQC.
fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]
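To give an idea of what such statistics look like, here is a small Python sketch that computes a few per-position values and the overall GC fraction. It only illustrates the kind of numbers involved; the function names, the dictionary keys and the `(sequence, quality)` input format are assumptions, not the output format of fastx_quality_stats.

```python
from collections import Counter


def column_stats(records, quality_base=33):
    """Per-position quality and base counts.

    records: list of (sequence, quality_string) pairs, Phred+33 by default.
    """
    longest = max(len(seq) for seq, _ in records)
    table = []
    for col in range(longest):
        # Only reads long enough to reach this column contribute to it.
        covering = [(seq, qual) for seq, qual in records if col < len(seq)]
        scores = [ord(qual[col]) - quality_base for _, qual in covering]
        bases = Counter(seq[col].upper() for seq, _ in covering)
        table.append({
            "column": col + 1,
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": sum(scores) / len(scores),
            "bases": dict(bases),
        })
    return table


def gc_fraction(records):
    """Fraction of G and C over all bases in all reads."""
    total = sum(len(seq) for seq, _ in records)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq, _ in records)
    return gc / total


reads = [("ACGT", "IIII"), ("GGTA", "I5II")]
for row in column_stats(reads):
    print(row)
print(gc_fraction(reads))
```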
[-h] = print help information
[-i INFILE] = input file (FASTA/Q)
[-o OUTFILE] = output file; if this parameter is not set, the output is written to the screen (standard output)

FASTA/Q Clipper
Description: this command is mainly used to filter reads and clip the adapter.
fastx_clipper [-h] [-a ADAPTER] [-i INFILE] [-o OUTFILE]
[-h] = print help information
[-a ADAPTER] = adapter sequence; the default is the dummy adapter CCTTAAGG
[-l N] = discard reads shorter than N bases; the default is 5
[-d N] = keep the adapter and the N bases after it; -d 0 is the same as not using -d at all
[-c] = discard reads that were not clipped, i.e. keep only reads that contained the adapter
[-C] = discard reads that were clipped, i.e. keep only reads that did not contain the adapter
[-k] = report adapter-only sequences
[-n] = keep reads that contain N (by default, reads containing N are discarded)
[-v] = verbose: report the total number of sequences
[-z] = compress the output with gzip
[-D] = debug output

FASTA/Q Trimmer
Description: this is the tool I use most often; it quickly cuts every sequence down to a fixed range of positions.
fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]
[-h] = print help information
[-f N] = first base to keep; the default is 1 (the first base)
[-l N] = last base to keep; by default the entire sequence is kept
[-z] = compress the output with gzip

cutadapt
cutadapt is the most commonly used tool for removing adapters. It is written in Python and distributed as a Python package.
# Most basic usage
# cutadapt is very powerful and has dozens of parameters, but we usually only use a few of them, which I introduce here.
# In the most basic form, remove a 3' adapter sequence
cutadapt -a AACCGGTT -o output.fastq input.fastq
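To make "removing a 3' adapter" concrete: the adapter and everything after it are cut off, and a read that ends in only the beginning of the adapter is cut as well. The Python sketch below illustrates that idea with exact string matching only. It is a deliberate simplification and not cutadapt's algorithm, which uses error-tolerant alignment; `remove_3prime_adapter` and `min_overlap` are hypothetical names, and the value 3 is assumed here to match cutadapt's default minimum overlap.

```python
def remove_3prime_adapter(read, adapter, min_overlap=3):
    """Exact-match sketch of 3' adapter removal (no mismatches allowed)."""
    # Case 1: the complete adapter occurs in the read.
    # Everything from its first occurrence onwards is removed.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with the first k bases of the adapter.
    # Try the longest possible partial match first.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    # No adapter found: the read is returned unchanged.
    return read


print(remove_3prime_adapter("ACGTACGTAACCGGTTGGG", "AACCGGTT"))  # full adapter inside the read
print(remove_3prime_adapter("ACGTACGTAACC", "AACCGGTT"))         # partial adapter at the end
```

A real tool trims the quality string at the same position as the sequence; the sketch leaves that out to stay short.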
# Compressed files can be used as input directly, without changing the command; to compress the output, add .gz to the output file name.
cutadapt -a AACCGGTT -o output.fastq.gz input.fastq.gz
# Remove both the 3' adapter AAAAAAA and the 5' adapter TTTTTTT
cutadapt -a AAAAAAA -g TTTTTTT -o output.fastq input.fastq
# cutadapt can also cut a fixed number of bases from reads, for example removing the first 5 bp.
cutadapt -u 5 -o trimmed.fastq input_reads.fastq
# Filtering reads by sequencing quality
# cutadapt can trim low-quality ends with the -q parameter. The rationale is that the start and end of a read may have poor quality because of the sequencer state or the reaction time, and -q is a rough way to remove them. Note in particular that the -q value is not applied as a simple per-base Phred cutoff: the software subtracts it from each quality value and uses a running sum from the end of the read to decide where to cut. With a single value, only the 3' end is trimmed. --quality-base=33 indicates that the qualities use the Phred+33 encoding.
cutadapt -q 10 --quality-base=33 -o output.fastq input.fastq
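The running-sum idea can be sketched in a few lines of Python. This follows the procedure as described in the cutadapt documentation (subtract the cutoff from every quality, sum from the 3' end, cut where the sum is smallest), but it is an illustrative re-implementation for the 3' end only; the function name `quality_trim_3prime` is hypothetical.

```python
def quality_trim_3prime(qualities, cutoff, quality_base=33):
    """Return how many bases to keep, counted from the 5' end.

    qualities: the FASTQ quality string of one read.
    """
    running = 0           # partial sum of (quality - cutoff) from the 3' end
    best = 0              # smallest partial sum seen so far
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += (ord(qualities[i]) - quality_base) - cutoff
        if running > 0:
            break         # stop early once the sum becomes positive
        if running < best:
            best = running
            keep = i      # cutting here removes bases i .. end
    return keep


# Quality values 42, 40, 26, 27, 8, 7, 11, 4, 2, 3 encoded as Phred+33.
quals = "".join(chr(q + 33) for q in [42, 40, 26, 27, 8, 7, 11, 4, 2, 3])
print(quality_trim_3prime(quals, cutoff=10))
```

In this example the base with quality 11 is above the cutoff but is still removed, because it is surrounded by low-quality bases. That is the practical difference from a plain per-base threshold.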
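# Trim both the 3' end and the 5' end. With two comma-separated values, the first is the 5' cutoff and the second is the 3' cutoff, so here the 3' threshold is 10 and the 5' threshold is 15.
cutadapt -q 15,10 --quality-base=33 -o output.fastq input.fastq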
Filtering reads by length
[--minimum-length N or -m N] # reads shorter than N are discarded
[--too-short-output FILE] # reads caught by the previous parameter are written to FILE instead of being discarded
[--maximum-length N or -M N] # reads longer than N are discarded
[--too-long-output FILE] # reads caught by the previous parameter are written to FILE instead of being discarded

Trimming paired-end reads
# Much of today's sequencing is paired-end. By the sequencing principle, both reads of a pair come from the same cluster, so it may be better to trim the adapters of the two reads together. cutadapt provides this feature:
cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq
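One practical reason to process both files together is that the two outputs must stay in step: if a read becomes too short after trimming, its mate has to go too, otherwise read k in the first file no longer pairs with read k in the second. The Python sketch below illustrates this with a minimum-length filter. It assumes the rule "drop the pair if either read is too short", which to my understanding matches cutadapt's default pair filtering; the function name `filter_pairs_by_length` is hypothetical.

```python
def filter_pairs_by_length(reads1, reads2, min_length):
    """Keep only pairs in which both reads have at least min_length bases.

    reads1, reads2: lists of already trimmed sequences, in the same order.
    Returns (kept1, kept2, too_short_pairs).
    """
    if len(reads1) != len(reads2):
        raise ValueError("paired files must contain the same number of reads")
    kept1, kept2, too_short = [], [], []
    for r1, r2 in zip(reads1, reads2):
        if len(r1) < min_length or len(r2) < min_length:
            too_short.append((r1, r2))   # the whole pair is set aside
        else:
            kept1.append(r1)
            kept2.append(r2)
    return kept1, kept2, too_short


fwd = ["ACGTACGT", "ACG", "ACGTAC"]
rev = ["TTTTGGGG", "TTTTGGGG", "TT"]
kept_fwd, kept_rev, short_pairs = filter_pairs_by_length(fwd, rev, min_length=5)
print(kept_fwd, kept_rev, len(short_pairs))
```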
# -a is the adapter sequence for the 1st file
# -A is the adapter sequence for the 2nd file
# -o is the 1st output file
# -p is the 2nd output file

In practice, most of the data we receive from the sequencing company has already had its adapters removed, so we are more concerned with trimming reads. We first use FastQC to evaluate the quality of test1.fastq and test2.fastq. The main results of the evaluation are as follows: from the two figures above we can see that the sequencing quality of read1 is clearly better than that of read2. We generally decide how many bases to trim based on the Phred 20 standard. For our test data, read1 does not need trimming, and read2 should keep bases 1-85. The corresponding fastx_trimmer command is as follows:
fastx_trimmer -i test_data_2.fastq -o test_data_2_trim.fastq -f 1 -l 85
[-f N] = first base to keep; the default is 1 (the first base)
[-l N] = last base to keep; by default the entire sequence is kept
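As a rough illustration of how a cut point such as 85 could be derived from the data instead of read off the FastQC plot, the Python sketch below computes the mean quality at each position, finds the last position before the mean first drops below 20, and then keeps that range of bases in the way -f and -l do. The rule used to pick the position is an assumption made for this example, and all function names are hypothetical.

```python
def mean_quality_per_position(quality_strings, quality_base=33):
    """Mean Phred score at each read position (position 1 is index 0)."""
    longest = max(len(q) for q in quality_strings)
    means = []
    for col in range(longest):
        scores = [ord(q[col]) - quality_base for q in quality_strings if col < len(q)]
        means.append(sum(scores) / len(scores))
    return means


def last_good_position(means, threshold=20):
    """1-based position of the last base before the mean falls below threshold."""
    for idx, mean in enumerate(means):
        if mean < threshold:
            return idx            # idx bases (positions 1..idx) are still good
    return len(means)


def trim(seq, qual, first=1, last=None):
    """Keep bases first..last (1-based, inclusive), like -f and -l."""
    end = len(seq) if last is None else last
    return seq[first - 1:end], qual[first - 1:end]


quals = ["IIII5+", "IIII5+"]      # Phred 40, 40, 40, 40, 20, 10
cut = last_good_position(mean_quality_per_position(quals))
print(cut)
print(trim("ACGTAC", "IIII5+", first=1, last=cut))
```

Applying one fixed range to every read is what keeps fastx_trimmer fast and simple; the cost is that good bases in individual reads beyond the cut point are lost.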