Amplicon Analysis Walkthrough 4: Removing Chimeras and Non-Bacterial Sequences, Generating Representative Sequences and the OTU Table
Before this lesson, you should have completed the earlier parts of the amplicon analysis walkthrough:

1. Quality control, experimental design, paired-end read merging
2. Barcode extraction, quality control and sample demultiplexing, removal of amplification primers
3. Format conversion, dereplication, clustering

First take a look at the overall amplicon analysis workflow, which is worked through layer by layer from bottom to top.

Preparation before analysis
# Go to the working directory
cd example_PE250
Recap of the previous section: we created a FASTA file in USEARCH format, dereplicated all sequences and filtered out the low-abundance ones, then clustered them into OTUs. Next, we remove chimeric OTUs and generate the representative sequences and the OTU table.

What is a chimera?

A chimeric sequence is assembled from two or more template sequences (illustrated by a figure in the original post). In PCR, chimeras arise when the extension step does not run to completion. For example, while sequence X is being amplified, extension stops after only part of X has been synthesized. In the next PCR cycle, this partial product acts as a primer on template Y and is extended further, producing a hybrid of X and Y. Put simply, an incompletely extended product primes the next round of PCR.

In ordinary PCR, roughly 1% of products are chimeric. In 16S/18S/ITS amplicon sequencing, where the templates are extremely similar to one another, the proportion can reach 1%-20%, so chimeric sequences need to be removed. The proportion of chimeras depends on the number of PCR cycles: the more cycles, the higher the proportion.

Have you ever played World of Warcraft? Does anyone remember the ultimate weapon of the elves? Its English name is Chimera, the same word used here for hybrid sequences; the Chinese name used in the game is a transliteration.

10. Reference-based chimera removal (optional)

During OTU clustering in the previous step, a large number of chimeras were already removed de novo, based on sequence similarity within the dataset. In earlier analyses, reference-database chimera removal was considered a required step. However, as methods have developed, this step is now thought to risk false negatives. Readers can decide whether to include it based on their experimental design, preliminary results, and expectations. In this example, every step is run, in my personal style, to show you the complete process.

USEARCH previously recommended the RDP dataset for chimera removal and provided a download link. The USEARCH author now recommends a comprehensive large database such as SILVA or UNITE and advises against small databases such as RDP, stating that the earlier recommendation was incorrect. Software methods keep improving. I have not systematically compared the author's new recommendation with the old one, so here we still follow the original method. Readers can try the new method on their own.
# Download the RDP reference database recommended by USEARCH
wget http://drive5.com/uchime/rdp_gold.fa
# Remove known chimeras by comparison against the RDP database
./usearch10 -uchime2_ref temp/otus.fa \
  -db rdp_gold.fa \
  -chimeras temp/otus_chimeras.fa \
  -notmatched temp/otus_rdp.fa \
  -uchimeout temp/otus_rdp.uchime \
  -strand plus -mode sensitive -threads 96
The -uchime2_ref parameter selects reference-based chimera detection and is followed by the OTU sequences (the input file). -db specifies the reference database, here RDP. -chimeras writes the sequences detected as chimeras. -notmatched writes the sequences that do not match the database, that is, sequences that are neither chimeric nor identical to a reference. -uchimeout writes detailed information for each input sequence, such as which parents a chimera came from and how similar it is to them. -strand specifies the strand direction, which is normally plus. -mode selects the detection mode; the price of sensitive is a high false-positive rate in chimera identification. -threads sets the number of threads; by default the program uses one thread per CPU core up to a maximum of 10, and you can set it to suit your machine.

The result of the run above is: Chimeras 2669/5489 (48.6%), in db 51 (0.9%), not matched 2769 (50.4%). In other words, 2669 of the 5489 OTUs are detected as chimeras, 51 are identical to database sequences, and the other 2769 do not match the database. These three classes correspond to Y/N/? in the third column of the temp/otus_rdp.uchime file.

We want to exclude only the chimeras and keep the rest, that is, 51 + 2769 = 2820 OTUs. The approach is to remove every OTU that was identified as a chimera.
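The Y/N/? breakdown can also be tallied straight from the tab-separated report. The following is a minimal sketch, assuming (as described above) that the verdict sits in the third column; the function name count_verdicts and the four-row toy file are hypothetical and are not real USEARCH output.

```shell
# count_verdicts FILE: print "<verdict> <count>" for column 3 of a tab-separated file.
count_verdicts() {
    cut -f 3 "$1" | LC_ALL=C sort | uniq -c | awk '{print $2, $1}'
}

# Toy report with hypothetical rows: label, score, verdict.
tmp=$(mktemp)
printf 'OTU_a\t0.9\tY\nOTU_b\t0.0\tN\nOTU_c\t0.0\t?\nOTU_d\t0.8\tY\n' > "$tmp"
count_verdicts "$tmp"
rm -f "$tmp"

# Arithmetic from the run above: OTUs kept = "in db" + "not matched".
echo $((51 + 2769))   # prints 2820
```

On the real data you would call count_verdicts temp/otus_rdp.uchime and compare the three counts with the summary line that USEARCH prints.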
# Extract the IDs of the chimeric sequences
grep '>' temp/otus_chimeras.fa | sed 's/>//g' > temp/otus_chimeras.id
# Remove the chimeric sequences (-n negates the ID list)
filter_fasta.py -f temp/otus.fa -o temp/otus_non_chimera.fa -s temp/otus_chimeras.id -n
# Check that the sequence count is the expected 2820
grep '>' -c temp/otus_non_chimera.fa
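If QIIME's filter_fasta.py is not at hand, the same kind of negative filter can be approximated with plain awk. This is only a sketch under stated assumptions, not the QIIME implementation: the function name drop_fasta_ids is hypothetical, and a sequence ID is taken to be the header text after ">" up to the first space or tab.

```shell
# drop_fasta_ids IDS FASTA: print the FASTA records whose ID is NOT listed in IDS.
drop_fasta_ids() {
    awk '
        # First file: remember every non-empty ID (first token on the line).
        FILENAME == ARGV[1] { if ($0 != "") drop[$1] = 1; next }
        # Header line: strip the leading ">" and decide whether to keep the record.
        /^>/ { id = substr($1, 2); keep = !(id in drop) }
        # Print header and sequence lines of kept records only.
        keep { print }
    ' "$1" "$2"
}

# Toy data (hypothetical): three OTUs, one flagged as chimeric.
tmp=$(mktemp -d)
printf '>OTU_a\nACGT\n>OTU_b\nGGCC\nTTAA\n>OTU_c\nCCCC\n' > "$tmp/otus.fa"
printf 'OTU_b\n' > "$tmp/chimeras.id"
drop_fasta_ids "$tmp/chimeras.id" "$tmp/otus.fa"   # prints OTU_a and OTU_c records
rm -r "$tmp"
```

Because the decision is made per header line, multi-line sequences are kept or dropped as a whole.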
11. Remove non-bacterial sequences (optional)

This step is not required and may cause false negatives. Much of data analysis comes down to personal habit, so results from different analysts differ slightly. There is also no systematic evaluation of which approach is ultimately better, because good and bad depend on the conditions and the criteria for judging are hard to pin down. This is a matter of experience, accumulated over a large number of projects. My habit, when facing big data in which most of the results are useless, is to look for what is meaningful. My principle is to discard whatever can be discarded, which makes patterns easier to find; if nothing turns up, go back and pick up what was dropped. If you refuse to discard anything, the pattern may stay hidden in the ocean of data forever.

The principle of this step is to align the OTUs against the aligned Greengenes database (http://greengenes.secondgenome.com) and keep the sequences with greater than 75% similarity as bacterial. This removes non-bacterial contamination from external sources; non-bacterial sequences cannot be annotated in the later steps and are difficult to analyze.
# Download the latest Greengenes database, 320 MB
wget -c ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
# Decompress; the unpacked size is 3.4 GB
tar xvzf gg_13_8_otus.tar.gz
# Align the OTUs against the multiple sequence alignment of the 97%-similarity cluster representatives, about 8 min
time align_seqs.py -i temp/otus_non_chimera.fa -t gg_13_8_otus/rep_set_aligned/97_otus.fasta -o temp/aligned/
# Count the sequences that could not be aligned (not bacteria-like): 1860
grep -c '>' temp/aligned/otus_non_chimera_failures.fasta
# Extract the OTU IDs of the sequences that are not bacteria-like
grep '>' temp/aligned/otus_non_chimera_failures.fasta | cut -f 1 -d ' ' | sed 's/>//g' > temp/aligned/otus_non_chimera_failures.id
# Filter out the non-bacterial sequences
filter_fasta.py -f temp/otus_non_chimera.fa -o temp/otus_rdp_align.fa -s temp/aligned/otus_non_chimera_failures.id -n
# Check how many OTUs remain: 975
grep '>' -c temp/otus_rdp_align.fa
After this filtering step, only 975 of the 2820 non-chimeric OTUs are bacteria-like, which is closer to the truth. Some studies report thousands or tens of thousands of OTUs with a false-positive rate above 90%; what do such results mean, and how could they guide downstream experiments?

For fungal ITS/18S, it is generally not recommended to use the UNITE database for chimera removal, because ITS/18S is present in all eukaryotes; the sequences are screened further after species annotation.

12. Generate representative sequences and the OTU table

The representative sequences are the final version of the OTUs, analogous to a reference genome or cDNA set, and will be indexed as the reference. All reads are then mapped to the OTUs to determine the abundance of each one. The OTU table gives the abundance of each OTU in each sample. In essence, every kind of high-throughput sequencing analysis ends in a similar table; for RNA-Seq, for example, it is a table of gene expression by sample.
# Rename the OTUs; these are the final representative sequences, i.e. the reference (optional, personal habit)
awk 'BEGIN {n = 1}; />/ {print ">OTU_" n; n++} !/>/ {print}' temp/otus_rdp_align.fa > result/rep_seqs.fa
# Generate the OTU table
./usearch10 -usearch_global temp/seqs_usearch.fa -db result/rep_seqs.fa -otutabout temp/otu_table.txt -biomout temp/otu_table.biom -strand plus -id 0.97 -threads 10
# Result: 141 Mb, 100.0% searched, 32.3% matched
# With the default 10 threads the run took 1 minute 20 seconds, and 32.3% of the sequences matched an OTU.
# With 30 threads it took 3 minutes 4 seconds instead: more threads is not always faster, because distributing the tasks also takes time.
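To see what the renaming command does, here is a small sketch of the same awk logic run on a toy FASTA stream. The function name rename_otus and the two input records are hypothetical; the real input is temp/otus_rdp_align.fa.

```shell
# rename_otus: read FASTA on stdin and replace each header with >OTU_1, >OTU_2, ...
# Sequence lines are passed through unchanged.
rename_otus() {
    awk 'BEGIN {n = 1} /^>/ {print ">OTU_" n; n++; next} {print}'
}

# Toy input (hypothetical headers); prints >OTU_1, ACGT, >OTU_2, GGCC on four lines.
printf '>Otu7;size=120;\nACGT\n>Otu12;size=3;\nGGCC\n' | rename_otus
```

The numbering follows the order of the records in the input file, so the names carry no meaning beyond position.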
We now have the OTU table; use less temp/otu_table.txt to inspect it. A BIOM file in standard JSON format is produced as well, for use in downstream analysis.
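A quick sanity check on the table is to total the reads per sample. The sketch below assumes the text table is tab-separated, with a header row of sample names and one row per OTU; the function name sample_totals and the sample names KO1 and WT1 are hypothetical.

```shell
# sample_totals FILE: print "<sample><TAB><total count>" for every sample column.
sample_totals() {
    awk -F '\t' '
        # Header row: remember the sample names (column 1 is the OTU ID).
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
        # Data rows: add each count to the running total of its column.
        { for (i = 2; i <= ncol; i++) total[i] += $i }
        END { for (i = 2; i <= ncol; i++) printf "%s\t%d\n", name[i], total[i] }
    ' "$1"
}

# Toy table (hypothetical counts): prints KO1 and WT1, each with a total of 10.
tmp=$(mktemp)
printf '#OTU ID\tKO1\tWT1\nOTU_1\t10\t4\nOTU_2\t0\t6\n' > "$tmp"
sample_totals "$tmp"
rm -f "$tmp"
```

Samples with very low totals are worth a second look before any downstream comparison.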