Introduction to the Transcriptome database

Source: Internet
Author: User

I. NT and NR databases

Everyone is familiar with the NT and NR libraries: NT is a nucleic acid library and NR is a protein library. Both can be searched online with BLAST through NCBI, and both can also be downloaded from ftp://ftp.ncbi.nih.gov/blast/db for local BLAST. Here is a brief description of the online comparison method:

Open https://blast.ncbi.nlm.nih.gov/Blast.cgi and select the appropriate program according to the following table (image from the web):

You can then compare your sequences directly against the NT and NR libraries. If you have questions, see the help documentation:

ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf
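As a rough illustration of the program choice, the standard BLAST programs can be picked from the type of the query and the type of the database. The sketch below is only a lookup helper; the function name and the "nucleotide"/"protein" labels are our own assumptions, not part of any NCBI tool.

```python
# Minimal sketch: pick a BLAST program from the query and database types.
# The helper name and the type labels are illustrative assumptions.

BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",   # e.g. transcripts against NT
    ("protein", "protein"): "blastp",         # e.g. proteins against NR
    ("nucleotide", "protein"): "blastx",      # translated query, e.g. transcripts against NR
    ("protein", "nucleotide"): "tblastn",     # protein query against a translated database
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Return the BLAST program name for a query/database type pair."""
    if translate_both:
        # tblastx translates both a nucleotide query and a nucleotide database.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unknown combination: {query_type} vs {db_type}")


print(choose_blast_program("nucleotide", "protein"))
```

For transcriptome sequences this means blastn against NT and blastx against NR.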

These two databases are not described further here. For large-scale comparisons, we can provide high-quality services.
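For the local BLAST route, a command line for the BLAST+ tools might be assembled as in the sketch below. It assumes BLAST+ is installed and that the NR database has already been downloaded and is reachable under the name "nr"; the file names, the e-value cutoff, and the thread count are placeholder assumptions, not recommendations from the original article. The command is only printed here, not executed.

```python
import shlex


def build_blast_command(program, query_fasta, db_name, out_path,
                        evalue=1e-5, threads=4):
    """Build a BLAST+ argument list with tabular output (-outfmt 6)."""
    return [
        program,
        "-query", query_fasta,      # input FASTA file
        "-db", db_name,             # database name, e.g. nt or nr
        "-out", out_path,           # result file
        "-evalue", str(evalue),     # e-value cutoff (assumed value)
        "-outfmt", "6",             # tab-separated hit table
        "-num_threads", str(threads),
    ]


# Hypothetical file names, for illustration only.
cmd = build_blast_command("blastx", "transcripts.fasta", "nr",
                          "transcripts_vs_nr.tsv")
print(shlex.join(cmd))

# To actually run it (requires BLAST+ and the database on this machine):
# import subprocess
# subprocess.run(cmd, check=True)
```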

II. Swiss-Prot database

The latest version of Swiss-Prot (http://www.uniprot.org/uniprot/?query=*&fil=reviewed%3ayes) contains 554,515 protein sequences. As UniProt puts it: "It is a high quality annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions." These sequences are validated and annotated protein sequences with high reliability.

As you can see, the common species are listed on the left side of the page. To find protein sequence information for a particular species, you can either search directly at B or enter a query at A. In our work, UniProt proteins are often used as the protein library for analyzing iTRAQ projects (for transcriptome work, this is extra background knowledge). If you want the specific annotation of certain proteins of interest, the following method works:

Go to http://www.uniprot.org/uploadlists/ or click the corresponding link on the home page of the website

You will be taken to the following page:

Enter the UniProtKB AC or ID at A, or import a list directly at B (one per line), and proceed to the next step with the default settings. Some researchers may ask what a UniProtKB AC or ID is. A complete Swiss-Prot ID looks like this:

In general, sp indicates that the protein comes from Swiss-Prot. The part between the two | characters is the accession number (AC), referred to here as the ID number or entry number. The part after the second | generally consists of the gene name abbreviation plus "_" and the species name. You can search for the corresponding information using either the ID number or the part after the second |. Incidentally, open the list of input options.

As you can see, it supports a number of input formats, including the commonly used gene name, but the ID number is unique and lets you look up protein information precisely.
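The identifier layout described above (database prefix, ID number, then gene abbreviation plus "_" and species) can be split programmatically, for example to build the one-per-line list for the upload page. The sketch below is a minimal illustration; the function name and the returned field names are our own choices.

```python
def parse_uniprot_identifier(identifier):
    """Split an identifier such as 'sp|P04637|P53_HUMAN' into its parts."""
    # Accept a full FASTA header: drop the leading '>' and any description.
    token = identifier.lstrip(">").split()[0]
    parts = token.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected db|accession|entry_name, got: {token}")
    db, accession, entry_name = parts
    # The entry name is the gene abbreviation, '_', then the species part.
    gene, _, species = entry_name.partition("_")
    return {
        "database": db,            # 'sp' marks a Swiss-Prot entry
        "accession": accession,    # the unique ID number
        "entry_name": entry_name,
        "gene": gene,
        "species": species,
    }


record = parse_uniprot_identifier(">sp|P04637|P53_HUMAN Cellular tumor antigen p53")
print(record["accession"], record["gene"], record["species"])
```

Collecting the "accession" field of each header gives a list that can be pasted into the upload page, one ID per line.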

Of course, by clicking on the right you can also specify the database to map to, and query the relevant annotation information as needed.

After you click Go, the following screen appears:

When you tick one sequence, BLAST (1) lights up, and you can BLAST the ticked sequence against the UniProt sequences. When you tick multiple sequences, Align (2) lights up, and you can align the ticked sequences with each other. Download (3) downloads either all the information or only the ticked entries. Button 4 adds the selected proteins to the basket; each added protein is marked with a basket icon at the lower left, as shown at 7. The added proteins appear in the basket at the top right (6), and clicking 6 lets you run specific analyses on the collected sequences, which is not covered here.

The option to focus on is Columns (5). Click it and you will discover a new world.

In principle, information from every common annotation database you can think of can be found there, including GO, KEGG, sequence information, protein name, gene name, subcellular localization, Pfam information, and so on. The information is too extensive to list here; just tick the options you need and click Save in the upper right corner to collect the information.

In a typical transcriptome analysis, the Swiss-Prot annotation rate is second only to that of NR. Starting from the Swiss-Prot annotations, you can in fact obtain many other related annotations, such as GO, KEGG, and Pfam.

III. KEGG database

The KEGG database should also be familiar to everyone, so it is not described in detail here. You can refer to http://muchong.com/html/201009/2325769.html for a more in-depth understanding of KEGG. Here we only provide an online submission method for researchers who need KEGG annotations for a set of sequences:

1. Open http://www.genome.jp/kaas-bin/kaas_main and do the following:

2. After the job is uploaded, you will receive an email informing you that the job has been accepted:

3. Follow the instructions in the email. You will receive another message when the job is finished.

4. Open the link in the email and click the html link of the corresponding job:

5. Select Brite Hierarchies:

6. Select KEGG orthology (KO)

7. Select Download htext to download the file locally and keep the default file name (q00001.keg)

This gives you the KEGG annotations of your sequences. In general, about 10,000 sequences can be completed in 1-6 hours.
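Once the htext file is downloaded, the sequence-to-KO assignments can be pulled out with a short script. The sketch below assumes the usual htext layout, in which lines starting with C name a pathway and lines starting with D have the form "D  sequence_id; K number  description". Check your own file first, because this layout is an assumption here, and the sample lines and sequence names are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout of the htext lines (verify against your own .keg file).
C_LINE = re.compile(r"^C\s+(?P<pathway>.+?)\s*$")
D_LINE = re.compile(r"^D\s+(?P<query>[^;\s]+);\s+(?P<ko>K\d{5})\b")


def parse_keg(lines):
    """Map each query sequence to a sorted list of (KO, pathway) pairs."""
    annotations = defaultdict(set)
    pathway = ""
    for line in lines:
        line = line.rstrip("\n")
        c_match = C_LINE.match(line)
        if c_match:
            pathway = c_match.group("pathway")   # remember the current pathway
            continue
        d_match = D_LINE.match(line)
        if d_match:
            annotations[d_match.group("query")].add((d_match.group("ko"), pathway))
    return {query: sorted(pairs) for query, pairs in annotations.items()}


# Made-up sample in the assumed layout; seq_1 and seq_2 are hypothetical names.
SAMPLE = """\
A<b>Metabolism</b>
B  <b>Carbohydrate metabolism</b>
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      seq_1; K00844  HK; hexokinase [EC:2.7.1.1]
D      seq_2; K01810  GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
C    00500 Starch and sucrose metabolism [PATH:ko00500]
D      seq_1; K00844  HK; hexokinase [EC:2.7.1.1]
""".splitlines()

result = parse_keg(SAMPLE)
for query, pairs in sorted(result.items()):
    for ko, pathway in pairs:
        print(query, ko, pathway, sep="\t")

# For a real file: with open("q00001.keg") as handle: parse_keg(handle)
```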

IV. KOG database

KOG stands for euKaryotic Orthologous Groups, that is, clusters of orthologous groups of proteins built from complete eukaryotic genomes. The proteins that make up each KOG are assumed to derive from one ancestral protein and are therefore either orthologs or paralogs. Orthologs are proteins from different species that arose by vertical descent (speciation) and typically retain the same function as the original protein. Paralogs are proteins that arose by gene duplication within a species and may evolve new functions related to the original one. Database link: ftp://ftp.ncbi.nih.gov/pub/COG/KOG/kyva

This database has no online annotation submission method. However, if you read the section above carefully, you will find that KOG annotations can in fact be obtained through Swiss-Prot. If you did not notice this, please go back and look again.

V. STRING database

The STRING database (https://string-db.org/) is a system for searching known protein-protein interactions and predicting new ones. An interaction can be either a direct physical association or an indirect functional association between proteins. The predictions draw on biological information such as chromosomal neighborhood, phylogenetic profiles, gene fusion, and co-expression of genes or proteins in microarray data.

The latest STRING database is version 10.5, which contains 9,643,763 proteins from 2,031 species and 1,380,838,440 interactions. For local BLAST, you can download the protein sequences of a species: click Download, select the species, and download the interaction file *.protein.links.v10.5.txt.gz and the sequence file *.protein.sequences.v10.5.fa.gz.

For online queries, STRING is very convenient to use. You can query by gene name or by protein sequence. The protein sequence query is not explained further here; the gene name query is the one we use more often. For example, enter the gene symbols (one per line) and click Search.

The following page then appears:

The STRING website matches each input gene name to the genes of the chosen species in the database and ticks the best match by default. You can check whether the matches are correct; in the vast majority of cases there is no problem. Sometimes the input gene name differs slightly from the name stored in the database, in which case the output uses the name from the STRING website. When you are sure, click Continue. The more genes you enter, the slower this step and the more complex the graphs and tables that follow, so entering too many genes is not recommended.

On the page that appears, the upper part is the protein interaction network and the middle part holds several settings tabs, such as:

The interaction picture can be used directly, or you can export the data and draw it yourself with Cytoscape.

The Viewers tab shows the results in different ways. The default is usually the network view; if you are interested, click through the others.

The Legend tab holds the descriptions, including explanations of the icons, the lines, and the input.

The Settings tab controls options such as whether the lines show the type of interaction evidence or the confidence, the image format (PNG or SVG vector), and the minimum confidence score. With 0.4, all interactions scoring above 0.4 are shown; raising the value to 0.9 removes the low-confidence interactions and also makes the picture simpler and cleaner.

The Analysis tab annotates the input genes with GO and KEGG, runs the enrichment, and outputs the results.

The Exports tab outputs the result files, including the picture and the tables. As mentioned earlier, if you want to draw the network yourself in Cytoscape, click the TSV output format to get an interaction table that can be opened in Excel.

The Clusters tab clusters the input proteins. Clustering can also be done in Cytoscape with MCODE; see the editor's earlier article for details. As shown, the TSV output was plotted and clustered, and the different clusters are displayed in different colors.
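The minimum-confidence setting can also be applied offline to a downloaded *.protein.links file. The sketch below assumes the file is space-separated with a header line "protein1 protein2 combined_score" and that scores are stored as integers from 0 to 1000 (so 0.4 corresponds to 400). The sample rows use made-up scores and are only there to make the example runnable.

```python
def filter_links(lines, min_confidence=0.4):
    """Keep interactions whose combined score is at or above the cutoff."""
    cutoff = int(round(min_confidence * 1000))   # file scores are scaled by 1000
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] == "protein1":
            continue                              # skip header and odd lines
        protein_a, protein_b, score = fields[0], fields[1], int(fields[2])
        if score >= cutoff:
            kept.append((protein_a, protein_b, score / 1000))
    return kept


# Made-up sample rows in the assumed layout.
SAMPLE = [
    "protein1 protein2 combined_score",
    "9606.ENSP00000000233 9606.ENSP00000272298 490",
    "9606.ENSP00000000233 9606.ENSP00000253401 198",
    "9606.ENSP00000000233 9606.ENSP00000401445 943",
]

for cutoff in (0.4, 0.9):
    print(cutoff, len(filter_links(SAMPLE, cutoff)))

# For a real download (file name follows the pattern given above):
# import gzip
# with gzip.open("9606.protein.links.v10.5.txt.gz", "rt") as handle:
#     strong = filter_links(handle, 0.9)
```

The filtered pairs can be written out as a table and imported into Cytoscape in the same way as the TSV export.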

One more STRING file deserves a mention: the species.v10.5.txt file on the download page. The file is sorted by taxon_id, so you can look up the appropriate species in it. This concludes the introduction to STRING.
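Looking up a species in that file can be scripted. The sketch below assumes a tab-separated file whose header line starts with "#" and whose columns are the taxon_id, a STRING type, a compact name, and the official NCBI name. This column layout is an assumption, so check the header of your own copy. The sample rows are for illustration only.

```python
import csv


def find_species(lines, query):
    """Return (taxon_id, official_name) for species whose name contains query."""
    query = query.lower()
    hits = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue                      # skip blank lines and the header
        names = row[2:]                   # assumed: compact name, official name
        if any(query in name.lower() for name in names):
            hits.append((int(row[0]), row[-1]))
    return hits


# Sample rows in the assumed layout.
SAMPLE = [
    "## taxon_id\tSTRING_type\tSTRING_name_compact\tofficial_name_NCBI",
    "3702\tcore\tArabidopsis thaliana\tArabidopsis thaliana",
    "9606\tcore\tHomo sapiens\tHomo sapiens",
    "10090\tcore\tMus musculus\tMus musculus",
]

print(find_species(SAMPLE, "arabidopsis"))

# For a real file: with open("species.v10.5.txt") as handle: find_species(handle, "...")
```

The taxon_id found this way is the prefix of the species-specific download files and of the protein identifiers in them.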

VI. AnimalTFDB

AnimalTFDB (http://www.bioguo.org/AnimalTFDB/) is a database of animal transcription factors. It covers 71 transcription factor families in 50 species, including most animal model species such as human, pig, toad, and fruit fly. Its annotation information is based on Ensembl release 60 (ftp://ftp.ensembl.org/pub/release-60/gtf/), so the data can also be downloaded for local BLAST comparison. The database supports data retrieval in a variety of input formats.

VII. PlnTFDB

PlnTFDB (http://plntfdb.bio.uni-potsdam.de/v3.0/) is a database of plant transcription factors. It covers most plant model species: 84 transcription factor families in 20 species such as Arabidopsis and rice, with 28,193 protein models and 26,184 distinct protein sequences. It supports online BLAST, and the data can also be downloaded for local BLAST.

VIII. PRGdb

PRGdb (http://prgdb.crg.eu/wiki/Resistance_genes) is a database of plant resistance genes. It also supports local and online BLAST comparison. There is not much more to introduce about it, so it is only mentioned briefly here.
