Source: X2yline's post on the Biotrainee forum http://www.biotrainee.com/thread-626-1-1.html
Ensembl (ensembl.org) is one of the most commonly used sources of reference genomes for eukaryotic organisms. It annotates genes automatically for species including human, mouse, zebrafish, pig, and rat, and it also incorporates manually annotated information from Havana.
Ensembl is a bioinformatics research project designed to develop a software system that can automatically annotate eukaryotic genomes (automatic annotation) and maintain those annotations. The project is run by the Wellcome Trust Sanger Institute in the UK and the European Bioinformatics Institute, which is part of the European Molecular Biology Laboratory.
Ensembl, the NCBI Map Viewer, and the UCSC Genome Browser are the most commonly used genome browsing and retrieval databases.
The main differences between Ensembl and the NCBI Map Viewer and UCSC come down to the following five points:
A. Ensembl's gene sets are built from mRNA and protein sequence evidence. The data sources are newly sequenced genomic data, UniProt/Swiss-Prot and UniProt/TrEMBL protein sequences, NCBI RefSeq DNA and protein sequences, and EMBL cDNA sequences.
B. Ensembl is an open-source, fully automated gene annotation software system (with a Perl API), and many websites use the Ensembl software system.
C. Ensembl provides BioMart. BioMart can search the genome according to user-defined conditions, and the retrieved results are returned in tabular form.
D. It integrates with other databases, for example through DAS.
E. It supports comparative analysis between genomes.
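To illustrate point C: a BioMart query is essentially a dataset name plus a set of filters (the conditions) and attributes (the columns to return). The sketch below only builds the XML text of such a query and does not contact any server. The dataset name `hsapiens_gene_ensembl` and the filter and attribute names are assumptions chosen for illustration and should be checked against the BioMart instance you actually use.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart XML query string (nothing is sent over the network)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # ask for a tab-separated result table
        "header": "1",        # include a header row
        "uniqueRows": "1",    # drop duplicate rows
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters restrict which records are returned.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    # Attributes select which columns appear in the result.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Hypothetical example: protein-coding genes on chromosome 21.
biomart_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"chromosome_name": "21", "biotype": "protein_coding"},
    attributes=["ensembl_gene_id", "external_gene_name"],
)
print(biomart_xml)
```

The same conditions can be set interactively on the BioMart web page; building the query as text is mainly useful when a search has to be repeated or scripted.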
Gene annotation organizations
Many organizations are currently involved in gene annotation; here are just a few of the more commonly used ones.
1. Ensembl: its goal is to produce the best possible set of gene annotations.
2. Havana (VEGA): the gene annotation group at the Sanger Institute. Its goal is the same as Ensembl's, so the two are also the most closely integrated.
3. HGNC: assigns a unique name and symbol to each human gene.
4. UniProt: concentrates mainly on the annotation of proteins.
Ensembl gene annotation generally comes in two kinds. The first is the Ensembl genebuild, which is automatic annotation: it is fast, kept up to date, and applicable to many different species. The second is the annotation from the Havana (VEGA) group at the Wellcome Trust Sanger Institute, which is manual annotation: it is slow but accurate, because it is based on already validated mRNA and protein sequences, and it is therefore more time-consuming. As a result, the Ensembl genome database contains two types of annotation.
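One place where the two kinds of annotation become visible is the second (source) column of a GTF file. The sketch below counts transcript records per source. The three sample lines are invented for illustration, and the source labels `ensembl`, `havana`, and `ensembl_havana` are assumed to follow the convention used in Ensembl GTF releases.

```python
from collections import Counter

# Invented sample records in GTF layout (9 tab-separated columns).
SAMPLE_GTF = "\n".join([
    '1\thavana\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tensembl\ttranscript\t100\t950\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    '2\tensembl_havana\ttranscript\t50\t400\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
])


def count_sources(gtf_text, feature="transcript"):
    """Count records of one feature type per annotation source (column 2)."""
    counts = Counter()
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) == 9 and fields[2] == feature:
            counts[fields[1]] += 1
    return counts


source_counts = count_sources(SAMPLE_GTF)
print(dict(source_counts))
```

Under the assumed convention, `ensembl_havana` marks records where the automatic and the manual annotation have been merged.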
The Havana (VEGA) group's annotations usually fall into the following categories:
More information: http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html
Protein coding: contains an open reading frame (ORF).
Processed transcript: does not contain an open reading frame (ORF).
Pseudogene: a pseudogene is a fragment of DNA whose base sequence is very similar to that of a known gene. However, a frameshift or nonsense mutation has destroyed the ORF, so the fragment cannot perform the function of the original gene, that is, it cannot produce a protein.
IG gene: immunoglobulin gene
TR gene: T-cell receptor gene
TEC (to be experimentally confirmed)
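In annotation files these categories usually appear as finer-grained biotype labels, which can be grouped back into the broad categories above. The following is a minimal sketch of such a grouping. The prefix and suffix rules (for example, that labels starting with `IG_` belong to the IG gene group) and the sample labels are simplifying assumptions, not the official VEGA or Ensembl definition.

```python
def biotype_group(biotype):
    """Map a fine-grained biotype label to one of the broad categories."""
    # Pseudogenes are checked first so that a label such as
    # IG_V_pseudogene is not counted as a functional IG gene.
    if biotype.endswith("pseudogene"):
        return "pseudogene"
    if biotype.startswith("IG_"):
        return "IG gene"
    if biotype.startswith("TR_"):
        return "TR gene"
    if biotype in ("protein_coding", "processed_transcript", "TEC"):
        return biotype
    return "other"  # anything this sketch does not cover


examples = ["protein_coding", "IG_V_gene", "TR_J_gene",
            "IG_V_pseudogene", "processed_pseudogene", "TEC"]
for label in examples:
    print(label, "->", biotype_group(label))
```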
The GTF files for the human and mouse genomes are the same as the gene sets released by the GENCODE project.
The goal of the GENCODE project is to provide high-quality annotation, backed by experimental validation, of the human and mouse genomes.
The GENCODE gene sets are widely used as a reference by other projects (e.g., 1000 Genomes).
More information: https://www.gencodegenes.org/about.html
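Because these gene sets are distributed as GTF, a short sketch of reading one record may help. GTF has nine tab-separated columns, the last of which holds `key "value";` pairs. The sample line below is invented, and the attribute names (`gene_id`, `gene_name`, `gene_biotype`) are assumptions for illustration; the exact attribute names can differ between distributions, so check the file you are working with.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

# Matches one `key "value";` pair in the attribute column.
# Unquoted values are ignored in this sketch, and a repeated
# key keeps only its last value.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse one GTF record into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])
    record["attributes"] = dict(ATTR_RE.findall(record["attributes"]))
    return record


# Invented example record, not taken from a real release.
line = ('21\thavana\tgene\t1000\t5000\t.\t+\t.\t'
        'gene_id "GENE0001"; gene_name "EXAMPLE1"; gene_biotype "protein_coding";')
rec = parse_gtf_line(line)
print(rec["attributes"]["gene_name"], rec["end"] - rec["start"] + 1)
```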
Files with the abinitio extension are generated using Genscan and other ab initio gene prediction tools.
They are the annotation files for predicted genes.
Reprint: Ensembl Project and Database