Simple and useful
Identify repeats, align ESTs and proteins to the genome, and automatically synthesize these data into feature-rich gene annotations, including alternative splicing and UTRs, as well as attributes such as evidence trails and confidence measures.
Easily configurable and trainable
Its output formats must be both comprehensive and database ready.
Provide an easy means to annotate, view, and edit individual contigs and BACs. This allows users to analyze partial genome assemblies and to independently annotate regions of interest using their own data sets, ideally without the overhead of a database and with only minimal compute resources such as a laptop computer.
MAKER identifies repeats, aligns ESTs and proteins to a genome, makes gene predictions, and integrates these data into protein-coding gene annotations. Moreover, its outputs can be loaded directly into GMOD browsers and databases with no post-processing.
MAKER is not exhaustive: it does not identify noncoding RNA genes, nor is it intended as a comprehensive solution to every problem in genome annotation. Rather, MAKER is designed to jump-start genomics in emerging model organisms by providing a robust first round of database-ready protein-coding gene annotations.
We used MAKER on the genomes of both an established and an emerging model organism. Our results for the C. elegans genome demonstrate that the accuracy of MAKER on a model organism genome is comparable to that of other annotation pipelines, whereas our work on the S. mediterranea genome shows that MAKER provides an effective means to annotate an emerging genome and to create a genome database.
MAKER is ideal for smaller projects
MAKER can also be used to annotate individual contigs and BACs.
MAKER's structure:
MAKER overview. MAKER uses four external executables: RepeatMasker, BLAST, SNAP, and Exonerate. Actions corresponding to the five basic steps of automatic annotation are shown in red.
Step 1: Compute phase
A battery of sequence analysis programs is run on the input genomic sequence. The purpose of these computes is to identify and mask repeats and to assemble protein, EST, and mRNA alignments that will be used to inform MAKER's gene-annotation process, which is outlined in Steps 4 and 5 below. The default MAKER configuration uses four external programs: RepeatMasker (http://repeatmasker.org), BLAST (Altschul et al. 1990; Korf et al. 2003), Exonerate (Slater and Birney 2005), and SNAP (Korf 2004). Each is publicly available and free for academic use. All four programs are also easy to install and run on UNIX, Linux, and OS X.
Unless repeats are effectively masked, gene predictions and gene annotations will contain portions of transposons and viruses. MAKER uses a two-tier process to avoid this problem. First, RepeatMasker is used to screen the genome for low-complexity repeats; these are then "soft-masked," i.e., transformed to lowercase letters rather than to Ns. Soft masking excludes these regions from nucleating BLAST alignments (Korf et al. 2003) but leaves them available for inclusion in annotations, as many protein-coding genes contain runs of low-complexity sequence. MAKER also uses BLASTX together with an internal library of transposon and virally encoded proteins to identify mobile elements. This process has been shown to substantially improve repeat masking, as it identifies genome regions that are distantly related to the protein-coding portions of transposons and viruses; these tend to be missed by RepeatMasker's nucleotide-based alignment process, even when genome-specific repeat libraries are available (Smith et al. 2007). Repeat regions identified in this process are masked to Ns. MAKER performs all of these actions automatically.
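The difference between the two masking tiers can be illustrated with a short Python sketch. This is not MAKER's code: the function names and the interval inputs are hypothetical, and the intervals stand in for whatever RepeatMasker and the BLASTX search would report.

```python
def soft_mask(seq, intervals):
    """Lowercase the bases inside each 0-based, half-open (start, end) interval."""
    bases = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def hard_mask(seq, intervals):
    """Replace the bases inside each 0-based, half-open interval with N."""
    bases = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


def two_tier_mask(seq, low_complexity, mobile_elements):
    """Soft-mask low-complexity runs, then hard-mask mobile-element hits."""
    return hard_mask(soft_mask(seq, low_complexity), mobile_elements)


if __name__ == "__main__":
    genome = "ACGTAAAAAAAAGGCTTGACCATG"
    # Hypothetical coordinates: a poly-A run and a transposon-like hit.
    print(two_tier_mask(genome, [(4, 12)], [(16, 20)]))
    # prints ACGTaaaaaaaaGGCTNNNNCATG
```

The soft-masked bases keep their identity, so a later step can still include them in an annotation, whereas the bases masked to N are lost to every downstream program.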
BLAST is used throughout the compute phase, first for repeat identification with RepeatMasker (as described above) and then to identify ESTs, mRNAs, and proteins with significant similarity to the input genomic sequence. Because BLAST does not take splice sites into account, its alignments are only rough approximations. MAKER therefore uses Exonerate (Slater and Birney 2005), a splice-site-aware alignment algorithm, to realign, or polish, sequences following filtering and clustering (see Steps 2 and 3, below). Exonerate's ability to align both protein and nucleotide sequences to the genome makes it an economical choice for this task.
Step 2: Filter/cluster
Filtering consists of identifying and removing marginal predictions and sequence alignments on the basis of scores, percent identities, etc. Filtering criteria for each external executable are set by modifying the text-based maker_bopts.ctl file (see the Configuration README distributed with MAKER). New users are not expected to edit this file, but advanced users can use it to adjust MAKER's behavior. After filtering, the remaining data are clustered against the genomic sequence to identify overlapping alignments and predictions. Clustering has two purposes. First, it groups diverse computational results into a single cluster of data, all of which support the same gene or transcript. Second, it identifies redundant evidence. For example, highly expressed genes are often supported by hundreds, if not thousands, of identical ESTs. Clustering criteria are set in the maker_bopts.ctl file, which instructs MAKER to keep some maximum number of members within each cluster, sorted on some series of filtering attributes such as score or fraction of the hit sequence aligned. The default parameters are appropriate for most applications but can be easily modified.
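The filter-then-cluster idea can be sketched in a few lines of Python. This is an illustration only, not MAKER's implementation: the Hit fields, the threshold parameters, and the max_members argument are hypothetical names and do not correspond to actual maker_bopts.ctl options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    name: str
    start: int       # 0-based, inclusive genomic start
    end: int         # exclusive genomic end
    score: float
    identity: float  # percent identity, 0-100


def filter_hits(hits, min_score, min_identity):
    """Drop marginal hits that fall below either threshold."""
    return [h for h in hits if h.score >= min_score and h.identity >= min_identity]


def cluster_hits(hits, max_members):
    """Group hits that overlap on the genome; keep the best-scoring members."""
    clusters = []
    current, current_end = [], None
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if current and hit.start < current_end:
            # Overlaps the running cluster: extend it.
            current.append(hit)
            current_end = max(current_end, hit.end)
        else:
            if current:
                clusters.append(current)
            current, current_end = [hit], hit.end
    if current:
        clusters.append(current)
    # Redundant evidence is trimmed by keeping only the top-scoring members.
    return [
        sorted(c, key=lambda h: h.score, reverse=True)[:max_members]
        for c in clusters
    ]


if __name__ == "__main__":
    hits = [
        Hit("est1", 0, 100, 90, 99),
        Hit("est2", 50, 150, 80, 98),
        Hit("est3", 60, 140, 70, 97),
        Hit("prot1", 300, 400, 60, 85),
        Hit("weak", 310, 390, 10, 40),
    ]
    kept = filter_hits(hits, min_score=50, min_identity=80)
    for cluster in cluster_hits(kept, max_members=2):
        print([h.name for h in cluster])
```

Here the three overlapping ESTs collapse into one cluster trimmed to its two best members, and the weak hit is removed before clustering.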
Step 3: Polish
This step realigns BLAST hits using a second alignment algorithm to obtain greater precision at exon boundaries. MAKER uses Exonerate (Slater and Birney 2005) to realign matching and highly similar ESTs, mRNAs, and proteins to the genomic input sequence. Because Exonerate takes splice sites into account when generating its alignments, they provide MAKER with information about splice donors and acceptors. This information is especially useful in the synthesis and annotation steps (see below). The thresholds in the maker_bopts.ctl file that earmark BLAST hits for polishing are suitable for most applications but can be easily altered if desired (see the Configuration README distributed with MAKER).
Step 4: Synthesis
MAKER synthesizes information from the polished and clustered EST and protein alignments to produce evidence for annotations. To do so, it identifies ESTs that it judges to correspond to the same alternatively spliced transcript. This is accomplished by comparing the coordinates of each polished sequence alignment on the genomic sequence in much the same way that a human annotator might, e.g., by looking for internal exons with differing boundaries. Next, MAKER identifies those protein alignments whose coordinates are consistent with each of the EST splice forms. Once a set of EST and protein alignments, all consistent with the same spliced transcript, has been identified, positions on the genomic input sequence upstream and downstream of the alignments are labeled as possible intergenic regions. Those bases on the genomic sequence that fall between exons are labeled as putative introns, and bases overlapping the protein alignments are labeled as putative translated sequence. MAKER then calculates a score for each of these nucleotides on the query sequence based upon the percentage of similarity of the alignment, the type of alignment, and the query nucleotide's position within the alignment. These scores, together with their putative sequence types (e.g., intergenic, coding, intron, and UTR), are then passed to SNAP. Based upon this information, SNAP modifies its internal Hidden Markov Model (HMM). In the absence of any supporting EST or protein alignments, MAKER uses SNAP's ab initio prediction (for additional details, see Training MAKER).
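The labeling of genomic positions described above can be sketched as follows. This is a simplified illustration under stated assumptions, not MAKER's algorithm: the function names are hypothetical, the per-nucleotide scores are omitted, and treating exon bases without protein support as UTR is our own simplification.

```python
from itertools import groupby


def label_bases(seq_len, est_exons, protein_blocks):
    """Label each base of a sequence from one consistent set of alignments.

    Intervals are 0-based, half-open (start, end) pairs that lie within
    the sequence.
    """
    labels = ["intergenic"] * seq_len
    blocks = list(est_exons) + list(protein_blocks)
    if not blocks:
        return labels
    span_start = min(start for start, _ in blocks)
    span_end = max(end for _, end in blocks)
    # Everything inside the aligned span starts out as putative intron.
    for i in range(span_start, span_end):
        labels[i] = "intron"
    # Assumption: EST-supported exon bases are UTR unless a protein covers them.
    for start, end in est_exons:
        for i in range(start, end):
            labels[i] = "UTR"
    # Bases overlapping protein alignments are putative translated sequence.
    for start, end in protein_blocks:
        for i in range(start, end):
            labels[i] = "coding"
    return labels


def summarize(labels):
    """Collapse per-base labels into (label, start, end) runs."""
    runs, pos = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, pos, pos + length))
        pos += length
    return runs


if __name__ == "__main__":
    labels = label_bases(100, [(10, 30), (50, 80)], [(20, 30), (50, 70)])
    for run in summarize(labels):
        print(run)
```

With these hypothetical coordinates, the two flanks come out as intergenic, the gap between the exons as intron, and the exon ends lacking protein support as UTR.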
Step 5: Annotate
MAKER also post-processes the synthesis-generated SNAP predictions and recombines them with evidence to generate complete annotations. Each synthesis-generated SNAP prediction is checked against all ESTs and mRNAs, and 5′ and 3′ UTRs consistent with the prediction are identified based upon their coordinates relative to the predicted coding exons. The coordinates of the SNAP prediction are then altered to include these regions. This process is repeated for each of the synthesis-based predictions. Finally, compute evidence supporting each exon is added, and alternatively spliced forms are documented.
Additional details regarding MAKER's architecture and implementation can be found in the release materials. All MAKER source code is publicly available; the current release, along with installation instructions and documentation, is available at http://www.yandell-lab.org/maker.
Inputs and outputs
The input to MAKER is a genomic sequence (of any length) in FASTA format and three configuration files describing external executables, sequence database locations, and various compute parameters (see the Configuration README distributed with MAKER). MAKER also uses four sequence database files during the compute phase: a transposons file, an optional RepeatMasker database file, a proteins file, and an ESTs/mRNAs file. Each file is in FASTA format. The transposons file is bundled with MAKER and contains a selection of known transposon and virally encoded proteins. This file is used to identify and mask repeats missed by RepeatMasker, as this has been shown to substantially improve accuracy (Smith et al. 2007). In cases where no organism-specific repeat library is available, MAKER will automatically use the transposons file to mask mobile elements and the RepeatMasker program to identify and mask low-complexity sequences. The RepeatMasker file is an optional FASTA file containing organism-specific repeat sequences, if available. The proteins file contains any proteins users would like aligned to the genome. We recommend using the latest version of the Swiss-Prot database for this purpose (Bairoch and Apweiler 2000). Finally, users should also supply a file of EST and/or mRNA sequences derived from the organism being annotated. Assembling these into contigs is helpful but not required.
MAKER outputs GMOD-compliant annotations in GFF3 format (http://www.sequenceontology.org/gff3.shtml) containing alternatively spliced transcripts, UTRs, and evidence for each gene's annotated transcript and protein sequences. This file can be directly imported into genome browsers and databases that adhere to Sequence Ontology (Eilbeck et al. 2005) and GMOD (http://www.gmod.org) standards. For convenience, MAKER also outputs multi-FASTA files of transcripts and protein sequences for both annotations and ab initio SNAP predictions.
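For readers who have not worked with GFF3 before, the sketch below reads a single feature line in Python. It relies only on the general GFF3 layout (nine tab-separated columns, with the last holding semicolon-separated tag=value pairs); the function name and the example line are hypothetical and are not taken from real MAKER output.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments/blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[unquote(key)] = unquote(value)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF3 coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


if __name__ == "__main__":
    # Hypothetical feature line for illustration only.
    example = "contig1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mRNA-1;Parent=gene-1"
    feature = parse_gff3_line(example)
    print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```

For real projects an established parser (for example, the BioPerl loader mentioned later in this post) is a safer choice than a hand-rolled one.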
MAKER also writes a GAME XML file (http://www.fruitfly.org/annot/apollo/game.rng.txt) containing the same contents as the corresponding GFF3 file (http://www.sequenceontology.org/gff3.shtml); this file can be directly viewed in the Apollo genome browser (Figure 3) (Lewis et al. 2002). Apollo can also be used to directly edit annotations and to save them in GFF3 format; thus, changes to MAKER annotations can be saved prior to uploading them into a GMOD browser or database. Apollo can also directly export the revised transcripts and protein sequences in FASTA format. This is an especially useful feature for those seeking to annotate a single contig or BAC rather than an entire genome, as it circumvents the overhead associated with creating and maintaining a GMOD database. Figure 3 shows a portion of an annotated contig viewed in the Apollo genome browser. Compute evidence assembled by MAKER is shown in the top panel; its resulting annotation, below. This demonstrates how MAKER synthesizes data gathered by its compute pipeline into evidence-informed gene annotations; while SNAP produced two ab initio predictions in this region, the EST and protein alignments clearly support a single gene. Note too the 3′ UTR on the MAKER annotation, derived from the EST alignments.
The MAKER mRNA Quality Index
Compute data are essential for discriminating real genes from false positives. To simplify the quality evaluation process, each MAKER-annotated transcript has an associated quality index included in its GFF3 and GAME XML outputs. This is a nine-dimensional summary (Table 2) of a transcript's key features and how they are supported by the data gathered by MAKER's compute pipeline. The quality index associated with the mRNA shown in Figure 3 is qi:0|0.77|0.68|1|0.77|0.78|19|462|824.
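A quality index string like the one above is easy to split into its nine values. The Python sketch below does this; the function name is hypothetical, and the fields are kept positional because Table 2, which defines their meaning, is not reproduced in this post.

```python
def parse_quality_index(qi):
    """Split a 'qi:a|b|...' string into a tuple of nine numbers."""
    text = qi.strip()
    if text.lower().startswith("qi:"):
        text = text[3:]
    fields = text.split("|")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    # Values written with a decimal point become floats, the rest integers.
    return tuple(float(f) if "." in f else int(f) for f in fields)


if __name__ == "__main__":
    values = parse_quality_index("qi:0|0.77|0.68|1|0.77|0.78|19|462|824")
    print(values)
    # prints (0, 0.77, 0.68, 1, 0.77, 0.78, 19, 462, 824)
```

Once parsed, transcripts can be sorted or filtered on any position, e.g. `sorted(rows, key=lambda v: v[6])` to order a list of parsed tuples by their seventh value.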
Quality indices play a central role in training MAKER for a particular genome, where they are used to identify transcripts that are well supported by EST and protein evidence but poorly supported by ab initio SNAP predictions. These cases are used to retrain SNAP via the bootstrap procedure outlined below. MAKER's quality indices also provide an easy means to sort and rank transcripts by key features such as number of exons, presence or absence of UTR, or degree of computational support. Quality indices were used to assemble the HC S. mediterranea genes described in the Results section.
Training MAKER
For optimal accuracy, a gene finder must be trained for a specific genome (Korf 2004), generally using several hundred existing gene annotations drawn from a body of experimental data gathered over many years. Unfortunately, many emerging genomes do not have a history of experimental molecular biology. It has therefore become common practice to use gene finders trained on one genome to predict genes in another, a far from optimal solution to the problem (for discussion, see Korf 2004). Information gathered from ab initio predictions is essential for the annotation process, even if other evidence is available. Moreover, in the absence of experimental evidence and sequence similarities, the probabilistic models produced by ab initio gene prediction programs offer the best guesses at gene structure. The SNAP (Korf 2004) gene finder was designed from the outset to be easily configured for any genome; hence its use in MAKER.
MAKER is trained for a genome using a two-step process. First, SNAP is trained by aligning a set of universal genes to the input genome (Parra et al. 2007). These universal genes are highly conserved in all eukaryotes and can be identified using pairwise and profile-HMM alignment methods. The resulting gene structures are used to create a first-pass version of SNAP for use in the next stage of the training process. This initial stage of the training procedure is automated, and complete details of the process can be found in the MAKER README. More extensive documentation is provided by Parra et al. (2007).
The genome-specific HMM produced in the first stage of SNAP training is further refined in a second stage of training. This is accomplished by running MAKER on a few megabases of genomic sequence (enough to result in a few hundred annotations). The resulting GFF3 outputs are then used as inputs to a script called maker2zff.pl, whose output is a ZFF file that can be used to automatically create a revised HMM. The maker2zff.pl script uses the quality index MAKER attaches to each annotation to identify a set of gene models with intron-exon structures that are unambiguously supported by EST alignments and protein homology. These genes are then used to further refine the SNAP HMM. The maker2zff.pl script is bundled with MAKER, and the programs necessary to create the HMM are included in the SNAP package. To train MAKER for the S. mediterranea genome, we first trained SNAP using the universal gene set as outlined above. We then ran MAKER on a randomly selected 100-Mb portion of the S. mediterranea genome (∼10% of the entire genome). The resulting GFF3 files were used as inputs to maker2zff.pl, and the refined SNAP HMM was used in the final annotation run.
Downloading and installing MAKER
MAKER is available for download from http://www.yandell-lab.org/downloads/maker/maker.tar.gz. Once downloaded, the MAKER package should be unzipped and untarred. Full installation and usage instructions are available in the file called README.
Creating SmedGD
The GFF3 output files generated by MAKER were used to populate SmedGD. The files were uploaded into a MySQL database using a standard BioPerl (http://www.bioperl.org) loading script, bp_seqfeature_load.pl. This script converts GFF3-formatted annotations to Bio::SeqFeatureI objects, which are stored in the MySQL database. GBrowse, a tool distributed by GMOD (http://www.gmod.org) that implements a Bio::DB::SeqFeature::Store database adaptor, accesses and displays rows of data, or tracks, that are mapped to specific locations in the genome. SmedGD consists of MAKER annotations as well as project-specific features, such as additional protein homology, human-curated genes, and RNA interference phenotype data. The database is available at http://smedgd.neuro.utah.edu.
Actual use:
Waiting to be added ...
Freemao
Fafu
MAKER was published in Genome Res in 2008.