Prediction | Number of sequences |
---|---|
Proteins | 15,688 |
Unique proteins | 15,160 |
Transcripts | 15,688 |
Unique transcripts | 15,518 |
MAKER 2 is a genome annotation pipeline for smaller eukaryotic and prokaryotic genomes. The MAKER algorithm combines several programs (see figure 1): it masks repetitive elements in the genome (RepeatMasker), aligns protein and RNA evidence to the genome assembly (BLAST), includes data from relatives, and identifies splice sites (Exonerate). The repeat annotation step was skipped here because repeats had already been masked with a custom-designed repeat database (see Repeat masking section). MAKER uses the gene prediction programs Augustus and Semi-HMM-based Nucleic Acid Parser (SNAP) as gene finders. Augustus predicts genes in eukaryotic genomic sequences with a Generalized Hidden Markov Model (GHMM) and can take alignments of Expressed Sequence Tags (ESTs) and proteins into account. SNAP, like Augustus, is based on Hidden Markov Models (HMMs), but uses these models to calculate similarities of intron length probabilities. Additionally, MAKER uses tRNAscan for the detection of tRNAs.

The Annotation Edit Distance (AED) was developed (Eilbeck et al., 2009; Holt and Yandell, 2011; Yandell and Ence, 2012) to measure how well the annotations produced by MAKER are supported by evidence. The AED is a number between 0 and 1, where 0 means the entire gene is covered by RNA-matching regions and 1 means the annotation has no evidence support.

MAKER can be trained by itself and by other programs like Augustus or SNAP. The data from GeMoMa as well as the data from relatives were handed to the pipeline (see figure 2). For the first MAKER run, the Nasonia reference model was used for Augustus gene detection. MAKER predictions from this run with an AED smaller than 0.25 were handed to SNAP to generate more accurate gene models for training MAKER in the next round. This was done to reduce false positive protein predictions. The third MAKER round used a BUSCO-trained Augustus reference model based on single-copy orthologous genes from Hymenoptera clades in the P. californicus assembly. The GHMM from this Augustus training was used to train MAKER in the fourth and last round.
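The AED cutoff used between MAKER rounds can be illustrated with a small filter over a GFF3 file. This is a minimal sketch, not the script used for this annotation: it assumes that each mRNA feature carries an `_AED` attribute in column 9 (as MAKER writes in its GFF3 output), and the function name `select_low_aed_mrnas` and the example records are made up for illustration.

```python
def select_low_aed_mrnas(gff_lines, max_aed=0.25):
    """Return the IDs of mRNA features whose _AED is below max_aed.

    Assumption: AED is stored as `_AED=<float>` in the GFF3 attribute column.
    """
    selected = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section ends the feature table
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only mRNA features are considered here
        # Parse "key=value;key=value" attributes into a dictionary.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "_AED" in attrs and float(attrs["_AED"]) < max_aed:
            selected.append(attrs.get("ID", ""))
    return selected


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "scaffold1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=gene1",
    "scaffold1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-RA;Parent=gene1;_AED=0.05",
    "scaffold1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-RB;Parent=gene1;_AED=0.40",
]
print(select_low_aed_mrnas(example))  # ['gene1-RA']
```

The cutoff is strict (`<`), matching "smaller than 0.25" in the text. A looser `max_aed` keeps more models but admits weaker evidence support.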
Description: The ending of the protein ID (RB, RC, RD, etc.) denotes the isoform of the protein-coding gene. The mRNA feature in the GFF file refers to the gene and its region in the genome.
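The isoform naming described above can be used to group protein IDs by gene. The following is a rough sketch under the assumption that IDs have the form `<gene>-R<letters>`. The `PCAL_` prefix and the function name `group_isoforms` are hypothetical and may not match the IDs in the downloadable files.

```python
import re
from collections import defaultdict

# Assumed ID layout: gene name, then "-R", then one or more isoform letters.
ISOFORM_SUFFIX = re.compile(r"^(?P<gene>.+)-R(?P<isoform>[A-Z]+)$")


def group_isoforms(protein_ids):
    """Map each gene name to the list of isoform letters found for it."""
    genes = defaultdict(list)
    for protein_id in protein_ids:
        match = ISOFORM_SUFFIX.match(protein_id)
        if match:
            genes[match.group("gene")].append(match.group("isoform"))
        else:
            # IDs without the assumed suffix are kept under their full name.
            genes[protein_id].append("")
    return dict(genes)


# Hypothetical IDs, for illustration only.
ids = ["PCAL_000001-RA", "PCAL_000001-RB", "PCAL_000002-RA"]
print(group_isoforms(ids))
# {'PCAL_000001': ['A', 'B'], 'PCAL_000002': ['A']}
```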
Download FASTA of unique transcripts (15,518 sequences): P.cal transcript sequences.fasta
Download FASTA of proteins corresponding to the unique transcripts (15,518 sequences): P.cal protein sequences.fasta
Download GFF with the structural annotation of the unique transcripts: P.cal final predictions.gff
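After downloading, the sequence counts can be checked with a few lines of Python. This is a sketch only: it assumes that "unique" means distinct sequence strings (ignoring case), which the page does not state explicitly, and the helper names `read_fasta` and `count_sequences` are illustrative.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(lines):
    """Return (total records, number of distinct sequences)."""
    total = 0
    distinct = set()
    for _, sequence in read_fasta(lines):
        total += 1
        distinct.add(sequence.upper())
    return total, len(distinct)


# Tiny made-up example: three records, two of them identical.
example_fasta = [
    ">seq1", "MKT", "LLV",
    ">seq2", "MKTLLV",
    ">seq3", "MAAA",
]
print(count_sequences(example_fasta))  # (3, 2)
```

For a real file, pass an open file handle instead of the list, for example `count_sequences(open(path))`, where `path` points to the downloaded FASTA.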