Institute of Bioinformatics Münster
GeMoMa prediction
GeMoMa annotation based on predictions from homologous species

Summary

Results of the homology-based genome annotation for P. californicus.
Target genome | Reference genome | Number of predicted genes
P. californicus | A. mellifera | 711
P. californicus | C. floridanus | 2,207
P. californicus | S. invicta | 3,151
P. californicus | P. barbatus | 8,944
P. californicus | Summary of unique predictions from A. mellifera, C. floridanus, S. invicta, and P. barbatus | 15,013

Description

The Gene Model Mapper (GeMoMa) is an annotation pipeline that finds protein-coding genes in a target genome by using the genome of an already annotated, related species as a reference. First, it extracts the protein-coding exons from a well-annotated reference genome (see Figure 1). The individual exons are then matched to locations on the target genome with tblastn. GeMoMa then tries to fit the models resulting from the tblastn run to the target genome, using additional RNA sequence data from the target species. The models produced in this step can vary considerably depending on the quality and completeness of the reference annotation. Four species related to P. californicus (A. mellifera, C. floridanus, S. invicta, and P. barbatus; see Overview) were used as reference genomes (see the table in the Summary). In the final step, the runs for the different reference genomes are merged and filtered with a final annotation filter. Finally, identical duplicate genes are removed from the resulting annotation.

GeMoMa pipeline overview
Figure 1: Overview of the algorithm of the GeMoMa pipeline. Blue items represent input data sets, green boxes represent GeMoMa modules, and grey boxes represent external modules. [Source: https://doi.org/10.1093/nar/gkw092]
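
The last two steps of the description (merging the per-reference runs and removing identical duplicate genes) can be illustrated with a minimal Python sketch. This is not GeMoMa's own annotation filter; it only shows one plausible way to treat two gene models as identical, namely when they share contig, strand, and exon coordinates. The `GeneModel` type, the `merge_unique` function, the scaffold names, and the coordinates are all hypothetical, and ordering the runs from the closest to the most distant reference is an assumption made for this example.

```python
from typing import NamedTuple


class GeneModel(NamedTuple):
    """Hypothetical, simplified gene model (not a GeMoMa data structure)."""
    reference: str   # reference species the model was derived from
    contig: str      # contig or scaffold of the target genome
    strand: str      # "+" or "-"
    exons: tuple     # tuple of (start, end) coordinate pairs


def merge_unique(runs):
    """Merge per-reference prediction lists and drop identical duplicates.

    Two models count as identical if contig, strand, and the set of exon
    coordinates agree. The first copy encountered is kept, so the order
    of `runs` decides which reference a shared model is attributed to.
    """
    seen = set()
    merged = []
    for models in runs:
        for model in models:
            key = (model.contig, model.strand, tuple(sorted(model.exons)))
            if key not in seen:
                seen.add(key)
                merged.append(model)
    return merged


# Made-up example data: the first S. invicta model duplicates the
# P. barbatus model (same exons, listed in a different order).
runs = [
    [GeneModel("P. barbatus", "scaffold1", "+", ((100, 200), (300, 400)))],
    [
        GeneModel("S. invicta", "scaffold1", "+", ((300, 400), (100, 200))),
        GeneModel("S. invicta", "scaffold2", "-", ((10, 50),)),
    ],
]
merged = merge_unique(runs)
for model in merged:
    print(model.reference, model.contig, model.strand, model.exons)
```

Because the first copy of each model is kept, a model found with several references is counted only once, which matches the idea of a summary of unique predictions.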

Results

The final GeMoMa gene prediction is based on reference species that are related to P. californicus to different degrees. Among the ant references, the closest relative of P. californicus is P. barbatus, followed by S. invicta and then C. floridanus; A. mellifera, a bee rather than an ant, contributes the fewest predictions (711). The smaller the phylogenetic distance between a reference species and P. californicus, the more predictions in the final annotation are based on that reference.
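
To make this relationship concrete, the following short Python sketch computes each reference's share of the final annotation. It uses only the counts from the summary table; the variable names are chosen for illustration.

```python
# Number of predicted genes per reference genome, taken from the summary table.
predicted = {
    "A. mellifera": 711,
    "C. floridanus": 2207,
    "S. invicta": 3151,
    "P. barbatus": 8944,
}

total = sum(predicted.values())

# Share of the final annotation contributed by each reference, in percent.
shares = {ref: round(100 * n / total, 1) for ref, n in predicted.items()}

# Print the references from the largest to the smallest contribution.
for ref, n in sorted(predicted.items(), key=lambda item: item[1], reverse=True):
    print(f"{ref:15s} {n:6,d} {shares[ref]:5.1f} %")
print(f"{'Total':15s} {total:6,d}")
```

The per-reference counts add up to the 15,013 unique predictions reported in the summary, and P. barbatus alone accounts for roughly 60 % of them.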

Download GeMoMa annotation: GeMoMa_annotation.gff

Download Protein sequences: GeMoMa_proteins.fasta

Download Transcript annotation: GeMoMa_transcript.fasta
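
After downloading, the annotation file can be inspected with a few lines of Python. The sketch below counts the feature types in the third column of a GFF file. It assumes the standard nine-column, tab-separated GFF layout and makes no assumption about which feature types GeMoMa writes. The function name `count_feature_types` and the example lines are made up for illustration.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count the feature types (column 3) in an iterable of GFF lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip empty lines and comment or directive lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # A valid GFF feature line has nine tab-separated columns.
        if len(columns) >= 9:
            counts[columns[2]] += 1
    return counts


# Made-up example lines; real input would come from the downloaded file:
#     with open("GeMoMa_annotation.gff") as handle:
#         print(count_feature_types(handle))
example = [
    "##gff-version 3",
    "scaffold1\texample\tgene\t100\t400\t.\t+\t.\tID=gene1",
    "scaffold1\texample\tCDS\t100\t200\t.\t+\t0\tParent=gene1",
    "scaffold1\texample\tCDS\t300\t400\t.\t+\t1\tParent=gene1",
    "scaffold2\texample\tgene\t10\t50\t.\t-\t.\tID=gene2",
]
counts = count_feature_types(example)
print(dict(counts))
```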

References

  • Jens Keilwagen, Michael Wenk, Jessica L. Erickson, Martin H. Schattat, Jan Grau, Frank Hartung; Using intron position conservation for homology-based gene prediction, Nucleic Acids Research, Volume 44, Issue 9, 19 May 2016, Page e89, https://doi.org/10.1093/nar/gkw092
  • Jens Keilwagen, Frank Hartung, Michael Paulini, Sven O. Twardziok, Jan Grau; Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi, BMC Bioinformatics, 2018, https://doi.org/10.1186/s12859-018-2203-5
2020-11-18 21:33