Institute of Bioinformatics Münster
Genome comparison
Comparison of whole genomes


Summary

This section presents whole-genome comparisons between P. californicus and related ant species, computed with the LAST aligner (see Methods for more information). The Similarity Table displays the whole-genome alignments between pairs of genome assemblies.

Genome Assemblies

Query: Assembly 1 (Assembler) (Size of Assembly) | Target: Assembly 2 (Assembler) (Size of Assembly) | Mean of Identity [%] | Median of Identity [%] | Identical aligned nucleotide fraction of the query [%] | Aligned nucleotides of the query [%] | Number of gaps in the query [x 10^6] | Identical aligned nucleotide fraction of the target [%] | Aligned nucleotides of the target [%] | Number of gaps in the target [x 10^6] | Total alignment length [Mb]

Transcript Assemblies

Repeat Annotation

Species | Occupied by repeats | Interspersed Repeats | Low complexity | Simple repeats | Availability
Sub-columns of Interspersed Repeats: DNA | LINE | LTR | SINE | Unknown
P. californicus | 20.33 % | 7.70 % | 1.19 % | 3.45 % | 0.03 % | 2.96 % | 0.53 % | 3.95 % | Own Pipeline (see Repeat Annotation)
P. barbatus | 15.38 % | 2.5 % | 0.26 % | 1.39 % | 0.04 % | 3.75 % | 1.97 % | 1.71 % | Paper Supplement (Chapter 12, Table S8)
S. invicta | 11.04 % | [DNA] % | [LINE] % | [LTR] % | [SINE] % | [Unknown] % | [Low complexity] % | [Simple repeats] % | [Source]
C. floridanus | 32.6 % | [DNA] % | [LINE] % | [LTR] % | [SINE] % | [Unknown] % | [Low complexity] % | [Simple repeats] % | [Source]
A. mellifera | 7.78 % | 1.38 % | 0.19 % | 0.15 % | 0.01 % | 0.11 % | 5.7 % | Paper Supplement (Table S6)

Genome Annotation

Species | protein-coding genes | non-coding genes | Availability
Sub-columns of protein-coding and non-coding genes: mean of exons per gene | min. gene length [bp] | mean gene length [bp] | median gene length [bp] | max. gene length [bp] | CDS | duplicated CDS | genes with variants | tRNA | rRNA | lncRNA | snoRNA | snRNA | total
P. californicus | 5.86 | 48 | 2,115 | 1,524 | 50,390 | 15,688 | 899 | 1,700 | 1,180 | 55 | 931 (lncRNA sharing with P. barbatus) | 12 | 24 | 1,395 | See Annotation section
P. barbatus | 7.85 | 73 | 2,453 | 1,931 | 50,720 | 19,128 | 3,748 | 3,619 | 201 | 17 | 1,138 | 12 | 19 | 1,213 | NCBI Annotation report
S. invicta | 7.22 | 82 | 2,384 | 1,815 | 47,190 | 25,235 | 3,551 | 4,703 | 227 | 36 | 1,376 | 11 | 26 | 1,517 | NCBI Annotation report
C. floridanus | 8.12 | 68 | 2,936 | 2,130 | 58,720 | 23,971 | 5,665 | 5,000 | 208 | 30 | 1,243 | 13 | 22 | 1,158 | NCBI Annotation report
A. mellifera | 9.13 | 62 | 2,998 | 2,245 | 60,580 | 23,471 | 5,235 | 5,385 | 218 | 57 | 3,146 | 14 | 26 | 2,397 | NCBI Annotation report

Methods

The comparison of two genomes was performed with LAST. The weighted mean and median of identity and the aligned nucleotide fraction were calculated in the following steps:

Build the LAST database from the target assembly:
lastdb -P5 -uNEAR -R01 LASTDB-name Target-Assembly
Train LAST to find optimal scoring-matrix parameters:
last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 LASTDB-name Query-Assembly > last-train.mat
Align and filter for the unique best alignment of each part of the query:
lastal -m50 -E0.05 -P0 -C2 -p last-train.mat LASTDB-name Query-Assembly | last-split -m1 > last-split.maf
Get 1-to-1 alignments:
maf-swap last-split.maf | last-split -m1 | maf-swap > maf-swap.maf
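
The four steps above can be chained in a small wrapper. The following is only a sketch: the function name last_one_to_one and the DRY_RUN switch are hypothetical helpers introduced here for illustration, the LAST options are copied unchanged from the commands above, and file paths are assumed to contain no spaces. With DRY_RUN=1 the commands are printed instead of executed, which allows checking them before a long run.

```shell
# Sketch of a wrapper around the four LAST steps (hypothetical helper, not part of LAST).
last_one_to_one() {
    # $1 = target assembly (FASTA), $2 = query assembly (FASTA), $3 = database name
    # Set DRY_RUN=1 to print the commands instead of running them.
    target=$1 query=$2 db=$3
    set -- \
        "lastdb -P5 -uNEAR -R01 $db $target" \
        "last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 $db $query > last-train.mat" \
        "lastal -m50 -E0.05 -P0 -C2 -p last-train.mat $db $query | last-split -m1 > last-split.maf" \
        "maf-swap last-split.maf | last-split -m1 | maf-swap > maf-swap.maf"
    for cmd in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf '%s\n' "$cmd"          # show the command only
        else
            sh -c "$cmd" || return 1      # stop at the first failing step
        fi
    done
}

# Print the commands for placeholder file names without running LAST.
DRY_RUN=1 last_one_to_one Target-Assembly Query-Assembly LASTDB-name
```

Calling the function without DRY_RUN runs the steps in order and stops as soon as one of them fails.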

Calculate the identical aligned nucleotides of query and target:
Sum of alignment lengths - (sum of number of gaps + sum of number of mismatches)
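
As a worked illustration of this formula, the shell sketch below plugs in made-up numbers. None of the values come from a real alignment, and dividing by the total assembly size to obtain a percentage is an assumption: the text does not state which denominator was used for the Similarity Table.

```shell
# Hypothetical input values, chosen only to make the arithmetic easy to follow.
aln_len=1000000        # sum of alignment lengths
gaps=20000             # sum of number of gaps
mismatches=80000       # sum of number of mismatches
assembly_size=2000000  # assumed denominator: total assembly size in bp

# Sum of alignment lengths - (gaps + mismatches)
identical=$(( aln_len - (gaps + mismatches) ))

# awk handles the floating-point division for the percentage
fraction=$(LC_ALL=C awk -v i="$identical" -v s="$assembly_size" \
    'BEGIN { printf "%.2f", 100 * i / s }')

echo "identical aligned nucleotides: $identical"
echo "identical aligned fraction: $fraction %"
```

With these placeholder values, 900000 of the 2000000 bases are identical and aligned, i.e. 45.00 %.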

Calculate the aligned nucleotide fraction of the query (%):
Get the sum of alignment lengths (Prefix is a sequence-name prefix specific to the reference genome, e.g. "NC_"):
grep -v '^#\|^a\|^p' maf-swap.maf | grep Prefix | awk '{sum+=$4} END {print sum}'
Get the sum of the number of gaps (counted in the aligned sequence column only, so that "-" strand symbols are not included):
grep -v '^#\|^a\|^p' maf-swap.maf | grep Prefix | awk '{print $7}' | tr -cd '-' | wc -c
Generate BLAST tabular output (blast outfmt 6) from the LAST output:
last-postmask maf-swap.maf | maf-convert blasttab | awk -F'=' '$2 <= 1e-5' > last.blasttab
Get the sum of the number of mismatches:
awk '{sum+=$5} END {print sum}' last.blasttab
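
The text does not spell out how the weighted mean and median of identity were derived. One plausible reading, shown here purely as a sketch, is to weight each alignment's percent identity (column 3 of the BLAST tabular output) by its alignment length (column 4). The weighting scheme, the function name weighted_identity and the toy input rows are all assumptions made for illustration.

```shell
# Length-weighted mean and median of percent identity from BLAST tabular output.
# Assumption: weight = alignment length (column 4); identity = column 3.
weighted_identity() {
    # $1 = tab-separated BLAST tabular file (outfmt 6)
    grep -v '^#' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k3,3n | LC_ALL=C awk -F'\t' '
        { id[NR] = $3; w[NR] = $4; total += $4; wsum += $3 * $4 }
        END {
            if (total == 0) exit 1
            cum = 0
            # rows are sorted by identity: the weighted median is the first
            # identity at which the cumulative length reaches half the total
            for (i = 1; i <= NR; i++) {
                cum += w[i]
                if (cum >= total / 2) { median = id[i]; break }
            }
            printf "weighted mean: %.2f\nweighted median: %.2f\n", wsum / total, median
        }'
}

# Toy input with three made-up alignments (identity 90, 80, 100; length 100, 300, 100).
demo_tab=$(mktemp)
printf 'q1\tt1\t90.00\t100\t10\t0\t1\t100\t1\t100\t1e-20\t150\n'  > "$demo_tab"
printf 'q2\tt2\t80.00\t300\t60\t0\t1\t300\t1\t300\t1e-30\t300\n' >> "$demo_tab"
printf 'q3\tt3\t100.00\t100\t0\t0\t1\t100\t1\t100\t1e-40\t200\n' >> "$demo_tab"
weighted_identity "$demo_tab"
```

For the toy rows the long 80 % alignment dominates: the weighted mean is 86.00 and the weighted median is 80.00, whereas the unweighted median would be 90.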

Calculate the aligned nucleotide fraction of the target (%):
Get the sum of alignment lengths:
grep -v '^#\|^a\|^p' maf-swap.maf | grep -v Prefix | awk '{sum+=$4} END {print sum}'
Get the sum of the number of gaps (counted in the aligned sequence column only):
grep -v '^#\|^a\|^p' maf-swap.maf | grep -v Prefix | awk '{print $7}' | tr -cd '-' | wc -c
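
The per-genome sums of alignment length and gaps can also be collected in a single pass over the MAF file. The sketch below is an alternative formulation, not the commands that were actually run: maf_sums is a hypothetical helper, the prefix is matched only at the start of the sequence name, and the toy MAF records are invented. It relies on the standard layout of MAF "s" lines (name, start, size, strand, source size, aligned text).

```shell
# One-pass summary of a MAF file: aligned bases and gap characters per genome.
maf_sums() {
    # $1 = MAF file, $2 = sequence-name prefix that identifies one genome (e.g. NC_)
    awk -v prefix="$2" '
        $1 == "s" {
            side = (index($2, prefix) == 1) ? "prefix" : "other"
            len[side]  += $4                 # aligned bases, gaps excluded
            gaps[side] += gsub(/-/, "", $7)  # gap characters in the aligned text only
        }
        END {
            printf "prefix\t%d\t%d\n", len["prefix"], gaps["prefix"]
            printf "other\t%d\t%d\n",  len["other"],  gaps["other"]
        }' "$1"
}

# Toy MAF with two invented alignment blocks.
demo_maf=$(mktemp)
cat > "$demo_maf" <<'EOF'
# toy example, not real data
a score=10
s NC_001 0 8 + 100 ACGT-ACGT
s scaf1  5 9 + 200 ACGTTACGT

a score=12
s NC_001 20 6 + 100 ACGTAC
s scaf2   0 4 - 150 AC--AC
EOF
maf_sums "$demo_maf" NC_
```

Each output line lists the group, the summed alignment length and the number of gap characters; for the toy file these are 14 and 1 for the NC_ sequences and 13 and 2 for the others. The "-" strand symbol of scaf2 is not counted because only the aligned text column is inspected.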

A whole-genome comparison was also performed by generating dot plots with Gepard (GEnome PAir Rapid Dotter). Dot matrix analysis is a method for comparing two protein or nucleic acid sequences. Every dot in the dot plot represents an identical element of the compared sequences.

Two parameters were set:
Word length: the minimum length of identical subsequences required to create a hit in the dot plot.
Window size: the size of the window over which an average dot value is calculated.
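
To illustrate the word length parameter, the following sketch draws a tiny text dot plot with a naive all-against-all comparison. It is not how Gepard works internally, and it ignores the window size; the dotplot function and the two short sequences are made up for this example.

```shell
# Naive dot plot: one row per word start in sequence A, one column per word start in B.
# A "*" marks positions where the words of the given length are identical.
dotplot() {
    # $1 = sequence A (rows), $2 = sequence B (columns), $3 = word length
    awk -v a="$1" -v b="$2" -v w="$3" 'BEGIN {
        for (i = 1; i <= length(a) - w + 1; i++) {
            row = ""
            for (j = 1; j <= length(b) - w + 1; j++)
                row = row ((substr(a, i, w) == substr(b, j, w)) ? "*" : ".")
            print row
        }
    }'
}

# Two invented 7-base sequences compared with word length 3.
dotplot ACGTACG ACGTTCG 3
```

With word length 3 only the shared words ACG and CGT produce dots, including the repeated ACG at the end of the first sequence. Lowering the word length to 1 marks every single matching base and makes the plot much noisier, which is why a longer word is useful for whole genomes.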

References

  • Jan Krumsiek, Roland Arnold, Thomas Rattei; Gepard: a rapid and sensitive tool for creating dotplots on genome scale, Bioinformatics, Volume 23, Issue 8, 15 April 2007, Pages 1026–1028, https://doi.org/10.1093/bioinformatics/btm039
2020-11-18 22:14