Institute of Bioinformatics Münster


Genome comparison

Comparison of whole genomes


Summary

This section presents whole-genome comparisons between P. californicus and related ants, computed with the LAST aligner (see Methods for more information). The Similarity Table summarizes the whole-genome alignment between each pair of genome assemblies.

Genome Assemblies

Query: Assembly 1 (Assembler) (Size of Assembly)	Target: Assembly 2 (Assembler) (Size of Assembly)	Mean of Identity [%]	Median of Identity [%]	Identical aligned nucleotide fraction of the query [%]	Aligned nucleotides of the query [%]	Number of gaps in the query [x 10⁶]	Identical aligned nucleotide fraction of the target [%]	Aligned nucleotides of the target [%]	Number of gaps in the target [x 10⁶]	Total alignment length [Mb]

Transcript Assemblies

Repeat Annotation

Species	Occupied by repeats	Interspersed Repeats					Low complexity	Simple repeats	Availability
Species	Occupied by repeats	DNA	LINE	LTR	SINE	Unknown	Low complexity	Simple repeats	Availability
P. californicus	20.33 %	7.70 %	1.19 %	3.45 %	0.03 %	2.96 %	0.53 %	3.95 %	Own Pipeline (see Repeat Annotation)
P. barbatus	15.38 %	2.5 %	0.26 %	1.39 %	0.04 %	3.75 %	1.97 %	1.71 %	Paper Supplement (Chapter 12, Table S8)
S. invicta	11.04 %	[DNA] %	[LINE] %	[LTR] %	[SINE] %	[Unknown] %	[Low complexity] %	[Simple repeats] %	[Source]
C. floridanus	32.6 %	[DNA] %	[LINE] %	[LTR] %	[SINE] %	[Unknown] %	[Low complexity] %	[Simple repeats] %	[Source]
A. mellifera	7.78 %	1.38 %	0.19 %	0.15 %	0.01 %	0.11 %	5.7 %		Paper Supplement (Table S6)

Genome Annotation

Species	protein-coding genes								non-coding genes						Availability
Species	mean of exons per gene	min. gene length [bp]	mean gene length [bp]	median gene length [bp]	max. gene length [bp]	CDS	duplicated CDS	genes with variants	tRNA	rRNA	lncRNA	snoRNA	snRNA	total	Availability
P. californicus	5.86	48	2,115	1,524	50,390	15,688	899	1,700	1,180	55	931 lncRNA sharing with P. barbatus	12	24	1,395	See Annotation section
P. barbatus	7.85	73	2,453	1,931	50,720	19,128	3,748	3,619	201	17	1,138	12	19	1,213	NCBI Annotation report
S. invicta	7.22	82	2,384	1,815	47,190	25,235	3,551	4,703	227	36	1,376	11	26	1,517	NCBI Annotation report
C. floridanus	8.12	68	2,936	2,130	58,720	23,971	5,665	5,000	208	30	1,243	13	22	1,158	NCBI Annotation report
A. mellifera	9.13	62	2,998	2,245	60,580	23,471	5,235	5,385	218	57	3,146	14	26	2,397	NCBI Annotation report

Methods

The comparison of two genomes was performed with LAST. The weighted mean and median of identity and the aligned nucleotide fraction were calculated with the following steps:

Building the LAST database:

lastdb -P5 -uNEAR -R01 LASTDB-name Target-Assembly

Training LAST for optimal scoring-matrix parameters:

last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 LASTDB-name Query-Assembly > last-train.mat

Aligning and filtering for unique best alignments of the query:

lastal -m50 -E0.05 -P0 -C2 -p last-train.mat LASTDB-name Query-Assembly | last-split -m1 > last-split.maf

Get 1-to-1 alignments:

maf-swap last-split.maf | last-split -m1 | maf-swap > maf-swap.maf

Calculate the aligned nucleotide fraction of query and target:
sum of alignment lengths - (sum of gaps + sum of mismatches)
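
As a worked example of the formula above, the sketch below turns the three sums into a percentage. Dividing by the assembly size is an assumption (the page does not state the denominator), and the helper name `aligned_fraction` and all numbers are made up for illustration.

```shell
# aligned_fraction ALN_LEN GAPS MISMATCHES ASSEMBLY_SIZE
# Hypothetical helper: applies  ALN_LEN - (GAPS + MISMATCHES)  and expresses
# the result as a percentage of ASSEMBLY_SIZE (assumed denominator).
aligned_fraction() {
    awk -v len="$1" -v gaps="$2" -v mm="$3" -v size="$4" \
        'BEGIN { printf "%.2f\n", 100 * (len - (gaps + mm)) / size }'
}

# Toy numbers: 1000 aligned positions, 20 gaps, 30 mismatches, 2000 bp assembly.
aligned_fraction 1000 20 30 2000   # prints 47.50
```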

Calculate the aligned nucleotide fraction of the query (%):
Get the sum of alignment lengths (Prefix is specific to the reference genome, e.g. "NC_"):

grep -v '^#\|^a\|^p' maf-swap.maf | grep Prefix | awk '{sum+=$4} END {print sum}'

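Get the sum of gaps (counting "-" only in the sequence column, so that strand symbols and sequence names are not included):

grep -v '^#\|^a\|^p' maf-swap.maf | grep Prefix | awk '{n += gsub(/-/, "", $7)} END {print n+0}'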

Generate blast tab output from LAST output (blast outfmt 6):

last-postmask maf-swap.maf | maf-convert blasttab | awk -F'=' '$2 <= 1e-5' > last.blasttab

Sum of mismatches:

awk '{sum+=$5} END {print sum}' last.blasttab
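
The same blasttab file carries the percent identity (column 3) and alignment length (column 4) of each alignment, which is enough to derive a weighted mean and median of identity. The page does not say how the weighting was done, so the sketch below assumes that each alignment is weighted by its length. The function names and the toy values are hypothetical.

```shell
# Weighted mean of percent identity (blasttab column 3), using the alignment
# length (column 4) as weight. The choice of weight is an assumption.
weighted_mean_identity() {
    awk -F'\t' '!/^#/ { w += $4; s += $3 * $4 }
                END   { if (w > 0) printf "%.2f\n", s / w }' "$1"
}

# Weighted median: sort by identity and report the identity at which the
# cumulative alignment length first reaches half of the total length.
weighted_median_identity() {
    grep -v '^#' "$1" | sort -t "$(printf '\t')" -k3,3n |
        awk -F'\t' '{ id[NR] = $3; len[NR] = $4; total += $4 }
                    END { for (i = 1; i <= NR; i++) {
                              cum += len[i]
                              if (cum >= total / 2) { print id[i]; exit }
                          } }'
}

# Toy blasttab file with two alignments (values are made up).
tab=$(mktemp)
printf 'q1\tt1\t90.00\t100\t10\t0\t1\t100\t1\t100\n'  >  "$tab"
printf 'q2\tt2\t100.00\t300\t0\t0\t1\t300\t1\t300\n' >> "$tab"

weighted_mean_identity "$tab"     # prints 97.50
weighted_median_identity "$tab"   # prints 100.00
```

Weighting by length keeps many short, low-identity alignments from dominating the summary; an unweighted mean of the same toy data would be 95.00.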

Calculate the aligned nucleotide fraction of the target (%):
Get the sum of alignment lengths:

grep -v '^#\|^a\|^p' maf-swap.maf | grep -v Prefix | awk '{sum+=$4} END {print sum}'

Get the sum of gaps (counting "-" only in the sequence column):

grep -v '^#\|^a\|^p' maf-swap.maf | grep -v Prefix | awk '{n += gsub(/-/, "", $7)} END {print n+0}'
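
To make the two sums concrete, the sketch below runs them on a tiny hand-written MAF file. In an "s" line, column 4 is the number of aligned bases and column 7 is the aligned sequence including gap characters. The file contents, the prefixes "NC_" and "scaf", and the function names are all invented for illustration.

```shell
# Sum of aligned lengths (MAF column 4) over 's' lines whose name starts with PREFIX.
aligned_length() {
    awk -v p="$2" '$1 == "s" && index($2, p) == 1 { sum += $4 }
                   END { print sum + 0 }' "$1"
}

# Number of gap characters in the sequence column (7) of the same lines.
gap_count() {
    awk -v p="$2" '$1 == "s" && index($2, p) == 1 { n += gsub(/-/, "", $7) }
                   END { print n + 0 }' "$1"
}

# Toy MAF file with two alignment blocks (made-up data).
maf=$(mktemp)
cat > "$maf" <<'EOF'
# toy example
a score=30
s NC_1   10 8 + 100 ACGT-ACGT
s scaf7   5 9 - 200 ACGTTACGT

a score=20
s NC_2    0 7 +  50 ACG--TAC-A
s scaf9  20 9 +  80 ACGTTTACG-
EOF

aligned_length "$maf" NC_    # prints 15
gap_count      "$maf" NC_    # prints 4
aligned_length "$maf" scaf   # prints 18
gap_count      "$maf" scaf   # prints 1
```

Note that the "-" strand symbol on the scaf7 line is not counted, because only column 7 is inspected.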

A whole-genome comparison was performed by generating dot plots with Gepard (GEnome PAir - Rapid Dotter; Krumsiek et al., 2007). Dot matrix analysis is a method for comparing two protein or nucleic acid sequences. Every dot in the dot plot represents an identical element of the compared sequences.

Two parameters were set:
Word length: the minimum length of identical subsequences required to create a hit in the dot plot.
Window size: the size of the window over which an average dot value is calculated.
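
The effect of the word length can be illustrated with a naive word-match dot plot. This is only a sketch of the general idea, not Gepard's implementation (it compares every pair of positions and ignores the window size); the function name `dotplot_hits` and the two short sequences are hypothetical.

```shell
# dotplot_hits WORD_LENGTH SEQ_A SEQ_B
# Prints the 1-based positions "i j" at which the word of length WORD_LENGTH
# starting at i in SEQ_A is identical to the word starting at j in SEQ_B.
dotplot_hits() {
    awk -v w="$1" -v a="$2" -v b="$3" 'BEGIN {
        for (i = 1; i + w - 1 <= length(a); i++)
            for (j = 1; j + w - 1 <= length(b); j++)
                if (substr(a, i, w) == substr(b, j, w))
                    print i, j
    }'
}

dotplot_hits 3 ACGTAC GTACGT
# prints:
# 1 3
# 2 4
# 3 1
# 4 2
```

A larger word length removes short chance matches, at the cost of missing similarity that is interrupted by mismatches.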

References

Jan Krumsiek, Roland Arnold, Thomas Rattei; Gepard: a rapid and sensitive tool for creating dotplots on genome scale, Bioinformatics, Volume 23, Issue 8, 15 April 2007, Pages 1026–1028, https://doi.org/10.1093/bioinformatics/btm039

2020-11-18 22:14
