BIOINFO 1 - Problem set 2

1. Based on the codon usage frequencies, check whether the given sequence could be a protein-coding sequence. Do the calculation in three reading frames: +1, +2, +3. Refer to slides 56-59 from the lecture on gene prediction for an overview of the calculation method.

CACATCGAGTTGCACACAGATGGA

Remember, for a noncoding sequence we use the same frequency for every codon: 15.625 (1000/64 = 15.625).

UUU 17.10 UCU 14.70 UAU 12.10 UGU 10.10

UUC 20.40 UCC 17.50 UAC 15.50 UGC 12.40

UUA 7.30 UCA 11.90 UAA 0.80 UGA 1.40

UUG 12.70 UCG 4.50 UAG 0.60 UGG 13.00

CUU 12.90 CCU 17.30 CAU 10.60 CGU 4.70

CUC 19.50 CCC 20.00 CAC 15.00 CGC 10.80

CUA 7.00 CCA 16.70 CAA 11.90 CGA 6.30

CUG 40.10 CCG 7.00 CAG 34.40 CGG 11.80

AUU 15.80 ACU 12.90 AAU 16.70 AGU 12.00

AUC 21.30 ACC 19.10 AAC 19.30 AGC 19.40

AUA 7.20 ACA 14.90 AAA 24.00 AGA 11.70

AUG 22.30 ACG 6.20 AAG 32.50 AGG 11.60

GUU 10.90 GCU 18.60 GAU 22.10 GGU 10.80

GUC 14.60 GCC 28.40 GAC 25.70 GGC 22.60

GUA 7.00 GCA 16.00 GAA 29.00 GGA 16.40

GUG 28.70 GCG 7.60 GAG 40.30 GGG 16.40
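The frame-by-frame calculation for exercise 1 can be sketched in Python as follows. This is only an illustration under assumptions: the lecture slides are not reproduced here, so the score is assumed to be the sum of log(codon frequency / 15.625) over the codons of a frame, which is the logarithm of the ratio between the coding and noncoding products. The names `parse_usage`, `codons_in_frame`, and `log_odds` are made up for this sketch.

```python
import math

# Codon usage per 1000 codons, copied from the table above (RNA alphabet).
USAGE_TEXT = """
UUU 17.10 UCU 14.70 UAU 12.10 UGU 10.10
UUC 20.40 UCC 17.50 UAC 15.50 UGC 12.40
UUA 7.30 UCA 11.90 UAA 0.80 UGA 1.40
UUG 12.70 UCG 4.50 UAG 0.60 UGG 13.00
CUU 12.90 CCU 17.30 CAU 10.60 CGU 4.70
CUC 19.50 CCC 20.00 CAC 15.00 CGC 10.80
CUA 7.00 CCA 16.70 CAA 11.90 CGA 6.30
CUG 40.10 CCG 7.00 CAG 34.40 CGG 11.80
AUU 15.80 ACU 12.90 AAU 16.70 AGU 12.00
AUC 21.30 ACC 19.10 AAC 19.30 AGC 19.40
AUA 7.20 ACA 14.90 AAA 24.00 AGA 11.70
AUG 22.30 ACG 6.20 AAG 32.50 AGG 11.60
GUU 10.90 GCU 18.60 GAU 22.10 GGU 10.80
GUC 14.60 GCC 28.40 GAC 25.70 GGC 22.60
GUA 7.00 GCA 16.00 GAA 29.00 GGA 16.40
GUG 28.70 GCG 7.60 GAG 40.30 GGG 16.40
"""

BACKGROUND = 1000 / 64  # noncoding frequency per codon, 15.625


def parse_usage(text):
    """Turn alternating 'codon frequency' tokens into a dict."""
    tokens = text.split()
    return {tokens[i]: float(tokens[i + 1]) for i in range(0, len(tokens), 2)}


CODON_USAGE = parse_usage(USAGE_TEXT)


def codons_in_frame(seq, frame):
    """Complete codons of seq read in frame +1, +2 or +3 (T is written as U)."""
    rna = seq.upper().replace("T", "U")
    start = frame - 1
    return [rna[i:i + 3] for i in range(start, len(rna) - 2, 3)]


def log_odds(seq, frame):
    """Assumed score: sum of log(coding frequency / background frequency)."""
    return sum(math.log(CODON_USAGE[codon] / BACKGROUND)
               for codon in codons_in_frame(seq, frame))


SEQUENCE = "CACATCGAGTTGCACACAGATGGA"
for frame in (1, 2, 3):
    print(f"+{frame}", codons_in_frame(SEQUENCE, frame),
          round(log_odds(SEQUENCE, frame), 3))
```

Under this assumed scoring, a positive value means the frame looks more like coding sequence than like the uniform background, and a negative value means the opposite. Incomplete codons at the end of a frame are ignored.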

2. Download the DNA sequence from here. Assume that the sequence is of human origin.

a) Run Genscan - save all the predicted proteins and CDS sequences.

b) Run RepeatMasker on each predicted gene nucleotide sequence.

c) Run BLAST for each predicted protein.

d) Summarize results in the following table:

* In the comment you may want to include the following information: putative splice variant; new human gene similar to a gene from another species or to other human genes from the same family; mispredicted structure; putative alternative splicing; differences between the predicted gene and the gene in GenBank (if any); and putative reasons for those differences (wrongly predicted gene structure, misprediction due to a repetitive element, or possibly a new splice variant). Please look for any interesting information and try to investigate it. Use your knowledge from previous lectures: can you find evidence from EST sequences to support your conclusions? Genomic-mRNA alignments may also give you a better picture. The goal is to find as much information as possible about the predicted genes. If you do additional analysis, please describe it and provide the evidence if it led you to any conclusions.

3. The sequence alignment given below represents a new regulatory element that is not annotated in any database.

CTCGTTGAG

CTGGTTCAG

CTGGCCGCA

GTGGTCGGG

CTGGCTACA

CTGGTCACG

CTGGGTGGG

GTGGTCGGT

CTGGGCGTT

CTGGTCGTG

CTCGGTAAA

CTGGGCTCA

CTGGTCGAG

CTCGGCGCA

CTGGTCCTA

a) Create a frequency matrix in "TRANSFAC-style" according to the following example.

NA AML-1a

DE runt-factor AML-1

BF T02256; AML1a; Species: human, Homo sapiens.

P0 A C G T

01 5 1 2 49 T

02 2 2 52 1 G

03 4 14 1 38 T

04 0 0 57 0 G

05 1 0 55 1 G

06 1 4 0 52 T
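Counting bases per alignment column, as asked in 3 a), can be sketched in Python as follows. This is a minimal illustration, not a reference solution: the function names `count_matrix` and `format_transfac` are made up here, only the `P0` block is printed (the `NA`, `DE`, and `BF` lines are left to you), and the last column simply shows the most frequent base rather than an IUPAC ambiguity code, which is a simplification.

```python
from collections import Counter

# The 15 aligned sites from exercise 3.
SITES = [
    "CTCGTTGAG", "CTGGTTCAG", "CTGGCCGCA", "GTGGTCGGG", "CTGGCTACA",
    "CTGGTCACG", "CTGGGTGGG", "GTGGTCGGT", "CTGGGCGTT", "CTGGTCGTG",
    "CTCGGTAAA", "CTGGGCTCA", "CTGGTCGAG", "CTCGGCGCA", "CTGGTCCTA",
]


def count_matrix(sites):
    """Return one [A, C, G, T] count row per alignment column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    rows = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        rows.append([counts.get(base, 0) for base in "ACGT"])
    return rows


def format_transfac(rows):
    """Format count rows like the P0 block of the example above."""
    lines = ["P0 A C G T"]
    for number, row in enumerate(rows, start=1):
        # Most frequent base; ties go to the first base in A, C, G, T order.
        consensus = "ACGT"[row.index(max(row))]
        counts = " ".join(str(n) for n in row)
        lines.append(f"{number:02d} {counts} {consensus}")
    return "\n".join(lines)


MATRIX = count_matrix(SITES)
print(format_transfac(MATRIX))
```

Every row should sum to the number of sites (15), which is a quick sanity check on a matrix built by hand.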

b) Use Cister to check whether the sequence used in exercise 2 contains this element. Change the motif probability threshold to 0.8.

c) Now run Cister again and check your sequence for any of the motifs listed on the input page. Does your element overlap with any of these? Summarize your results in a table.

Position	Strand	Probability	Overlap with a known motif
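Filling in the last column comes down to an interval comparison, which can be sketched in Python as follows. Everything in this sketch is hypothetical: the hit coordinates, probabilities, and motif names are placeholders and not real Cister output, and the helper names `overlaps` and `overlapping_names` are made up. It also assumes that all hits are reported as start and end positions on the same coordinate system, regardless of strand.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if the closed intervals [start_a, end_a] and [start_b, end_b] share a position."""
    return start_a <= end_b and start_b <= end_a


def overlapping_names(hit, known_hits):
    """Names of the known-motif hits that overlap the given element hit."""
    return [known["name"] for known in known_hits
            if overlaps(hit["start"], hit["end"], known["start"], known["end"])]


# Hypothetical placeholder hits for the new 9 bp element (NOT real results).
element_hits = [
    {"start": 120, "end": 128, "strand": "+", "prob": 0.91},
]

# Hypothetical placeholder hits for known motifs (NOT real results).
known_hits = [
    {"name": "motif_X", "start": 125, "end": 134},
    {"name": "motif_Y", "start": 300, "end": 310},
]

for hit in element_hits:
    names = overlapping_names(hit, known_hits)
    print(hit["start"], hit["strand"], hit["prob"], ", ".join(names) or "none")
```

Replace the placeholder lists with the positions you read from your own Cister runs.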

Table for exercise 2 d):

Number of exons	Repetitive elements	Putative gene name based on BLASTP search	Perfect match (yes/no)	Comment*