Vol. 17 no. 1 2001
Pages 13–15
BIOINFORMATICS
Pro-Frame: similarity-based gene recognition in
eukaryotic DNA sequences with errors
Andrey A. Mironov, Pavel S. Novichkov and Mikhail S. Gelfand
State Scientific Center for Biotechnology NIIGenetika, Moscow, 113545, Russia
Received on September 22, 1999; revised on June 5, 2000; accepted on July 27, 2000
ABSTRACT
Summary: Performance of existing algorithms for similarity-based gene recognition in eukaryotes drops when the genomic DNA has been sequenced with errors. A modification of the spliced alignment algorithm allows for gene recognition in sequences with errors, in particular frameshifts. It tolerates up to 5% of sequencing errors without considerable drop of prediction reliability when a sufficiently close homologous protein is available (normalized evolutionary distance similarity score 50% or higher).
Availability: The program is free for academic users and available upon request at http://www.anchorgen.com

Analysis of sequence similarity is a powerful tool for gene recognition. It is employed in a number of database search programs, most notably BLASTX (Gish and States, 1993), and programs for exact prediction of exon–intron structure, in particular, Procrustes (Gelfand et al., 1996; Mironov et al., 1998), INFO (Hultner et al., 1994; Laub and Smith, 1998), GeneWise (Birney and Durbin, 1997). The common idea behind these algorithms is that among numerous possible exon chains, an algorithm chooses the chain having the highest similarity to a related protein (target). This is done by modified dynamic programming treating introns as a special case of gaps (GeneWise) or by spliced alignment (Procrustes).
Testing of the similarity-based gene recognition programs demonstrated that given sufficiently close relatives, they produce highly reliable predictions. In particular, the correlation between predicted and real human genes is 96–99% when homologous vertebrate genes are available (Mironov et al., 1998; Laub and Smith, 1998). However, the quality of gene predictions when the genomic DNA contains sequencing errors is much lower (Burset and Guigo, 1996). One possibility to avoid this problem is to use the DNA spliced alignment instead of aligning translated candidate exons with proteins (Sze and Pevzner, 1997). However, it is well known that protein alignments are much more sensitive to distant similarities than nucleotide alignments. Thus it is indicative that there exist numerous protein–DNA alignment algorithms accounting for frameshifts (Posfai and Roberts, 1992; Birney et al., 1996; Guan and Uberbacher, 1996; Zhang et al., 1997; Pearson et al., 1997). However, none of them handles introns.
We have implemented a modified version of the spliced alignment algorithm performing gene recognition in the presence of frameshift errors. The algorithm treats introns as non-penalized gaps that may start only at dinucleotide GT and end at dinucleotide AG. Frameshifts and in-frame stop codons in the genomic sequence are allowed, but heavily penalized. There is an option for acceleration of the dynamic programming, using the k-tuple alignment technique due to M. Roytberg (Nazipova et al., 1995). Since sequencing errors can destroy the invariant dinucleotides at splicing sites, the program has a post-processing step. At this step the program identifies runs of deletions at exon termini, and moves the exon–intron boundary even if there are no suitable dinucleotides. More exactly, observing more than 50% deleted positions in the region (−30, +30) around the exon junction, the program searches for the optimal position of the donor and acceptor splicing sites allowing for a single deviation from the invariant dinucleotide at each site. The program outputs the exon positions before and after the correction and the alignment of the predicted exons and the target protein.
Results of testing the algorithm on a sample of human genes and related proteins from (Mironov et al., 1998) are given in Figure 1. This sample consists of 256 genes. The average length of genomic sequences is approximately 8100 nucleotides, with the longest sequence exceeding 180 000 nucleotides. The number of exons ranges from 1 through 54, the average number of exons per gene is 5.5. The average length of exons in multi-exon genes is 140 nucleotides. The total number of protein targets is 731, their average length is 575 amino acids. Five independent rounds of mutations were performed for each sequence. The 3655 predictions (five times 731 comparisons) were done in about 5 h on a PC with Pentium II 400 MHz processor under Windows NT.
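The post-processing rule described above (more than 50% deleted positions within 30 positions on either side of an exon junction, followed by a search for splice sites that deviate from the invariant dinucleotide in at most one position) can be illustrated with a minimal Python sketch. This is not the Pro-Frame implementation. The function names, the representation of the alignment as a per-position list of booleans (`deleted`), and the exact handling of window boundaries are all assumptions made for illustration, and the scoring used to pick the optimal site is not specified here, so it is left out.

```python
def deleted_fraction(deleted, junction, half_window=30):
    """Fraction of positions flagged as deleted within half_window of junction.

    `deleted` is a hypothetical per-position list of booleans (True = the
    alignment shows a deletion there); this representation is an assumption.
    """
    lo = max(0, junction - half_window)
    hi = min(len(deleted), junction + half_window)
    window = deleted[lo:hi]
    return sum(window) / len(window) if window else 0.0


def needs_correction(deleted, junction, half_window=30, threshold=0.5):
    """True when more than `threshold` of the window around the junction is deleted."""
    return deleted_fraction(deleted, junction, half_window) > threshold


def matches_site(dinucleotide, consensus, max_mismatches=1):
    """True if the dinucleotide differs from the consensus in at most max_mismatches positions."""
    if len(dinucleotide) != len(consensus):
        return False
    mismatches = sum(a != b for a, b in zip(dinucleotide, consensus))
    return mismatches <= max_mismatches


def candidate_sites(seq, lo, hi, consensus):
    """Start positions in [lo, hi) whose dinucleotide is within one mismatch of consensus."""
    start = max(lo, 0)
    stop = min(hi, len(seq) - 1)
    return [i for i in range(start, stop) if matches_site(seq[i:i + 2], consensus)]


# Usage: a run of deletions around position 60 triggers the relaxed site search.
deleted = [40 <= i < 80 for i in range(100)]
if needs_correction(deleted, junction=60):
    donors = candidate_sites("AAGTAAGCAA", 0, 10, "GT")     # relaxed donor sites
    acceptors = candidate_sites("AAGTAAGCAA", 0, 10, "AG")  # relaxed acceptor sites
```

Allowing one mismatch makes the site test very permissive (any dinucleotide sharing one base with GT or AG passes), which is why a sketch like this would only be applied inside the flagged window and combined with an alignment score to choose among the candidates.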
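A test of this kind, with several independent rounds of mutations per sequence, could be set up along the following lines. The sketch is illustrative only: the error model (substitutions, single-base insertions and single-base deletions chosen with equal probability), the function names and the seeding scheme are assumptions, not the procedure used by the authors.

```python
import random


def mutate(seq, error_rate, rng):
    """Return a copy of seq with simulated sequencing errors.

    Each base is hit with probability error_rate. The three error types and
    their equal weighting are assumptions made for this sketch.
    """
    out = []
    for base in seq:
        if rng.random() < error_rate:
            kind = rng.choice(("sub", "ins", "del"))
            if kind == "sub":
                # Replace the base with a different nucleotide.
                out.append(rng.choice([b for b in "ACGT" if b != base]))
            elif kind == "ins":
                # Keep the base and insert a random nucleotide after it (frameshift).
                out.append(base)
                out.append(rng.choice("ACGT"))
            # "del": emit nothing, dropping the base (frameshift).
        else:
            out.append(base)
    return "".join(out)


def mutation_rounds(seq, error_rate, rounds=5, seed=0):
    """Generate `rounds` independently mutated copies of seq from one seeded generator."""
    rng = random.Random(seed)
    return [mutate(seq, error_rate, rng) for _ in range(rounds)]


# Usage: five mutated copies of a toy sequence at a 5% error rate.
copies = mutation_rounds("ATGGCCGTAAGTCTGACCAGGTTTAA" * 4, error_rate=0.05)
```

With five rounds for each of the 731 gene-target comparisons, this yields the 3655 predictions quoted in the text.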
The performance at different error levels is estimated
using the standard correlation coefficient measure (Burset
© Oxford University Press 2001