E.M.Zdobnov and R.Apweiler
in other stand-alone ad-hoc solutions. The Perl-based
InterProScan is capable of providing post-processed,
integrated results in several formats and it could be
used as a simple retrieval system for the underlying
data.
The tool has become popular in the bioinformatics
community. The EBI public web interface serves more
than 10 000 interactive requests a month. There are
more than 60 installations worldwide of the Perl-based
InterProScan package, which has already been used to
analyse complete genomes on a production scale.
INTERPROSCAN
InterProScan is a tool that combines different protein
signature recognition methods into one resource. The number
of signature databases and their associated scanning tools,
as well as the further refinement procedures, increases the
complexity of the problem. InterProScan is more than a
simple wrapper around sequence analysis applications, since
it requires considerable data look-up from some
databases and program outputs. The need for production-scale
efficiency and easy extensibility requires
a robust and efficient (parallel) internal architecture that
can benefit from network-distributed computing with the
support of UNIX queueing systems.
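As a minimal sketch of the parallel design argued for above, the following Python fragment runs several scanning methods concurrently and merges their hits into one result list. Python is used for illustration only; the package itself is written in Perl. The scanner functions, the method names and the toy motifs are hypothetical stand-ins, and a local thread pool stands in for a UNIX queueing system.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for signature scanning tools. Real scanners would be
# external programs, possibly submitted to a queueing system.
def scan_motif_a(sequence):
    """Report every position where the toy motif 'GK' starts."""
    return [("toy_motif_GK", i) for i in range(len(sequence) - 1)
            if sequence[i:i + 2] == "GK"]


def scan_motif_b(sequence):
    """Report every position of the residue 'C'."""
    return [("toy_motif_C", i) for i, aa in enumerate(sequence) if aa == "C"]


SCANNERS = {"method_a": scan_motif_a, "method_b": scan_motif_b}


def run_all(sequence, scanners=SCANNERS, max_workers=4):
    """Run every scanner concurrently and merge the hits, sorted by position."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all scanners first so that they can run in parallel.
        futures = {name: pool.submit(fn, sequence)
                   for name, fn in scanners.items()}
        merged = []
        for name, future in futures.items():
            for signature, position in future.result():
                merged.append((position, name, signature))
    return sorted(merged)


if __name__ == "__main__":
    for hit in run_all("MGKCAGKC"):
        print(hit)
```

Adding a further method in this sketch only requires a new entry in `SCANNERS`, which mirrors the extensibility requirement stated above.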
We developed an SRS-based InterProScan suite as well
as the stand-alone Perl-based InterProScan package.
SRS (Etzold et al., 1996) has become an integration
system for both data retrieval and data analysis applications,
which makes it ideally suited to resolving the data flow
complexity in InterProScan. InterProScan was first implemented
using the introduced technique of joining some
of the SRS-integrated applications into one virtual application
that can organize the execution of the underlying
steps in an efficient (parallel) manner. Later, we developed
a client web interface using the SRS Perl API that
is a compromise between the SRS inter-database linking
integrity and the simplicity of the user interface, providing
‘one-click-away’ results.
While the SRS-based InterProScan benefits
from close integration with other databases, it requires
some SRS expertise and is bound to the licensed SRS
distribution. To overcome these limitations we decided
to develop a stand-alone InterProScan version based
on the popular scripting language Perl. The Perl-based
InterProScan was intended as an extensible and scalable
system optimised to cope with bulk data processing. In
the package, a simple Perl-based data retrieval system
was introduced to provide the required data look-up
efficiency and easy extensibility. The system has a modular
structure and is designed in an SRS-like fashion.
Each of the data description modules defines the data
schema of the source text data and the parsing rules. The
corresponding Perl module provides an object-oriented
interface to the underlying entry attributes. The parsing
of the source data into in-memory objects happens only
once and is done upon request, implementing so-called
lazy parsing. Hierarchical parsing rules are implemented
using the recursive-descent approach (Parse-RecDescent
package). Fast data retrieval is implemented using Perl's
native B-tree indexing (DB_File.pm, based on Berkeley
DB). The simple ‘one Perl module per data source’
organisation makes it possible to reuse the modules
in other stand-alone ad-hoc solutions.
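The combination of indexed retrieval and lazy parsing described in this section can be illustrated with a short sketch. It is written in Python rather than Perl and is not the package's actual code: the class names, the two-letter tags, the `//` record terminator and the sample records are all hypothetical, and a plain in-memory dictionary of offsets stands in for the Berkeley DB B-tree index.

```python
class Entry:
    """One record; its raw text is parsed only when a field is first read."""

    def __init__(self, raw):
        self._raw = raw
        self._fields = None      # filled on first access (lazy parsing)
        self.parse_count = 0     # instrumentation for this sketch only

    def get(self, tag):
        if self._fields is None:
            # Parse once, on request, and keep the result in memory.
            self._fields = {}
            for line in self._raw.splitlines():
                name, _, value = line.partition(" ")
                self._fields.setdefault(name, []).append(value.strip())
            self.parse_count += 1
        return self._fields.get(tag, [])


class FlatFileSource:
    """Hypothetical 'one module per data source' wrapper around a flat file."""

    def __init__(self, text, key_tag="AC", terminator="//"):
        self._text = text
        self._index = {}   # key -> (start, end) offsets; stands in for a B-tree
        self._cache = {}   # key -> Entry, so each record is built only once
        start = pos = 0
        key = None
        for line in text.splitlines(keepends=True):
            name, _, value = line.partition(" ")
            if name == key_tag:
                key = value.strip()
            if line.strip() == terminator:
                if key is not None:
                    self._index[key] = (start, pos)
                start, key = pos + len(line), None
            pos += len(line)

    def fetch(self, key):
        """Look up a record by key without parsing its contents."""
        if key not in self._cache:
            start, end = self._index[key]
            self._cache[key] = Entry(self._text[start:end])
        return self._cache[key]


# Hypothetical sample data in an invented tag-value layout.
SAMPLE = (
    "AC   X0001\n"
    "NM   toy domain one\n"
    "//\n"
    "AC   X0002\n"
    "NM   toy domain two\n"
    "//\n"
)

if __name__ == "__main__":
    source = FlatFileSource(SAMPLE)
    entry = source.fetch("X0002")   # index look-up only, nothing parsed yet
    print(entry.get("NM"))          # first attribute access triggers the parse
```

In this sketch, building the index touches only the key lines, so the cost of parsing is paid solely for the entries that are actually inspected.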
ACKNOWLEDGEMENTS
We would like to thank Rodrigo Lopez for general support
and the mailserver backend as well as Thure Etzold and
Henning Hermjakob for useful discussions and ideas.
REFERENCES
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z.,
Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-
BLAST: a new generation of protein database search programs.
Nucleic Acids Res., 25, 3389–3402.
Attwood,T.K., Croning,M.D., Flower,D.R., Lewis,A.P., Mabey,J.E.,
Scordis,P., Selley,J.N. and Wright,W. (2000) PRINTS-S: the
database formerly known as PRINTS. Nucleic Acids Res., 28,
225–227.
Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and
Sonnhammer,E.L. (2000) The Pfam protein families database.
Nucleic Acids Res., 28, 263–266.
Corpet,F., Gouzy,J. and Kahn,D. (1999) Recent improvements of
the ProDom database of protein domain families. Nucleic Acids
Res., 27, 263–267.
Etzold,T., Ulyanov,A. and Argos,P. (1996) SRS: information
retrieval system for molecular biology data banks. Meth. Enzymol.,
266, 114–128.
Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) The
PROSITE database, its status in 1999. Nucleic Acids Res., 27,
215–219.
Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000)
SMART: a web-based tool for the study of genetically mobile
domains. Nucleic Acids Res., 28, 231–234.
Scordis,P., Flower,D.R. and Attwood,T.K. (1999) FingerPRINTScan:
intelligent searching of the PRINTS motif database.
Bioinformatics, 15, 799–806.
The InterPro Consortium (Apweiler,R., Attwood,T.K., Bairoch,A.,
Bateman,A., Birney,E., Biswas,M., Bucher,P., Cerutti,L.,
Corpet,F., Croning,M.D., Durbin,R., Falquet,L., Fleischmann,W.,
Gouzy,J., Hermjakob,H., Hulo,N., Jonassen,I., Kahn,D.,
Kanapin,A., Karavidopoulou,Y., Lopez,R., Marx,B.,
Mulder,N.J., Oinn,T.M., Pagni,M., Servant,F., Sigrist,C.J. and
Zdobnov,E.M.) (2001) The InterPro database, an integrated
documentation resource for protein families, domains and functional
sites. Nucleic Acids Res., 29, 37–40.