database cross-references to retrieve all instances of the
same protein or a class of proteins in all of the legacy
databases. The use of cross-references creates the effect
of amplifying the annotation content: it is sufficient to
find the search terms in one of the database instances
of a protein or a protein class to retrieve all or most
of the other instances. This allows the search engine to
find database entries where annotation is incomplete or
alternative terms are used. While performance of SRS is
close to that of BioMolQuest, the user must possess a
certain degree of sophistication to use it the way we did.
Casual users would probably just use an SRS server to
search the PDB directly, which would not give them any
advantage over using the RCSB web site. Simplicity of the
user interface is an important factor in web design. While
scientists are generally more sophisticated than average
web users, most of them would rather spend their energies
on their research topics than on the subtleties of database
querying techniques.
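
The amplification effect of cross-references described above can be illustrated with a minimal sketch. It is not the BioMolQuest implementation: the toy records, the accession strings, and the function names (`keyword_hits`, `expand`, `search`) are hypothetical, and following cross-references transitively is an assumption made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy records: (database, accession) -> annotation text.
ENTRIES = {
    ("SWISS-PROT", "sp_1"): "transcription factor, DNA-binding",
    ("PDB", "pdb_A"): "crystal structure of a protein-DNA complex",
    ("PDB", "pdb_B"): "hydrolase",
}

# Hypothetical cross-references linking instances of the same protein.
XREFS = [(("SWISS-PROT", "sp_1"), ("PDB", "pdb_A"))]


def build_links(xrefs):
    """Turn a list of cross-reference pairs into a symmetric adjacency map."""
    links = defaultdict(set)
    for a, b in xrefs:
        links[a].add(b)
        links[b].add(a)
    return links


def keyword_hits(entries, terms):
    """Return the entries whose annotation contains every search term."""
    terms = [t.lower() for t in terms]
    return {key for key, text in entries.items()
            if all(t in text.lower() for t in terms)}


def expand(hits, links):
    """Add every entry reachable from the hits through cross-references."""
    seen = set(hits)
    stack = list(hits)
    while stack:
        node = stack.pop()
        for neighbour in links.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def search(entries, xrefs, terms, target_db=None):
    """Keyword search followed by cross-reference expansion."""
    found = expand(keyword_hits(entries, terms), build_links(xrefs))
    if target_db is not None:
        found = {key for key in found if key[0] == target_db}
    return sorted(found)
```

In this toy data set the terms occur only in the SWISS-PROT annotation, so a direct keyword search of the PDB records finds nothing, whereas `search(ENTRIES, XREFS, ["transcription", "factor"], target_db="PDB")` retrieves the linked PDB record through the cross-reference.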
Table 2 also shows the time it takes BioMolQuest
and other servers to deliver results of a query. The
times reported for the SRS servers are the sums of
three operations, as discussed above. One should not
overinterpret the exact times we report, as they are almost
certainly affected by factors that have nothing to do with
the search engine design, such as network delays and
differences in the server hardware and loads. We report
the minimal observed times along with the averages, as
they reflect the server performance under light load and/or
minimal delays in the network. It is obvious that the
BioMolQuest response times are comparable to those of
the other search engines for queries of moderate size.
BioMolQuest does slow down considerably for larger
queries. For example, if one searches for ‘transcription
and factor’, the search engine has to go through about
5000 SWISS-PROT entries, finding 356 PDB entries in
the end. The average BioMolQuest response time to this
query is 59 s. RCSB seems to slow down for large queries
as well (42 s average for ‘transcription and factor’), while
SRS servers are not affected.
DISCUSSION
We have implemented a search engine based on a
relational database of annotations imported from PDB,
SWISS-PROT, ENZYME, and CATH (Berman et al.,
2000; Bairoch and Apweiler, 2000; Bairoch, 2000;
Orengo et al., 1997). This search engine allows for more
powerful annotation searches than is possible with any
of the individual legacy databases. Information from
the legacy databases is integrated using inter-database
cross-references provided in the legacy database entries or
inferred when necessary and/or possible. Automatic use of
these cross-references by the search engine significantly
improves recall of keyword queries without any additional
effort on the part of the user. At the same time, application
of the queries to individual database fields, rather than
entire entries, ensures good precision. The output of the
queries is presented in a logical order that clarifies the
relationships among the retrieved database entries. A
complex query is possible through repeat searches that
limit the results of the previous searches. This search
engine can be used as a convenient gateway to the legacy
databases, particularly the PDB.
BioMolQuest should be considered a work in progress.
Its current weaknesses include sluggish response times
for large queries and the lack of quality control of
cross-references. The first problem will be addressed by
better software design, possibly including parallelization
of the queries. The cross-references shall be verified
by sequence identity. Other directions of BioMolQuest
development include integrating additional resources into
the service, such as metabolic pathway information and
protein structures predicted by our group.
ACKNOWLEDGEMENTS
This research was supported in part by NIH grant No. GM-48835
of the Division of General Medical Sciences.
REFERENCES
An,J., Nakama,T., Kubota,Y. and Sarai,A. (1998) 3DinSight: an
integrated relational database and search tool for structure,
function and property of biomolecules. Bioinformatics, 14, 188–195.
Anahory,S. and Murray,D. (1997) Data Warehousing in the Real
World. Addison-Wesley, Longman, UK.
Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A., Birney,E.,
Bucher,P., Codani,J.-J., Corpet,F., Croning,M.D.R., Durbin,R.,
Etzold,T., Fleischmann,W., Gouzy,J., Hermjakob,H., Jonassen,I.,
Kahn,D., Kanapin,A., Schneider,R., Servant,F. and Zdobnov,E.
(2000) InterPro—an integrated documentation resource for protein
families, domains and functional sites. CCP11 Newsletter, 10.
Bairoch,A. (2000) The ENZYME database in 2000. Nucleic Acids
Res., 28, 304–305.
Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein
sequence database and its supplement TrEMBL in 2000. Nucleic
Acids Res., 28, 45–48.
Baker,W., van den Broek,A., Camon,E., Hingamp,P., Sterk,P.,
Stoesser,G. and Tuli,M.A. (2000) The EMBL nucleotide sequence
database. Nucleic Acids Res., 28, 19–23.
Barker,W.C., Garavelli,J.S., Huang,H., McGarvey,P.B., Orcutt,B.,
Srinivasarao,G.Y., Xiao,C., Yeh,L.S., Ledley,R.S., Janda,J.F.,
Pfeiffer,F., Mewes,H.W., Tsugita,A. and Wu,C. (2000) The
Protein Information Resource (PIR). Nucleic Acids Res., 28, 41–44.
Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J.,
Rapp,B.A. and Wheeler,D.L. (2000) GenBank. Nucleic
Acids Res., 28, 15–18.