E.Kretschmann et al.
empirical evidence is under construction and is planned
to be implemented in later versions of this tool.
At present, the rules can be used to support the manual
annotation process performed by the SWISS-PROT database
curators. The tool is also ready to be made publicly
available (http://golgi.ebi.ac.uk:8080/Spearmint/).
REFERENCES
Apweiler,R. (2000) Protein sequence databases. Adv. Protein
Chem., 54, 31–71.
Apweiler,R. (2001) Functional information in SWISS-PROT: the
basis for large-scale characterisation of protein sequences.
Briefings in Bioinformatics, 2, 9–18.
Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A., Birney,E.,
Biswas,M., Bucher,P., Cerutti,L., Corpet,F., Croning,M.D.R.,
Durbin,R., Falquet,L., Fleischmann,W., Gouzy,J., Hermjakob,H.,
Hulo,N., Jonassen,I., Kahn,D., Kanapin,A., Karavidopoulou,Y.,
Lopez,R., Marx,B., Mulder,N.J., Oinn,T.M., Pagni,M., Servant,F.,
Sigrist,C.J.A. and Zdobnov,E. (2000) InterPro—an integrated
documentation resource for protein families, domains and
functional sites. Bioinformatics, 16, 1145–1150.
Attwood,T.K., Croning,M.D.R., Flower,D.R., Lewis,A.P.,
Mabey,J.E., Scordis,P., Selley,J.N. and Wright,W. (2000)
PRINTS-S: the database formerly known as PRINTS. Nucleic
Acids Res., 28, 225–227.
Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein
sequence database and its supplement TrEMBL in 2000. Nucleic
Acids Res., 28, 45–48.
Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and
Sonnhammer,E.L.L. (2000) The Pfam protein families database.
Nucleic Acids Res., 28, 263–266.
Corpet,F., Servant,F., Gouzy,J. and Kahn,D. (2000) ProDom and
ProDom-CG: tools for protein domain analysis and whole
genome comparisons. Nucleic Acids Res., 28, 267–269.
Fleischmann,W., Möller,S., Gateau,A. and Apweiler,R. (1999) A
novel method for automatic functional annotation of proteins.
Bioinformatics, 15, 228–233.
The Gene Ontology Consortium (2000) Gene Ontology: tool for
the unification of biology. Nature Genet., 25, 25–29.
Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) The
PROSITE database, its status in 1999. Nucleic Acids Res., 27,
215–219.
Further developments
Further development can proceed in two directions: firstly, the
rule generation process can be improved and, secondly, the
rules can be used in different applications. Currently,
the following projects are under development:
• A web-based application that allows the rules to be
applied not only to TrEMBL entries, but also to raw
amino acid sequences.
• Mining for Description, Comment and Feature lines.
• Integration of other data mining techniques.
• Application of the rules to entries in Ensembl (http://
• Automated application of the rules to proteins in
TrEMBL without human interaction.
• Automated rule generation on GO terms (The
Gene Ontology Consortium, 2000) rather than on
Keywords.
One of the most pressing tasks is to improve the confidence
calculation. The current routine does not use information
about the number of true negatives (TNs) or false negatives
(FNs) and hence tends to generate frequent Keywords rather
than rare ones, i.e. general Keywords are produced more
often than specific ones, although the latter are the more
valuable. Extracting reliable rules for rare Keywords would
therefore be a very useful improvement of the tool.
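The effect described above can be illustrated with a small sketch. The confidence formula is not restated in this section, so the sketch assumes confidence = TP / (TP + FP). The Matthews correlation coefficient appears only as one example of a measure that also uses TNs and FNs; it is not the improvement planned for the tool. The class `RuleCounts`, both functions and all counts are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class RuleCounts:
    """Hypothetical confusion-matrix counts for one rule on 100 proteins."""
    tp: int  # rule fires, Keyword is annotated
    fp: int  # rule fires, Keyword is not annotated
    fn: int  # rule does not fire, Keyword is annotated
    tn: int  # rule does not fire, Keyword is not annotated


def confidence(c: RuleCounts) -> float:
    """Assumed confidence: TP / (TP + FP). TNs and FNs are ignored."""
    fired = c.tp + c.fp
    return c.tp / fired if fired else 0.0


def mcc(c: RuleCounts) -> float:
    """Matthews correlation coefficient, which uses all four counts."""
    denom = sqrt((c.tp + c.fp) * (c.tp + c.fn) * (c.tn + c.fp) * (c.tn + c.fn))
    return (c.tp * c.tn - c.fp * c.fn) / denom if denom else 0.0


# Frequent Keyword: 90 of 100 proteins carry it, the rule fires on 50 of them.
frequent_rule = RuleCounts(tp=45, fp=5, fn=45, tn=5)
# Rare Keyword: 5 of 100 proteins carry it, the rule fires on 5 proteins.
rare_rule = RuleCounts(tp=4, fp=1, fn=1, tn=94)

for name, counts in (("frequent", frequent_rule), ("rare", rare_rule)):
    print(f"{name}: confidence={confidence(counts):.2f}, mcc={mcc(counts):.2f}")
```

In this constructed case the rule for the frequent Keyword has the higher confidence, yet it is no better than assigning the Keyword at its base rate of 90%. The rule for the rare Keyword has the lower confidence but is far more informative once TNs and FNs are taken into account.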
For all calculations, independence between training and
target set was assumed, which is clearly not the case. An
empirical test of how strongly this bias influences the
results would give a better picture of the performance of
the rules on unknown data.
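One possible form of such an empirical test is a hold-out comparison: generate rules on part of the annotated data and measure their confidence on entries that were withheld. The sketch below is only an assumption about how this could be set up. The toy learner `learn_rules` (most frequent Keyword per group) is a stand-in for the actual rule generation, and the group identifiers, Keywords and both data sets are hypothetical.

```python
from collections import Counter, defaultdict


def learn_rules(entries):
    """Toy stand-in for rule generation: most frequent Keyword per group."""
    by_group = defaultdict(Counter)
    for group, keyword in entries:
        by_group[group][keyword] += 1
    return {g: counts.most_common(1)[0][0] for g, counts in by_group.items()}


def rule_confidence(rules, entries):
    """Fraction of entries covered by a rule whose Keyword is predicted correctly."""
    covered = [(g, k) for g, k in entries if g in rules]
    if not covered:
        return 0.0
    correct = sum(1 for g, k in covered if rules[g] == k)
    return correct / len(covered)


# Hypothetical annotated entries as (protein group, Keyword) pairs.
training = [
    ("group_A", "Kinase"), ("group_A", "Kinase"), ("group_A", "Transferase"),
    ("group_B", "Hydrolase"), ("group_B", "Hydrolase"),
]
held_out = [
    ("group_A", "Kinase"), ("group_A", "Transferase"),
    ("group_B", "Hydrolase"), ("group_B", "Protease"),
]

rules = learn_rules(training)
on_training = rule_confidence(rules, training)
on_held_out = rule_confidence(rules, held_out)
print(f"training={on_training:.2f} held-out={on_held_out:.2f} "
      f"gap={on_training - on_held_out:.2f}")
```

The gap between the two values gives a rough indication of how optimistic a confidence estimated on the training data is. The numbers here are constructed for illustration and say nothing about the real rules.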
CONCLUSION
The presented method mines for Keyword annotation in
SWISS-PROT using a Java implementation of the C4.5
algorithm on protein groups assembled in InterPro entries.
The results are satisfactory in terms of coverage and
confidence, yet both aspects can be improved further, as
pointed out above. Including other methods that group
similar proteins into sets, such as CluSTr or ProDom
(Corpet et al., 2000), will help to increase the coverage,
while a refined statistical analysis will improve the
ordering of the generated rules by reliability. This is
expected to lead to higher confidence values.
Hyafil,L. and Rivest,R.L. (1976) Constructing optimal binary
decision trees is NP-complete. Inf. Process. Lett., 5(1), 15–17.
Junker,V., Apweiler,R. and Bairoch,A. (1999) Representation of
functional information in the SWISS-PROT data bank.
Bioinformatics, 15, 1066–1067.
Krause,A., Nicodème,E., Bornberg-Bauer,M., Rehmsmeier,M. and
Vingron,M. (1999) WWW access to the SYSTERS protein
sequence cluster set. Bioinformatics, 15, 262–263.
Kriventseva,E.V., Fleischmann,W., Zdobnov,E.M. and Apweiler,R.
(2001) CluSTr: a database of clusters of SWISS-PROT +
TrEMBL proteins. Nucleic Acids Res., 29, 33–36.
Möller,S., Leser,U., Fleischmann,W. and Apweiler,R. (1999)
EDITtoTrEMBL: a distributed approach to high-quality
automated protein sequence annotation. Bioinformatics, 15, 219–227.
Quinlan,J.R. (1986) Induction of decision trees. Mach. Learn., 1,
81–106.
Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan
Kaufmann, San Francisco, CA.