
DOI: 10.1093/bioinformatics/17.10.920


Vol. 17 no. 10 2001

Pages 920–926

BIOINFORMATICS

Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT

Ernst Kretschmann, Wolfgang Fleischmann and Rolf Apweiler

The EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received on April 20, 2001; revised and accepted on July 8, 2001

© Oxford University Press 2001

ABSTRACT

Motivation: The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools, without the use of automated annotation systems, is not able to keep up with the ever increasing quantity of data that is submitted. Automated supplements to manually curated databases such as TrEMBL or GenPept cover raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information and help to detect inconsistencies in manually generated annotations.

Results: A standard data mining algorithm was successfully applied to gain knowledge about the Keyword annotation in SWISS-PROT. 11 306 rules were generated, which are provided in a database and can be applied to yet unannotated protein sequences and viewed using a web browser. They rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. The statistical evaluation of the generated rules by cross-validation suggests that by applying them on arbitrary proteins 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage rate of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%.

Availability: The results of the automatic data mining process can be browsed on http://golgi.ebi.ac.uk:8080/Spearmint/. Source code is available upon request.

Contact: kretsch@ebi.ac.uk

TERMINOLOGY

This paper is about data, information, and knowledge on protein sequences. As far as we know there is no standard definition to distinguish between these concepts. In the following, we are going to use the definitions given below:

• Data: the measurable or observable facts, e.g. the sequence, the organism (in which the protein was found), the literature (in which it was mentioned), etc.

• Information or annotation: the statement of an aspect that is relevant or important to describe the protein as a whole or parts of it, e.g. ‘It is expressed in the mitochondrion’, ‘Amino acids 1–26 encode a Signal’, etc.

• Knowledge: the process that draws conclusions about an unknown protein using gathered information, e.g. ‘The protein sequence contains pattern x. Since all known sequences having this pattern belong to transmembrane proteins, this should also be a transmembrane protein.’

• Data mining: any technique that uses information to gain knowledge on data.

INTRODUCTION

How can information about a protein be obtained? If the protein was biochemically characterized before and this information was entered into a database like SWISS-PROT, which is a completely human expert controlled and maintained database (Bairoch and Apweiler, 2000), one can simply make use of the provided information, i.e. a human being has used his knowledge to compose annotations on this very protein data and established a one-to-one relationship between this data and its annotation which can be used by others. However, often the information is incomplete, which is the case for the majority of the known proteins. Many of those poorly annotated proteins are stored in databases like TrEMBL (Bairoch and Apweiler, 2000), which is only partly annotated by human experts but also by automated annotation systems like EDITtoTrEMBL (Möller et al., 1999) and RuleBase (Fleischmann et al., 1999; Apweiler, 2001). The protein can even be hypothetical, so that there is no information available at all. In these cases one mostly resorts to sequence similarity or signature searches, hoping to find well annotated protein features sharing some similarity with the protein in question.

Apart from similarity searches against comprehensive, non-redundant protein sequence databases like SWISS-PROT and TrEMBL (Apweiler, 2000), the use of protein sequence signature databases such as Prosite (Hofmann et al., 1999), PRINTS (Attwood et al., 2000) or Pfam (Bateman et al., 2000) can be helpful, as are protein cluster databases like SYSTERS (Krause et al., 1999) and CluSTr (Kriventseva et al., 2001). In those cases, there is a many-to-many relationship between annotation and data, i.e. one annotation is stored for many proteins and one protein sequence might match various signatures and their annotation. Obviously, the process of gathering, analyzing, evaluating, and deriving information is time-consuming and cumbersome. It can be regarded as manual data mining across various databases.

We have developed a method to automate this process for a subset of the information available in SWISS-PROT, the Keyword Line. Keywords are particularly useful for analysis because they are controlled, limited in number (at the time of this writing there were 850 different Keywords allowed), show little inherent structure or dependencies and are either annotated or not. These facts make automated knowledge acquisition much easier than for comment lines and description lines, which often are in unstructured free text.

The implementation uses the C4.5 data mining algorithm to derive decision trees, which are an equivalent notation for rules. C4.5 shows particularly good results for non-noisy data, which is the case for SWISS-PROT. The derived rules not only fit the training set, but are also human readable and kept short. This is achieved by an elaborate heuristic approach inherent to the standard algorithm. Statistical evidence is also given for every rule, which can be used to order rules in terms of confidence. This property can be used to select subsets of rules for different applications, e.g. only the highly confident ones for error-critical purposes where coverage is less important, and all of the generated rules where coverage is the main concern.

Fig. 1. Example of data distribution in InterPro IPR003009 (only part of data is shown). The first column contains the SWISS-PROT accession numbers of some proteins in this entry. [Table contents not recoverable.]

Fig. 2. Decision tree describing the data in Figure 1. [Tree, as far as recoverable: Mammal? yes: do nothing (5 instances); otherwise Prosite PS00487? yes: do nothing (1 instance); otherwise: annotate ’FAD’ (3 instances).]

Of the proteins in this example, one matches Prosite pattern PS00487, three match Pfam pattern PF01493, five belong to Mammalia and three have the Keyword ‘FAD’. The distribution is shown in Figure 1.

A decision tree is generated that preferably has a small number of leaves, which makes the rules more readable and at the same time more reliable. Fewer leaves mean that on average there are more examples per leaf, which gives the rule better statistical confirmation. In general, there are several possible equivalent decision trees. The example decision tree in Figure 2 covers all the instances in the training set in Figure 1.
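As a rough illustration of equivalent trees, the sketch below writes two hand-made trees as plain Python functions and checks that both reproduce the annotation of a small toy table. The nine rows are an assumption chosen only to match the counts quoted in the text (five mammalian proteins, one PS00487 match, three PF01493 matches, three ‘FAD’ proteins); they are not the actual contents of Figure 1, and all function names are hypothetical.

```python
# Toy instances: (is_mammal, matches_PS00487, matches_PF01493, has_FAD).
# Assumed data, chosen only to match the counts quoted in the text.
TOY_DATA = (
    [(True, False, False, False)] * 5    # five mammalian proteins, no 'FAD'
    + [(False, True, False, False)]      # one PS00487 match, no 'FAD'
    + [(False, False, True, True)] * 3   # three PF01493 matches with 'FAD'
)


def tree_with_three_leaves(is_mammal, ps00487, pf01493):
    """Tests taxonomy first, then the Prosite pattern (three leaves)."""
    if is_mammal:
        return False
    if ps00487:
        return False
    return True


def tree_with_two_leaves(is_mammal, ps00487, pf01493):
    """Tests only the Pfam pattern (two leaves)."""
    return pf01493


def classifies_all(tree, data):
    """True if the tree reproduces the 'FAD' column for every instance."""
    return all(tree(m, ps, pf) == fad for m, ps, pf, fad in data)


both_fit = classifies_all(tree_with_three_leaves, TOY_DATA) and classifies_all(
    tree_with_two_leaves, TOY_DATA
)
```

Both trees fit the toy table, but the second one needs fewer leaves, so each of its leaves is backed by more examples.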

SYSTEM AND METHODS

Algorithm

The decision tree in Figure 3, however, classifies the data more compactly. The problem of finding the optimal decision tree is known to be NP-complete (Hyafil and Rivest, 1976). C4.5 uses the gain ratio criterion, which is based on information theory, and produces suboptimal trees heuristically (Quinlan, 1993). Note that if there are two instances having the same core data but a different annotation, there is no tree that classifies all examples of a training set correctly.
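The gain ratio criterion can be sketched in a few lines. The code below is a simplified illustration for boolean attributes only, not the full C4.5 procedure (it omits continuous attributes, missing values and pruning). The toy rows are an assumption matching the counts quoted in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, attribute, target):
    """Information gain of `attribute` w.r.t. `target`, divided by split information."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Expected entropy of the target after splitting on the attribute.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Toy table (assumed): nine proteins, "FAD" is the target Keyword.
ROWS = (
    [{"mammal": True, "PS00487": False, "PF01493": False, "FAD": False}] * 5
    + [{"mammal": False, "PS00487": True, "PF01493": False, "FAD": False}]
    + [{"mammal": False, "PS00487": False, "PF01493": True, "FAD": True}] * 3
)

ratios = {a: gain_ratio(ROWS, a, "FAD") for a in ("mammal", "PS00487", "PF01493")}
best_attribute = max(ratios, key=ratios.get)
```

On this toy table the Pfam attribute separates the ‘FAD’ proteins perfectly, so it obtains the highest gain ratio and would be chosen as the first test.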

The precision of a tree can be checked by analyzing the number of correct and incorrect classifications it produces when applied on the training set. This analysis gives the numbers of True Positives (TPs), True Negatives (TNs), False Positives (FPs) and False Negatives (FNs).

One of the basic ideas of artificial intelligence algorithms is to derive knowledge from training sets and apply it to yet unknown data. The C4.5 algorithm expects input in a tabular format where the last column contains the target, in this particular case the information whether a given keyword is present or not. The preceding columns store core data about the proteins, like taxonomy details or the presence of sequence patterns. The algorithm tries to derive the contents of the last column by using the information in the other columns.
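To make the tabular layout concrete, the following sketch renders a few boolean rows as ARFF text, the plain-text table format read by the Weka package. Whether the authors exchanged data in this form is not stated, so the format choice, the attribute names, the relation name and the three rows are assumptions for illustration only.

```python
def to_arff(relation, attributes, rows):
    """Render boolean rows as ARFF text; the last attribute is the target."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} {{yes,no}}" for name in attributes]
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("yes" if row[name] else "no" for name in attributes))
    return "\n".join(lines)


# Core data columns first, target column (presence of the Keyword) last.
ATTRIBUTES = ["mammalia", "PS00487", "PF01493", "kw_FAD"]
ROWS = [
    {"mammalia": True, "PS00487": False, "PF01493": False, "kw_FAD": False},
    {"mammalia": False, "PS00487": True, "PF01493": False, "kw_FAD": False},
    {"mammalia": False, "PS00487": False, "PF01493": True, "kw_FAD": True},
]

arff_text = to_arff("IPR003009_FAD", ATTRIBUTES, ROWS)
```

One such table would be needed per Keyword and protein group, since the target column changes with the Keyword under consideration.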

To illustrate the procedure, a simple example for the SWISS-PROT proteins in InterPro (Apweiler et al., 2000) entry IPR003009 is given (Figure 1).


For the selected rules, an estimation has to be performed of how well they will perform on unknown data in terms of coverage and error rate. This aspect was tested by a tenfold cross-validation (see Results).

Fig. 3. Compact decision tree describing the data in Figure 1. [Tree, as far as recoverable: Pfam PF01493? yes: annotate ’FAD’ (3 instances); otherwise: do nothing (6 instances).]

The standard algorithm was designed to classify instances into groups where there is an interest in all classes. Yet in this particular application there is no need for rules suggesting the non-annotation of certain Keywords. The standard statistical evaluation implemented in C4.5 was tried as a method to order rules in terms of quality. Parameters to trigger the calculation were adapted to the particular problem, but the results were unsatisfactory. Therefore a procedure was chosen that derives confidence exclusively from the number of TP and FP examples. Given the number of TP and FP examples of a rule, the formula calculates the value above which the true precision of the rule lies in 95% of all cases. To illustrate the idea: suppose one draws from an urn containing an infinite number of balls. Having drawn ten black balls and one white ball, above which value does the true ratio of black to white balls in the urn lie in 95% of the cases? (TP = True Positives, FP = False Positives, c = confidence)

A True Positive (TP) means that the annotation exists in the instance and is predicted, a True Negative (TN) that the annotation does not exist in the instance and is not predicted, a False Positive (FP) that the annotation does not exist in the instance but is predicted, and a False Negative (FN) that the annotation exists in the instance but is not predicted.
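The four outcome classes can be counted directly from two parallel lists. The sketch below is a minimal illustration; the function name and the six example instances are hypothetical.

```python
def confusion_counts(annotated, predicted):
    """Count TP, TN, FP and FN for one Keyword over a set of instances."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for has_keyword, is_predicted in zip(annotated, predicted):
        if has_keyword and is_predicted:
            counts["TP"] += 1   # annotation exists and is predicted
        elif not has_keyword and not is_predicted:
            counts["TN"] += 1   # annotation absent and not predicted
        elif is_predicted:
            counts["FP"] += 1   # annotation absent but predicted
        else:
            counts["FN"] += 1   # annotation exists but is not predicted
    return counts


# Hypothetical example: does each protein carry the Keyword, and does the tree predict it?
annotated = [True, True, True, False, False, False]
predicted = [True, True, False, True, False, False]
counts = confusion_counts(annotated, predicted)
```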

A brute force implementation of the C4.5 algorithm could successively produce a decision tree for every allowed Keyword using all the protein core data available in SWISS-PROT. This procedure would produce huge data tables which cannot be analyzed efficiently (every table would consist of more than 90 000 rows, one for each protein in SWISS-PROT). To produce decision trees with a satisfactory confidence fast enough, the number of instances for this application should be between 100 and 1000. Hence, SWISS-PROT has to be subdivided into protein groups which ideally contain similar proteins. The grouping into proteins common to InterPro (Apweiler et al., 2000) entries is analyzed here, since those entries usually contain a convenient number of similar proteins. Other groupings, such as proteins common to CluSTr (Kriventseva et al., 2001) entries, were performed but not analyzed in detail. Brief investigations of some decision trees starting from CluSTr entries showed results similar to those starting from InterPro entries.

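$$z = 1.96 \quad \text{(constant for 95\%)}$$

$$n = TP + FP$$

$$p = \text{precision} = \frac{TP}{TP + FP}$$

$$c = \text{confidence} = \frac{p + \dfrac{z^2}{2n} - z\sqrt{\dfrac{p}{n} - \dfrac{p^2}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$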

Formula 1. Ordering rules in terms of confidence. The formula depends on TP and FP examples exclusively. Confidence gives the value above which all experiments would lie when an urn experiment was performed with the same distribution of correct and false outcomes.

Figure 4 gives a short overview of which confidence would be calculated for given numbers of TP and FP examples. To be introduced in the database, a rule had to have a confidence of 50% or more.
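The confidence calculation can be sketched as follows, assuming that Formula 1 is the lower bound of the Wilson score interval for a proportion. This reading is an interpretation of the formula, but it reproduces the 51% quoted in the Discussion for four TPs and no FPs. The function and constant names are hypothetical.

```python
import math

Z_95 = 1.96            # constant for 95%, as given with Formula 1
MIN_CONFIDENCE = 0.5   # threshold for entering a rule into the database


def confidence(tp, fp, z=Z_95):
    """Lower Wilson score bound on the precision TP / (TP + FP)."""
    n = tp + fp
    if n == 0:
        return 0.0     # no examples, no statistical support
    p = tp / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p / n - p * p / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)


def is_stored(tp, fp):
    """True if a rule with these counts reaches the 50% threshold."""
    return confidence(tp, fp) >= MIN_CONFIDENCE
```

Under this reading, a rule supported by four TPs and no FPs just passes the 50% threshold, whereas three TPs and no FPs do not suffice.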

Extensions

The algorithm produces a large number of rules of varying quality. In many cases the annotation is not a result of sequence signature or taxonomy, and therefore a decision tree trying to classify instances on this basis will produce annotation at random. The application of those trees on unknown data would lead to a massive error rate. Therefore, a selection of the more trustworthy rules has to be made and evaluated.

There are two steps of statistical evaluation of the results. Firstly, not every generated rule is suitable to be applied, since many proved to have either too high a ratio of FPs to TPs or simply too few sample cases to derive a good statistical confirmation. Therefore, a criterion had to be used that selects only the best rules with a reasonably high confidence. Secondly, once rules with a reliability over a given threshold are selected, their performance on unknown data has to be estimated (see Results).

IMPLEMENTATION

The core application is Java based and uses the Weka Machine Learning Software package, which is open source software issued under the GNU General Public License (download at http://www.cs.waikato.ac.nz/~ml/weka/). The system is divided into a loader module that translates the core information stored in various databases into the tabular input format of the algorithm, and an analyzer module that derives rules and stores the result in a newly created database to allow quick and easy access. Thus, the classical pipeline input–processing–output was implemented with the processing and output unit tied together, an approach that makes extensions and maintenance easier than that of a single monolithic application.

Fig. 4. Ratio TP to FP examples and the resulting confidence. The curves are for 0, 1, 3, 5, 10, 25, 50 and 100 FPs from top to bottom. [Plot: x-axis True Positives (up to 100); y-axis from 0.0 to 1.0.]

Two different modes of operation were implemented. One produces rules on the basis of all suitable SWISS-PROT proteins and writes them to a database. The other performs the cross-validation and evaluates rules without storing them. The data flow of both applications is shown in Figures 5 and 6.
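The production mode can be pictured as two steps: keep the rules whose confidence reaches the threshold, then apply the kept rules to a protein. The sketch below is a simplified illustration with hypothetical names; the rule structure, the example rules and their confidence values are assumptions and do not describe the actual Spearmint database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    conditions: tuple    # pairs (attribute, required value)
    keyword: str
    confidence: float    # value obtained from the confidence formula


def select_rules(rules, threshold=0.5):
    """Keep only rules whose confidence reaches the threshold."""
    return [rule for rule in rules if rule.confidence >= threshold]


def suggest_keywords(rules, protein):
    """Return {keyword: best confidence} for all rules whose conditions hold."""
    suggestions = {}
    for rule in rules:
        if all(protein.get(attr, False) == value for attr, value in rule.conditions):
            best = suggestions.get(rule.keyword, 0.0)
            suggestions[rule.keyword] = max(best, rule.confidence)
    return suggestions


# Hypothetical rules; the second one falls below the threshold and is discarded.
candidate_rules = [
    Rule((("PF01493", True),), "FAD", 0.51),
    Rule((("mammalia", False),), "Transmembrane", 0.30),
]
stored_rules = select_rules(candidate_rules)

# Hypothetical unannotated protein described by its signature matches and taxonomy.
protein = {"PF01493": True, "mammalia": False}
suggested = suggest_keywords(stored_rules, protein)
```

Returning the confidence with each suggested keyword lets a user apply a stricter threshold later without regenerating the rules.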

Fig. 5. Dataflow for a hypothetical production run. Rules are generated from InterPro families in SWISS-PROT, their confidence is tested against a given threshold (50%) and they are either discarded or added into the Spearmint database. From there they can be applied on proteins in TrEMBL. Note that rules starting from a given InterPro family are applied on the very same family and that the distribution of the InterPro families is different in SWISS-PROT and TrEMBL.

A graphical user interface was developed that allows browsing of the generated information. It has some functionality implemented that can be valuable for the work of the professional annotators but also for a broader range of applications:

• It is possible to input the accession number of a protein in TrEMBL and get the suggested keyword annotation together with a confidence for each keyword.

• It is possible to track inconsistencies in SWISS-PROT by using SWISS-PROT both as a training and a test set. Sometimes it is not possible to find a rule without FNs and/or FPs. Those can be examined to validate their annotation.

• The manual generation of rules in RuleBase can be supported by proposing rules for a given set of proteins for manual processing.

RESULTS

Successively applied on the proteins assembled in each InterPro entry, the algorithm generated 11 306 rules whose reliability was evaluated by a tenfold cross-validation. The quality of the rules depends on the quality of the data in the training set on the one hand, and on the bias between training data and the data on which rules are going to be applied on the other. The influence of the bias is difficult to measure and is further analyzed below.

Within SWISS-PROT there are different levels of data quality due to a varying degree of experimental verification of different characteristics of a protein. Uncertain or predicted properties are categorized as probable, potential, putative and hypothetical with decreasing reliability in that order (Junker et al., 1999; Apweiler, 2001; http://ch.expasy.org/cgi-bin/lists?annbioch.txt).

The general characterization status of a protein can be taken from the Description Line of the entry. The annotation of hypothetical proteins is usually bare and incomplete, which makes their use in training sets unreasonable; hence they were not used for this purpose (apart from very few hypothetical proteins which are not marked as such in the Description Line). Probable and putative proteins were kept, but will be filtered out in future versions of the tool, since their annotation has unknown reliability.

Fig. 6. Dataflow for a cross-validation run. Rules are generated from InterPro families in SWISS-PROT having trailing accession number digits 0–8, their confidence is tested against a given threshold (90 and 67%) and they are virtually applied on proteins in SWISS-PROT having trailing accession number digit 9. [Diagram labels: trailing digit: 0-8; trailing digit: 9; Count inconsistencies.]

Table 1. Fragments included in training set, confidence > 90%

No. of keywords   Covered keywords   No. of errors   % covered   % errors
28 225            9 629              214             33.36       2.22
28 033            9 533              [214]           33.24       2.24
27 899            9 579              155             33.78       1.62
28 040            9 553              172             33.46       1.80
27 498            9 340              214             33.19       2.29
28 058            9 609              223             33.45       2.32
28 247            9 647              171             33.55       1.77
28 049            9 386              210             32.71       2.24
27 748            9 380              192             33.11       2.05
28 129            9 414              206             32.73       2.19
279 926           95 070             1 971           33.26       2.07   (total)

Table 2. Fragments included in training set, confidence > 67%

No. of keywords   Covered keywords   No. of errors   % covered   % errors
28 225            17 456             1 049           58.13       6.01
28 033            17 337             966             58.40       5.57
27 899            17 148             1 001           57.88       5.84
28 040            17 348             907             58.63       5.23
27 498            17 037             1 045           58.16       6.13
28 058            17 311             1 142           57.63       6.60
28 247            17 457             983             58.32       5.63
28 049            17 274             1 087           57.71       6.29
27 748            16 905             998             57.33       5.90
28 129            17 279             948             58.06       5.49
279 926           172 552            10 126          58.02       5.87   (total)

[In both tables there is one row per end digit of the accession number; the End digit column itself is not recoverable, and rows are listed in their original order. The bracketed value is missing from the extracted text and was recomputed from the column total and the % errors column.]
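The percentage columns appear to follow from the count columns as % covered = (covered − errors) / keywords and % errors = errors / covered. This relation is inferred from the printed numbers rather than stated in the text, so the sketch below merely checks it against the total rows of Tables 1 and 2. The function name is hypothetical.

```python
def percentages(keywords, covered, errors):
    """Return (% covered, % errors) under the inferred column definitions."""
    pct_covered = 100 * (covered - errors) / keywords   # correctly covered keywords
    pct_errors = 100 * errors / covered                 # errors among covered keywords
    return round(pct_covered, 2), round(pct_errors, 2)


# Total rows as printed: No. of keywords, covered keywords, No. of errors.
table1_total = percentages(279_926, 95_070, 1_971)
table2_total = percentages(279_926, 172_552, 10_126)
```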

Protein fragments in the training set also reduce the data quality. Suppose a number of proteins share a common sequence signature that induces a certain annotation. If the database contains only sequence fragments of those proteins, some might show the pattern and others might not, depending on which part of the sequence is covered by the fragment. This is a highly random process introducing noise into the training set, and removing the fragments leads to lower error rates in the cross-validation. But it also leads to a bias between training data and target data, since the latter obviously will contain fragments as well as whole protein sequences.

Cross-validation has been performed on training sets both including and excluding fragments. This was done by splitting the whole set of proteins contained in SWISS-PROT into ten parts of almost equal size. SWISS-PROT accession numbers always end with a digit. The digit does not encode any information about the protein as such and was therefore used as the split criterion. Nine parts of the split were used as training set to generate the decision trees, which were tested on the remaining tenth part, called the test set. This procedure was repeated ten times, each time changing training and test sets. In each run the coverage and the error rate of the generated decision trees were measured. As stated above, not all rules are useful to be applied on unknown data. Rules can be selected using the confidence criterion described in Formula 1 to increase the reliability by decreasing the coverage and vice versa. Two tests were performed to test the value of that criterion and its influence on the observed error rate in the cross-validation: the first test used rules having a confidence of over 90% and the second test used rules with a confidence of over 67%. The numerical results are shown in Tables 1–4.
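The split by trailing accession digit can be sketched as below. The accession strings are made up for illustration and are not meant to refer to real SWISS-PROT entries; the function names are hypothetical.

```python
from collections import defaultdict


def split_by_trailing_digit(accessions):
    """Group accession numbers by their last character (a digit 0-9)."""
    folds = defaultdict(list)
    for accession in accessions:
        folds[accession[-1]].append(accession)
    return folds


def cross_validation_runs(accessions):
    """Yield (digit, training set, test set) for each of the ten runs."""
    folds = split_by_trailing_digit(accessions)
    for digit in "0123456789":
        test = folds.get(digit, [])
        train = [acc for d, fold in folds.items() if d != digit for acc in fold]
        yield digit, train, test


# Made-up accession-like strings, ten per trailing digit.
accessions = [f"P{number:05d}" for number in range(10000, 10100)]
runs = list(cross_validation_runs(accessions))
```

Because the digit carries no biological meaning, the ten parts come out nearly equal in size without any explicit shuffling.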

Table 3. Fragments excluded in training set, confidence > 90%

No. of keywords   Covered keywords   No. of errors   % covered   % errors
25 471            8 443              121             32.67       1.43
25 303            8 381              120             32.65       [1.43]
25 275            8 597              [94]            32.64       1.09
25 418            8 480              102             32.96       1.20
24 824            8 286              138             32.82       1.67
25 338            8 542              155             33.10       1.81
25 676            8 600              101             [33.10]     1.17
25 310            8 277              146             32.13       1.76
25 158            8 335              123             32.64       1.48
25 590            8 348              136             32.09       1.63
253 363           84 289             1 236           32.78       1.47   (total)

Table 4. Fragments excluded in training set, confidence > 67%

No. of keywords   Covered keywords   No. of errors   % covered   % errors
25 471            15 632             803             58.22       5.14
25 303            15 578             752             58.59       4.83
25 275            15 482             762             58.24       4.92
25 418            15 679             668             59.06       4.26
24 824            15 334             808             58.52       5.27
25 338            15 560             907             57.83       5.83
25 676            15 773             769             58.44       4.88
25 310            15 529             845             58.02       5.44
25 158            15 285             775             57.68       5.07
25 590            15 671             748             58.32       4.77
253 363           155 523            7 837           58.29       5.04   (total)

[In both tables there is one row per end digit of the accession number; the End digit column itself is not recoverable, and rows are listed in their original order. Bracketed values are missing from the extracted text and were recomputed from the column totals and the other columns.]

DISCUSSION

Reliability of the results

Obviously, the error rates observed in the cross-validation are better than the confidence obtained from Formula 1 would suggest. This is due to the very conservative calculation of this criterion (z = 1.96) and could indicate a non-statistical distribution of SWISS-PROT proteins, allowing the supposition of higher confidences in rules than statistics assuming a random distribution would suggest. For example, finding four proteins matching a common signature and sharing the same annotation without finding a counter-example produces on average a rule with an error rate much lower than the confidence of 51% predicted by Formula 1 would suggest.

Fig. 7. Distribution of the taxa from which the proteins in IPR000301 descend (selection). [Bar chart comparing TrEMBL and SWISS-PROT for Insecta, Viridiplantae, Vertebrata, Rodentia, Primates and Mammalia; the assignment of the individual percentages is not recoverable.]

Fig. 8. Distribution of protein signatures in IPR000301 (selection). [Bar chart comparing TrEMBL and SWISS-PROT for Pfam PF00039, PRINTS PR00012, Prosite PS01253, their pairwise combinations and all three together; the assignment of the individual percentages is not recoverable.]

Furthermore, for the cross-validation it was assumed that the information in SWISS-PROT is true and without errors. Clearly, this precondition is not completely fulfilled, leading to an increased error rate for the cross-validation. In fact, there are three possible error sources that contribute to inconsistencies in the cross-validation:

(1) The rule suggests a Keyword which is not contained in a target due to a biological reason (True error).

(2) The Keyword has been forgotten to be annotated (Inconsistency in SWISS-PROT).

(3) The protein does not match the precondition of the rule due to a FP match to one of the signature databases (Inconsistency in SWISS-PROT).

Points 2 and 3 suggest that the real error rate is less than the one observed in the cross-validation.

The method assumes equal distribution of the proteins in the training set (SWISS-PROT proteins in an InterPro entry) and the data to be classified (TrEMBL entries or yet unknown sequences matching the same InterPro entry). This is clearly not the case, as Figures 7 and 8 indicate. This distribution is a result of the fact that TrEMBL proteins are not randomly chosen to be annotated and transferred to SWISS-PROT. For example, there is a high interest in human proteins, which leads to their over-representation in comparison to all other species.

Often whole protein families are annotated or updated. For some families, all proteins share the same signatures, which leads to an over-representation of these signatures in SWISS-PROT. Generating rules from these sets and applying them on proteins of different origin or matching different additional patterns might lead to systematic errors. An improved method of validation based on empirical evidence is under construction and is planned to be implemented in later versions of this tool.

At the current status, the rules can be used to support the manual annotation process performed by the SWISS-PROT database curators. The tool is also ready to be made publicly available (http://golgi.ebi.ac.uk:8080/Spearmint/).

Further developments

There are two ways for further developments. Firstly, the rule generation process can be improved and secondly, the rules can be applied in different applications. Currently, the following projects are under development:

• A web-based application that allows application of the rules not only on TrEMBL entries, but also on raw amino acid sequences.

• Mining for Description-, Comment-, and Feature Lines.

• Integration of other data mining techniques.

• Application of rules on entries in Ensembl (http://www.ensembl.org/) to predict functionality.

• Automated application of the rules on proteins in TrEMBL without human interaction.

• Automated rule generation on GO terms (The Gene Ontology Consortium, 2000) rather than on Keywords.

One of the most imperative tasks is to achieve an improved confidence calculation. The current routine does not use information about the number of TNs or FNs in the calculation and hence prefers the generation of frequent keywords rather than rare ones, i.e. general Keywords are produced more often than specific ones, but the latter are the more valuable ones. Extracting reliable rules for rare Keywords will certainly be a very useful improvement of the tool.

For all calculations independence between training and target set was assumed, which is clearly not the case. An empirical test to collect data about the influence of the bias would be helpful to get a better picture of the performance of the rules on unknown data.

CONCLUSION

The presented method mines for Keyword annotation in SWISS-PROT using a Java implementation of the C4.5 algorithm on protein groups assembled in InterPro entries. The results are satisfactory in terms of coverage and confidence, yet it was pointed out that both aspects can be further improved. Including other methods to group proteins into sets containing similar proteins, like CluSTr, ProDom (Corpet et al., 2000) or others, will help to increase the coverage, while a refined statistical analysis will improve the ordering of the generated rules in terms of reliability. This is supposed to lead to higher values of confidence.

REFERENCES

Apweiler,R. (2000) Protein sequence databases. Adv. Protein Chem., 54, 31–71.

Apweiler,R. (2001) Functional information in SWISS-PROT: the basis for large-scale characterisation of protein sequences. Briefings in Bioinformatics, 2, 9–18.

Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A., Birney,E., Biswas,M., Bucher,P., Cerutti,L., Corpet,F., Croning,M.D.R., Durbin,R., Falquet,L., Fleischmann,W., Gouzy,J., Hermjakob,H., Hulo,N., Jonassen,I., Kahn,D., Kanapin,A., Karavidopoulou,Y., Lopez,R., Marx,B., Mulder,N.J., Oinn,T.M., Pagni,M., Servant,F., Sigrist,C.J.A. and Zdobnov,E. (2000) InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145–1150.

Attwood,T.K., Croning,M.D.R., Flower,D.R., Lewis,A.P., Mabey,J.E., Scordis,P., Selley,J.N. and Wright,W. (2000) PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res., 28, 225–227.

Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.

Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L.L. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263–266.

Corpet,F., Servant,F., Gouzy,J. and Kahn,D. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res., 28, 267–269.

Fleischmann,W., Möller,S., Gateau,A. and Apweiler,R. (1999) A novel method for automatic functional annotation of proteins. Bioinformatics, 15, 228–233.

The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nature Genet., 25, 25–29.

Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res., 27, 215–219.

Hyafil,L. and Rivest,R.L. (1976) Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett., 5(1), 15–17.

Junker,V., Apweiler,R. and Bairoch,A. (1999) Representation of functional information in the SWISS-PROT data bank. Bioinformatics, 15, 1066–1067.

Krause,A., Nicodème,E., Bornberg-Bauer,M., Rehmsmeier,M. and Vingron,M. (1999) WWW access to the SYSTERS protein sequence cluster set. Bioinformatics, 15, 262–263.

Kriventseva,E.V., Fleischmann,W., Zdobnov,E.M. and Apweiler,R. (2001) CluSTr: a database of clusters of SWISS-PROT + TrEMBL proteins. Nucleic Acids Res., 29, 33–36.

Möller,S., Leser,U., Fleischmann,W. and Apweiler,R. (1999) EDITtoTrEMBL: a distributed approach to high-quality automated protein sequence annotation. Bioinformatics, 15, 219–227.

Quinlan,J.R. (1986) Induction of decision trees. Mach. Learn., 1, 81–106.

Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA.
