Full Papers
species, and an additional redox pair centered on a nitro functionality or strong hydrogen bond are among the processes that can modulate the potential gap between the oxidized and reduced species (Table 2).
Table 3. Statistical characteristics of consensus multilinear regression (ISIDA MLR) and support vector machine (EED SVM) models for cross-validation (XV) and external test challenges.[a]

Descriptor    Training (103-XV) RMSE    Training (103-XV) R2    Test RMSE           Test R2
ISIDA MLR     0.080                     0.85                    0.032 (0.027)[b]    0.86 (0.90)[b]
EED SVM       0.082                     0.84                    0.054 (0.030)[b]    0.60 (0.88)[b]

QSPR modeling studies of the redox properties
Previous work[22] reported QSPR models for redox potential that were built on a set of various quinones and indolone-N-oxides, both novel and taken from the literature. One of the explored modeling protocols, based on ISIDA molecular fragment counts (the best compromise between accuracy and technical web deployment costs), has been posted on our web server for public use. In essence, it predicts the redox potential by summing fragment-specific increments for each of the key fragments that were found, in the training stage, to best explain the experimental property values. The approach accepts a structure file of organic compounds and then, for each molecule, detects and counts these key fragments. Each occurrence of a key fragment contributes a fragment-specific increment, expressed in volts (positive or negative, as calibrated by hand using training compounds), to the predicted redox potential value. Note that several other theoretical models based on different molecular descriptor schemes, notably Electronic Effect Descriptors (EED), designed for the purpose of modeling reactivity-related properties, have also been explored, with very promising results. Revisiting the technicalities of the previous modeling work is beyond the scope of the present paper. The reader is referred to the previous article for details on the molecular descriptor schemes employed, which are also adopted in the present work.
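The additive scheme described above can be illustrated with a minimal sketch. The fragment labels, the increment values, and the intercept term are hypothetical placeholders chosen for illustration; they are not the coefficients of the published model, and the text does not state whether a baseline term is used.

```python
# Minimal sketch of an additive fragment-increment model.
# All fragment labels and increment values are hypothetical placeholders,
# not the coefficients of the published ISIDA model.
HYPOTHETICAL_INCREMENTS_V = {
    "C=O": -0.05,
    "C-N": 0.02,
    "O=C-C:C-O": -0.10,
}


def predict_redox_potential(fragment_counts, increments, intercept=0.0):
    """Sum one increment (in volts) per occurrence of each key fragment.

    Fragments without a calibrated increment contribute nothing.
    The intercept is an assumed baseline term, not taken from the source.
    """
    total = intercept
    for fragment, count in fragment_counts.items():
        total += increments.get(fragment, 0.0) * count
    return total


# Hypothetical molecule: two C=O fragments, one C-N fragment, and five
# occurrences of a fragment ("C-C") that carries no increment.
example_counts = {"C=O": 2, "C-N": 1, "C-C": 5}
predicted = predict_redox_potential(
    example_counts, HYPOTHETICAL_INCREMENTS_V, intercept=-0.50
)
```

Because the prediction is a plain sum over fragment counts, a fragment contributes its increment wherever it occurs in the molecule, regardless of its chemical context.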
[a] RMSE: root-mean-squared error; R2: determination coefficient. [b] Adjusted statistical parameters when the outlier (2a) is excluded from the test set.
the best descriptor space to host optimal SVM models, EED terms clearly outperformed individual ISIDA descriptor spaces. However, models based on different ISIDA molecular fragments were combined into a consensus model (in which only models with a cross-validation R2 > 0.5 are accepted). This consensus effect, over many different ISIDA descriptor spaces, interestingly compensated both for the advantage of EED over each individual ISIDA fragmentation scheme and for the presumed advantage of nonlinear modeling. As shown, the single-descriptor-space (EED) nonlinear model and the multi-fragmentation consensus approach eventually performed equally, and very well, in a threefold cross-validation challenge.
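The acceptance rule for the consensus model can be sketched as follows. The text only states that individual models with a cross-validation R2 above 0.5 are accepted; combining the accepted predictions by an unweighted mean is an assumption made here for illustration, and the numerical values are made up.

```python
from statistics import mean


def consensus_prediction(model_outputs, r2_threshold=0.5):
    """Average the predictions of models whose cross-validated R2 exceeds the threshold.

    model_outputs: list of (r2_xv, predicted_value) pairs, one per individual model.
    The unweighted mean is an assumed combination rule, not taken from the source.
    """
    accepted = [pred for r2_xv, pred in model_outputs if r2_xv > r2_threshold]
    if not accepted:
        raise ValueError("no individual model passed the R2 threshold")
    return mean(accepted)


# Made-up outputs of three individual models for one molecule:
# the second model fails the acceptance criterion and is ignored.
outputs = [(0.82, -0.40), (0.45, -0.10), (0.71, -0.44)]
consensus = consensus_prediction(outputs)
```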
It is, nevertheless, of greater practical interest to evaluate a model’s performance using the external test set. As Table 3 shows, consensus models based on ISIDA descriptors performed well in both cases, with the R2 of the test being close to the R2 of cross-validation. The model with EED descriptors had a lower overall test R2, but this was due to a single poorly predicted molecule, which, given the modest size of the test set, had a large impact on the overall score (Figure 6). The poorly predicted molecule (2a) is the derivative with a hydroxy substituent, and the phenomenon has been reported previously.[22] However, the updated training set now includes examples of analogues with such an intramolecular hydrogen bond and, accordingly, the structural pattern was recognized when using ISIDA fragment counts. The monitored atom and bond sequences, with lengths ranging from two to ten, include the corresponding O=C-C:C-O pattern (“:” represents an aromatic bond) of the fragment responsible for the intramolecular hydrogen bond, and this pattern only appears in the presence of a 5-hydroxy substituent. Therefore, the model learned to associate this fragment with a negative increment for the predicted redox potential (the hydrogen bond polarizes the quinone carbonyl even further, rendering the quinone system more electron-depleted). Of course, the presence of such a fragment is not in itself sufficient to trigger a decrease in the redox potential; what matters is whether the fragment is in the right context, that is, involving the actual quinone carbonyl and the adjacent phenolic hydroxy group in a hydrogen bonding position. Should the mentioned sequence appear in other moieties of a molecule, not connected to the quinone system, the model would still add the associated (negative) increment to the predicted redox potential, which would likely result in an error. This shows that even the significantly extended compound set used in this work is still prone
Expansion of the chemical space of interest to include benzoyl derivatives naturally raises the question of the competence of the previous model with respect to this new chemotype, which has not been previously employed for training. Therefore, the first logical step was to challenge the old model to make predictions for the newly synthesized compounds. The predictions by the model are, in absolute terms, quite inaccurate (root-mean-squared error [RMSE]=0.176). This is not surprising, as the tested compounds are all derivatives of new families; the molecules contain benzyl and benzoyl substituents that were never present in the training set of the model, which therefore had no chance to learn their impact on the redox potential value (Figure S43). The experimental, predicted (with and without the applicability domain), and corrected (by correction coefficients) values are available as Supporting Information (Table S4).
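The failure mode described above, substituents that the model never saw during training, can be flagged with a simple fragment-based check. The sketch below is an assumption about how such a check might look; it is not the applicability-domain definition used in this work, and all fragment labels are hypothetical.

```python
def unseen_fragments(fragment_counts, training_fragments):
    """Return the fragments present in a molecule but absent from the training set."""
    return sorted(
        fragment
        for fragment, count in fragment_counts.items()
        if count > 0 and fragment not in training_fragments
    )


def in_applicability_domain(fragment_counts, training_fragments):
    """A molecule is considered inside the domain only if all its fragments were seen in training."""
    return not unseen_fragments(fragment_counts, training_fragments)


# Hypothetical fragment vocabulary of an older training set.
training_fragments = {"C=O", "C:C", "C-N"}

# Hypothetical new compound carrying a fragment ("C:C-C=O") that was never trained on.
new_compound = {"C=O": 3, "C:C": 12, "C:C-C=O": 1}
flagged = unseen_fragments(new_compound, training_fragments)
inside = in_applicability_domain(new_compound, training_fragments)
```

A prediction for a compound flagged in this way is an extrapolation, because no increment could have been calibrated for the unseen fragment.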
Table 3 reports the statistical parameters for new models built on a combination of old and new data with different methods. The training set for modeling consisted of 81 molecules (all previously used data and examples of both new families combined), and 14 benzyl and benzoyl derivatives, either manually selected or tested only after model building, served as an external test set.
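The two statistics reported in Table 3, and the effect of excluding a single outlier from a small test set, can be illustrated with a short sketch. The determination coefficient is computed here as 1 - SS_res/SS_tot, which is an assumption about the definition used; the observed and predicted values are made-up numbers, not data from this work.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root-mean-squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


def r2(observed, predicted):
    """Determination coefficient, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up test set in which the last compound is poorly predicted.
observed = [-0.50, -0.40, -0.30, -0.20]
predicted = [-0.48, -0.41, -0.31, -0.05]

r2_all = r2(observed, predicted)
rmse_all = rmse(observed, predicted)

# Same statistics with the outlier (last entry) excluded.
r2_trimmed = r2(observed[:-1], predicted[:-1])
rmse_trimmed = rmse(observed[:-1], predicted[:-1])
```

With so few test compounds, one large residual dominates both sums, so both statistics improve sharply once that compound is removed.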
The cross-validation R2 scores were quite high for all modeling methods. Note that, in the evolutionary competition for
ChemMedChem 2016, 11, 1339 – 1351
© 2016 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim