Journal of the American Chemical Society
Article
selected from the in silico library of phosphoric acids with the
Kennard-Stone algorithm were used in the training set (Figure
9). Further, only reactions containing imines 1 and 2 and thiols
A−C (Figure 5) were used in training and cross-validation. Thus, 12
catalysts with 6 substrate combinations gave 72 reactions for
model training and validation. Although not an insignificant
number of reactions, performing this number of reactions is
well within the capability of most synthetic organic chemistry
laboratories. The remaining 1003 reactions were used as an
external test set.
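The training-set construction described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the descriptor vectors are random stand-ins for the catalyst descriptors, the library size of 50 is arbitrary, and the function name `kennard_stone` is hypothetical. The sketch uses the common max-min form of the Kennard-Stone algorithm, seeded with the two most distant points, and then crosses the selected catalysts with the six substrate combinations to recover the 72 training reactions.

```python
import math
import random
from itertools import combinations, product


def kennard_stone(points, k):
    """Return indices of k points chosen by max-min (Kennard-Stone) selection."""
    n = len(points)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= len(points)")

    def dist(i, j):
        return math.dist(points[i], points[j])

    # Seed with the two most distant points, which are boundary cases.
    first, second = max(combinations(range(n), 2), key=lambda p: dist(*p))
    selected = [first, second]
    # min_d[i] is the distance from candidate i to its nearest selected point.
    min_d = {i: min(dist(i, first), dist(i, second))
             for i in range(n) if i not in (first, second)}
    while len(selected) < k:
        # Add the candidate that is farthest from everything selected so far.
        nxt = max(min_d, key=min_d.get)
        selected.append(nxt)
        del min_d[nxt]
        for i in min_d:
            min_d[i] = min(min_d[i], dist(i, nxt))
    return selected


# Assumption: a toy library of 50 catalysts with 4 random descriptors each.
rng = random.Random(0)
library = [tuple(rng.gauss(0.0, 1.0) for _ in range(4)) for _ in range(50)]
chosen = kennard_stone(library, 12)

# Imines 1 and 2 with thiols A-C give 6 substrate combinations.
substrates = list(product(["imine 1", "imine 2"], ["thiol A", "thiol B", "thiol C"]))
reactions = [(cat, imine, thiol) for cat in chosen for imine, thiol in substrates]
print(len(reactions))  # 12 catalysts x 6 substrate combinations
```

Because each new point maximizes its distance to the points already chosen, the selection spreads over the descriptor space rather than clustering in dense regions.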
Three different model types were investigated: PLS, SVR,
and random forest (RF). PLS was selected because it is a
simple linear model with well-precedented use in
chemoinformatics using molecular field-type descriptors with
relatively limited data sets. SVR and RF were selected because
they are popular machine learning methods capable of
modeling nonlinear relationships. For the PLS model, the
optimal number of latent variables was determined using 3-fold
cross-validation. Similarly, the hyperparameters for the
SVR and RF models were selected by a grid search,
with the best performers identified by q2.
The complete protocol for model optimization and selection is
When comparing the cross-validation results, the SVR model
is the highest performer (q2 = 0.803), followed by the PLS
model (q2 = 0.785) and finally the RF model (q2 = 0.693).
Using this metric, the more complex model is actually a higher
performer in this data-limited case study. When the performance
of each model is compared by predicting the external test
set, a similar result is observed (Figure 10). In this analysis, the
Figure 11. Learning curve, depicting MADtest and q2(5-fold) plotted
against the number of training reactions.
performance. In practice, if one desired to run only 24
reactions, it is likely that a single substrate combination would
be used with the Universal Training Set (UTS). To illustrate
this, we have constructed a PLS model examining only one
reaction, the enantioselective addition of thiol B to imine 1.
The 24 UTS catalysts have been used to train the model and
the 19 test set catalysts have been used to evaluate the model.
This model is indeed very accurate (q2 = 0.70, MADtest =
0.156 kcal/mol), and the most selective catalyst for this
reaction, which was not included in the training data, is indeed
predicted as the most selective catalyst (Figure 12).
RF most accurately predicts the external test set (MADtest = 0.23
kcal/mol), followed by the SVR model (MADtest = 0.24 kcal/mol),
and third the PLS model (MADtest = 0.25 kcal/mol).61 Thus, even in
data-limited cases, more complicated machine learning models can be
as good as or better than simpler modeling techniques. This
illustration indicates that researchers
should consider using more complex models even in data-limited
scenarios. It is noteworthy that this phenomenon is likely
dependent on the specific system and molecular representation
used in any study.
2.4. Case Study 4: Learning Curve Generation. To
further examine the amount of data necessary to create
accurate models, a learning curve was constructed. To be
consistent in analysis of the results, the 384 training +
validation/691 test reaction partitioning used in the original
study was used at the outset of this experiment. From the 384
possible training reactions, some number of reactions n were
randomly selected for use in model training and
cross-validation, and those models were further evaluated by the
MAD in the 691-member external test set. For each value of n,
five training sets were randomly selected. These training sets
were used in an ensemble of linear models, and the average
MAD of the test set for each value of n was used to evaluate
the ensemble. The results are depicted in Figure 11.
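A learning curve of this kind can be sketched as follows. The sketch makes several assumptions that are not taken from the study: the data are synthetic, ordinary least squares stands in for the linear models, the list of training-set sizes is illustrative, and the MAD for each n is read as the average over the five randomly drawn training sets. Only the 384/691 partitioning, the five repeats, and the 5-fold q2 follow the description above.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]


def q2_kfold(X, y, k, rng):
    """q2 = 1 - PRESS / SS_total from pooled k-fold predictions."""
    folds = np.array_split(rng.permutation(len(y)), k)
    y_cv = np.empty_like(y)
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        y_cv[held_out] = predict_ols(fit_ols(X[train], y[train]), X[held_out])
    return 1.0 - np.sum((y - y_cv) ** 2) / np.sum((y - y.mean()) ** 2)


def learning_curve(X_pool, y_pool, X_test, y_test, sizes, repeats=5, seed=0):
    """Return (n, mean q2, mean test MAD) for each training-set size n."""
    rng = np.random.default_rng(seed)
    rows = []
    for n in sizes:
        q2s, mads = [], []
        for _ in range(repeats):
            # Draw n training reactions at random from the pool.
            idx = rng.choice(len(y_pool), size=n, replace=False)
            q2s.append(q2_kfold(X_pool[idx], y_pool[idx], 5, rng))
            coef = fit_ols(X_pool[idx], y_pool[idx])
            mads.append(np.mean(np.abs(y_test - predict_ols(coef, X_test))))
        rows.append((n, float(np.mean(q2s)), float(np.mean(mads))))
    return rows


# Assumption: synthetic descriptors and responses for 1075 reactions.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(1075, 8))
y = X @ data_rng.normal(size=8) + data_rng.normal(scale=0.3, size=1075)
X_pool, y_pool, X_test, y_test = X[:384], y[:384], X[384:], y[384:]

sizes = [24, 48, 96, 144, 192, 240, 288, 336]
curve = learning_curve(X_pool, y_pool, X_test, y_test, sizes)
for n, q2, mad in curve:
    print(n, round(q2, 3), round(mad, 3))
```

Averaging over several random draws at each n reduces the dependence of the curve on any single training subset.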
As the number of training reactions increases from 24 to
336, a notable increase in q2 occurs until 96 training reactions
are reached. From 144 to 336 training reactions, only
incremental improvements in q2 are observed. However, the
MADtest continues to improve with an increasing number of
training reactions, reaching 0.21 kcal/mol at 336 reactions. It is
worth noting that this model is relating both catalyst and
substrate features to enantioselectivity; thus, it is unsurprising
that extremely data-limited sets (n = 24) have poor
Figure 12. Model for the enantioselective addition of thiol B to imine
1.
2.5. Case Study 5: Improved Predictive Performance
with Algorithmic Training Set Selection. In our first
publication of this computer-guided workflow,49 an algorithmically
selected set of compounds from the in silico library was
identified, termed the UTS. The logic behind this subset
selection using the Kennard−Stone algorithm62 is as follows:
(1) the in silico library contains every catalyst candidate that is
of interest to be evaluated experimentally on the basis of its
synthetic accessibility, (2) the Kennard−Stone algorithm will
select boundary cases and sample uniformly over the chemical
space of interest, and (3) consequently, all future predictions
should be within the convex hull of the training data.
Consequently, future predictions will be interpolative; we
hypothesize this process will lead to greater confidence in future
predictions. Alternatively, many researchers might be interested