E.V. Filho et al.
Bioorganic & Medicinal Chemistry Letters 48 (2021) 128240
Table 5
Data frame applied to build machine learning models.
Compound  Total Mol   cLogP  cLogS   Total Surface  Relative  Polar Surface  Druglikeness  Binding Energy (kcal/mol)
          Weight                     Area           PSA       Area                         1M17   4HLW   5LGE   5QGF
6a        356.229     2.35   −3.526  190.53         0.19530   51.8           −1.44         −7.9   −6.9   −7.4   −7.6
6b        402.234     2.41   −5.833  265.66         0.25457   97.62          −9.09         −8.1   −6.4   −7.4   −7.9
6c        371.244     1.67   −3.602  199.05         0.26365   77.82          −1.44         −7.9   −6.7   −7.3   −7.5
6d        314.351     2.62   −3.732  245.3          0.26914   82.51          3.03          −8.6   −7.4   −7.4   −7.4
6e        359.348     1.69   −4.192  268.97         0.35855   128.33         −2.34         −9.0   −8.5   −8.1   −7.8
6f        329.366     1.94   −3.808  253.82         0.32027   108.53         3.03          −8.2   −7.3   −7.5   −7.2
6g        253.328     3.31   −4.436  194.76         0.29559   80.04          1.46          −7.2   −6.3   −6.8   −6.4
6h        236.277     2.10   −3.231  188.39         0.27178   67.59          0.52          −7.7   −6.5   −7.0   −6.7
9a        404.293     3.86   −4.272  282.95         0.06305   17.82          1.29          −8.7   −7.1   −7.2   −6.5
9b        449.290     2.94   −4.732  306.62         0.15739   63.64          −3.90         −8.1   −7.0   −7.2   −6.6
9c        419.308     3.18   −4.348  291.47         0.11360   43.84          1.25          −8.3   −6.7   −7.4   −6.3
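As a minimal sketch of how Table 5 could be turned into model inputs, the rows can be parsed into records and a feature matrix selected by column name. The column identifiers (`clogp`, `be_4HLW` and so on) and the helper names are illustrative assumptions, not the authors' code. The original work used pandas, and the same records could be passed to `pandas.DataFrame`.

```python
# Rows transcribed from Table 5 (binding energies in kcal/mol).
# Column identifiers below are illustrative names, not the authors' own.
COLUMNS = [
    "compound", "mol_weight", "clogp", "clogs", "total_surface_area",
    "relative_psa", "polar_surface_area", "druglikeness",
    "be_1M17", "be_4HLW", "be_5LGE", "be_5QGF",
]

TABLE_5 = """
6a 356.229 2.35 -3.526 190.53 0.19530 51.80 -1.44 -7.9 -6.9 -7.4 -7.6
6b 402.234 2.41 -5.833 265.66 0.25457 97.62 -9.09 -8.1 -6.4 -7.4 -7.9
6c 371.244 1.67 -3.602 199.05 0.26365 77.82 -1.44 -7.9 -6.7 -7.3 -7.5
6d 314.351 2.62 -3.732 245.30 0.26914 82.51 3.03 -8.6 -7.4 -7.4 -7.4
6e 359.348 1.69 -4.192 268.97 0.35855 128.33 -2.34 -9.0 -8.5 -8.1 -7.8
6f 329.366 1.94 -3.808 253.82 0.32027 108.53 3.03 -8.2 -7.3 -7.5 -7.2
6g 253.328 3.31 -4.436 194.76 0.29559 80.04 1.46 -7.2 -6.3 -6.8 -6.4
6h 236.277 2.10 -3.231 188.39 0.27178 67.59 0.52 -7.7 -6.5 -7.0 -6.7
9a 404.293 3.86 -4.272 282.95 0.06305 17.82 1.29 -8.7 -7.1 -7.2 -6.5
9b 449.290 2.94 -4.732 306.62 0.15739 63.64 -3.90 -8.1 -7.0 -7.2 -6.6
9c 419.308 3.18 -4.348 291.47 0.11360 43.84 1.25 -8.3 -6.7 -7.4 -6.3
"""


def load_rows(text):
    """Parse whitespace-separated rows into one dict per compound."""
    rows = []
    for line in text.strip().splitlines():
        name, *values = line.split()
        row = {"compound": name}
        row.update(zip(COLUMNS[1:], map(float, values)))
        rows.append(row)
    return rows


def feature_matrix(rows, names):
    """Select the named descriptor columns, one list per compound."""
    return [[row[name] for name in names] for row in rows]


rows = load_rows(TABLE_5)
# Features reported as best for the PC3 linear regression model.
X_pc3 = feature_matrix(rows, ["clogp", "druglikeness", "be_4HLW"])
print(rows[0]["compound"], X_pc3[0])
```

Keeping the binding energies as separate columns per target makes it straightforward to swap `be_4HLW` for `be_5LGE` when building the model for a different cell line.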
An exhaustive combination of features was carried out to search for the
best ML model for each cell line using the KNN and LR algorithms.
Significant models were found for the SNB19 and PC3 cell lines through
KNN and LR, respectively. In both models, the best features were cLogP,
druglikeness and the binding energy for the respective molecular target
(5LGE for SNB19 and 4HLW for PC3). The KNN model achieved a precision
of 1.00 for active compounds and 0.67 for inactive compounds, with an
accuracy of 0.83 and a cross-validation value of 0.67. These results
mean that the KNN model predicted active compounds well (100%) but was
less efficient at predicting inactive compounds. This weakness in the
KNN model is due to the docking methodology, which generates
false-positive results. In addition, the LR model had a regression
coefficient of 0.59, a cross-validation value of 0.95 and a root mean
square error of 0.32. Coefficient values of −0.48, 0.02 and −0.09 were
obtained for 4HLW, cLogP and druglikeness, respectively.
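The exhaustive feature search described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' notebook: the descriptor and activity values are toy numbers, leave-one-out accuracy stands in for the unspecified cross-validation scheme, and the KNN classifier is written by hand so that the example needs only the standard library. Only the labelling rule (a value of 20,000 marks an inactive compound) and the neighbour count of 3 come from the text.

```python
from collections import Counter
from itertools import combinations
from math import dist

# Toy data for illustration only: NOT the values from Table 5 or the assays.
FEATURES = ["cLogP", "druglikeness", "binding_energy"]
X = [
    [2.1, -1.0, -7.6],
    [2.4, 0.5, -7.9],
    [1.8, 1.2, -8.1],
    [3.0, -2.0, -7.5],
    [3.4, 2.5, -6.4],
    [3.9, 3.0, -6.6],
    [3.6, -3.5, -6.2],
    [3.2, 0.8, -6.8],
]
activity = [850, 1200, 640, 3100, 20000, 20000, 20000, 20000]

# Labelling rule from the text: a value equal to 20,000 means inactive (0).
y = [0 if value == 20000 else 1 for value in activity]


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k nearest training points (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, cols, k=3):
    """Leave-one-out accuracy of KNN restricted to the columns in cols."""
    hits = 0
    for i in range(len(X)):
        train_X = [[row[c] for c in cols] for j, row in enumerate(X) if j != i]
        train_y = [label for j, label in enumerate(y) if j != i]
        query = [X[i][c] for c in cols]
        hits += knn_predict(train_X, train_y, query, k) == y[i]
    return hits / len(X)


def exhaustive_search(X, y, names, k=3):
    """Score every non-empty feature subset and return the best one."""
    scores = {}
    for size in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), size):
            subset = tuple(names[c] for c in cols)
            scores[subset] = loo_accuracy(X, y, cols, k)
    best = max(scores, key=scores.get)
    return best, scores


best_subset, scores = exhaustive_search(X, y, FEATURES)
print(best_subset, scores[best_subset])
```

With seven descriptors and one binding energy per target, the number of subsets stays small enough to enumerate exhaustively. The sketch leaves the features unscaled, so descriptors with large numeric ranges would dominate the distance unless they are standardized first.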
Fig. 8 shows the correlation between the real and predicted biological
activity values. These findings suggest a good model, able to predict
biological activity. For instance, the real log biological activity
values were 9.22, 8.67 and 9.90, whereas the predicted values were
9.12, 9.33 and 9.60 for compounds 6a, 6f and 9b, respectively. Hence,
two good ML models were obtained, which can be used to predict the
biological activity against the SNB19 and PC3 cell lines, respectively.
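To make the comparison concrete, the residuals and root mean square error for the three quoted compounds can be computed directly. This covers only the three value pairs reported in the text, so the result is not expected to equal the RMSE of 0.32 reported for the full model.

```python
from math import sqrt

# Real and predicted log activity values quoted in the text.
real = {"6a": 9.22, "6f": 8.67, "9b": 9.90}
predicted = {"6a": 9.12, "6f": 9.33, "9b": 9.60}

# Residual = real minus predicted, per compound.
residuals = {name: real[name] - predicted[name] for name in real}
rmse = sqrt(sum(r * r for r in residuals.values()) / len(residuals))

for name, r in residuals.items():
    print(f"{name}: residual = {r:+.2f}")
print(f"RMSE over these three compounds: {rmse:.2f}")
```

Of the three compounds, 6f shows the largest deviation between the real and predicted values.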
Fig. 8. Correlation between real and predicted biological activity for the PC3 cell line
through linear regression using 4HLW, cLogP and druglikeness as features.
are more active than the crystallographic ligand for the cell line
previously described.
Fig. 7 summarizes the ligands with the best biological activity. As can
be seen, the ferrocene compounds 6a and 9c showed the best biological
activity values for the HCT116 (Fig. 7A), SNB19 (Fig. 7C) and HL60
(Fig. 7D) cell lines, with binding energies of −8.3, −7.4 and
−7.6 kcal·mol⁻¹, respectively. These values are lower than those of the
respective crystallographic ligands (Fig. 6). In addition, as can be
seen in Fig. 6, the ferrocene moiety can form complexes within the
hydrophobic and hydrophilic cavities of the molecular targets. This
feature can explain their biological activity, as the ferrocene
compounds were active in 3 of the 4 cell lines. In contrast, the
pyrimidine 6g was the most active compound for the PC3 cell line, with
a binding energy of −6.3 kcal·mol⁻¹, performing hydrophilic and
hydrophobic molecular interactions with the 4HLW molecular target.
For machine learning (ML), the binding energy (BE) was taken from
AutoDock Vina⁵¹ and the physicochemical features (descriptors) were
obtained with the DataWarrior software.⁵² The parameters cLogP, cLogS,
total molecular weight, relative polar surface area (PSA),
druglikeness, total surface area and polar surface area were used to
develop the ML models. Thus, supervised and unsupervised models were
generated using K-Nearest Neighbors (KNN) and Linear Regression (LR) in
Jupyter Notebook.⁵³ The data frame was pre-processed using the pandas
and numpy libraries, and a log function was applied to the biological
activity values. No missing values were found in the data frame. In
addition, compounds with a value equal to 20,000 were classified as
inactive and the others as active to build the supervised KNN model, in
which the n value was set to 3. The biological activity was the target
to be predicted. In addition, structures with biological activity
greater than 2000 were removed from the dataset. The binding energy,
together with the other descriptors, was used to generate the machine
learning models, as shown in Table 5.
In conclusion, we developed a simple, fast and efficient methodology
for the Atwal reaction under microwave irradiation to synthesize eight
2-amino-4-phenylpyrimidines substituted at carbon C6 with heterocycles
and ferrocene, and three pyrazole derivatives with phenyl and
ferrocene, in yields ranging from good to excellent (52–80%).
Furthermore, eight crystal structures were determined successfully.
All hybrids were screened for their in vitro anti-proliferative
activities against four cancer cell lines, and it was possible to
correlate the anticancer activity with the structures of the
synthesized hybrids. Compounds 6g and 9c stood out, showing predominant
cytotoxic potential in all tested cell lines. Furthermore, the
biological activity for the PC3 cell line could be estimated by docking
simulation, whereas the biological activity for HL60 and HCT116 could
be estimated by ML models using LR and RF, respectively. These
calculations can be used to design new compounds with anticancer
activity.
Declaration of Competing Interest
The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
the work reported in this paper.
Acknowledgments
The authors acknowledge the Conselho Nacional de Desenvolvimento
Científico e Tecnológico (CNPq 305117/2017-3 and 2020), the
Coordenadoria de Aperfeiçoamento de Pessoal do Nível Superior