Pharmaceuticals 2021, 14, 540
generation and validation of QSAR models were performed using QSARINS 2.2.4 (University
of Insubria, Varese, Italy) [46].
To reduce the large number of calculated descriptors, QSARINS rejected constant and
semi-constant descriptors, i.e., those with a constant value for more than 85% of compounds,
as well as descriptors that were too intercorrelated (>95%). The final number
of remaining descriptors was 514. Due to the high number of inactive compounds (14), 8
of them were randomly chosen and excluded from the dataset. A genetic algorithm (GA)
was used to generate the best model. The number of descriptors in the multiple linear
regression equation was limited to three. The splitting of compounds into the training set
(n = 27 molecules) and test set (n = 5 molecules) was performed by activity sampling [47].
Compounds were ranked by their activities (from the most active to the least active
compound) and then divided into five groups of approximately the same size. One compound
was selected randomly from each group and assigned to the test set. The models were
validated by internal cross-validation performed using the “leave-one-out” (LOO)
and Y-scrambling methods [46]. The following evaluation criteria were included: coefficient
of determination ($R^2$), adjusted coefficient of determination ($R^2_{adj}$), cross-validated
correlation coefficient ($Q^2_{LOO}$), inter-correlation among descriptors ($K_{xx}$), the
difference between the correlation among the descriptors and the correlation among the
descriptors plus the responses ($\Delta K$), the standard deviation of regression ($s$),
Fisher ratio ($F$), root-mean-square error (RMSE), LOO cross-validated root-mean-square
error (RMSEcv), concordance correlation coefficient (CCC), LOO cross-validation concordance
correlation coefficient (CCCcv), mean absolute error of the training set (MAE), mean absolute
error of the internal validation set (MAEcv), and LOO cross-validated predictive residual
sum of squares (PRESScv). QSAR model robustness was tested using the Y-randomization
test, giving $R^2_{Yscr}$ and $Q^2_{Yscr}$ values [34].
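The internal validation described above can be illustrated with a minimal sketch. It is not the QSARINS implementation. It assumes an ordinary least-squares model with an intercept, the common definition $Q^2_{LOO} = 1 - \mathrm{PRESS}/\mathrm{TSS}$, and Y-scrambling summarized as the mean over 100 random permutations of the response. The function names, the number of permutations, and the synthetic data (27 compounds, 3 descriptors) are illustrative assumptions, not values taken from the study.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predictions of a fitted model for the rows of X."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def r2(X, y):
    """Coefficient of determination of the fitted model."""
    y_hat = predict_mlr(fit_mlr(X, y), X)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def q2_loo(X, y):
    """Leave-one-out Q2: refit without compound i, then predict compound i."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_mlr(X[keep], y[keep])
        press += (y[i] - predict_mlr(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def y_scramble(X, y, n_iter=100, seed=0):
    """Mean R2 and Q2_LOO over models fitted to permuted responses.

    n_iter and seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    r2_values, q2_values = [], []
    for _ in range(n_iter):
        y_perm = rng.permutation(y)
        r2_values.append(r2(X, y_perm))
        q2_values.append(q2_loo(X, y_perm))
    return float(np.mean(r2_values)), float(np.mean(q2_values))

# Synthetic stand-in for a 27-compound training set with 3 descriptors.
rng = np.random.default_rng(42)
X = rng.normal(size=(27, 3))
y = X @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.3, size=27)

r2_model = r2(X, y)
q2_model = q2_loo(X, y)
r2_yscr, q2_yscr = y_scramble(X, y)
print(f"R2={r2_model:.3f}  Q2_LOO={q2_model:.3f}")
print(f"R2_Yscr={r2_yscr:.3f}  Q2_Yscr={q2_yscr:.3f}")
```

If the descriptors carry real information, the scrambled values should fall well below those of the original model, which is the behavior the Y-randomization test checks for.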
External validation parameters included the coefficient of determination of the test set
($R^2_{ext}$), external validation set root-mean-square error (RMSEext), external validation
set concordance correlation coefficient (CCCext), external validation set mean absolute error
(MAEext), predictive squared correlation coefficients ($Q^2_{F1}$, $Q^2_{F2}$, $Q^2_{F3}$) and
the average value of squared correlation coefficients between the observed and LOO predicted
values of the compounds with and without intercept ($r^2_m$) [48].
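As an illustration of how several of the external validation parameters listed above relate to each other, the following sketch computes them from observed and predicted activities. It assumes the commonly used definitions: $Q^2_{F1}$ referenced to the training-set mean, $Q^2_{F2}$ referenced to the external-set mean, $Q^2_{F3}$ normalized by set sizes, and Lin's concordance correlation coefficient. The function name and the toy numbers are hypothetical, and $r^2_m$ is not covered.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def external_metrics(y_train, y_ext, y_pred):
    """External validation statistics for observed (y_ext) vs predicted (y_pred)."""
    n_tr, n_ext = len(y_train), len(y_ext)
    m_tr, m_ext, m_pred = mean(y_train), mean(y_ext), mean(y_pred)

    press = sum((o - p) ** 2 for o, p in zip(y_ext, y_pred))
    ss_ext_about_train = sum((o - m_tr) ** 2 for o in y_ext)
    ss_ext = sum((o - m_ext) ** 2 for o in y_ext)
    ss_train = sum((o - m_tr) ** 2 for o in y_train)
    ss_pred = sum((p - m_pred) ** 2 for p in y_pred)
    cov = sum((o - m_ext) * (p - m_pred) for o, p in zip(y_ext, y_pred))

    return {
        "Q2_F1": 1.0 - press / ss_ext_about_train,
        "Q2_F2": 1.0 - press / ss_ext,
        "Q2_F3": 1.0 - (press / n_ext) / (ss_train / n_tr),
        "CCC_ext": 2.0 * cov / (ss_ext + ss_pred + n_ext * (m_ext - m_pred) ** 2),
        "RMSE_ext": sqrt(press / n_ext),
        "MAE_ext": sum(abs(o - p) for o, p in zip(y_ext, y_pred)) / n_ext,
    }

# Toy numbers chosen so the results can be checked by hand.
demo = external_metrics(
    y_train=[1.0, 2.0, 3.0, 4.0, 5.0],
    y_ext=[2.0, 4.0],
    y_pred=[2.5, 3.5],
)
print(demo)
# Q2_F1 = 0.75, Q2_F2 = 0.75, Q2_F3 = 0.875, CCC_ext = 0.8, RMSE_ext = 0.5, MAE_ext = 0.5
```

The three $Q^2_F$ variants differ only in the reference sum of squares. For that reason they can diverge noticeably when the test set is small or covers a narrower activity range than the training set.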
To identify possible outliers and compounds beyond the warning leverage ($h^*$) of
a model, a leverage plot (plot of standardized residuals vs. leverages ($h$); the Williams
plot) was used. The warning leverage is generally defined as $3p'/n$ ($n$ being the number of
training compounds and $p'$ the number of adjustable model parameters) [49]. Outliers in
the Williams plot are compounds whose standardized residuals are greater than
two standard deviation units.
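The quantities behind a Williams plot can be sketched as follows. The sketch makes several assumptions: leverages are the diagonal of the hat matrix, $h_i = x_i^T (X^T X)^{-1} x_i$, with an intercept column included in $X$; $p'$ is taken as the number of descriptors plus one; residuals are standardized by the regression standard deviation; and the two-unit limit is applied to absolute values. The function names and the synthetic data are hypothetical and do not reproduce the QSARINS implementation.

```python
import numpy as np

def leverages(X_train, X_query):
    """Hat-matrix leverages of X_query rows relative to the training design."""
    A_tr = np.column_stack([np.ones(len(X_train)), X_train])
    A_q = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.inv(A_tr.T @ A_tr)
    # Row-wise quadratic form a_i^T (A^T A)^-1 a_i
    return np.einsum("ij,jk,ik->i", A_q, xtx_inv, A_q)

def warning_leverage(n_train, n_descriptors):
    """h* = 3p'/n, assuming p' = number of descriptors + 1 (intercept)."""
    return 3.0 * (n_descriptors + 1) / n_train

def williams_flags(X_train, y_train, sd_limit=2.0):
    """Leverages, standardized residuals, and the two flag vectors."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    resid = y_train - A @ coef
    n, p_prime = A.shape
    s = np.sqrt(np.sum(resid ** 2) / (n - p_prime))  # regression std. deviation
    std_resid = resid / s
    h = leverages(X_train, X_train)
    h_star = warning_leverage(n, p_prime - 1)
    return h, std_resid, h > h_star, np.abs(std_resid) > sd_limit

# Synthetic stand-in: 27 training compounds, 3 descriptors.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(27, 3))
y_demo = X_demo @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.3, size=27)

h, std_resid, high_leverage, outlier = williams_flags(X_demo, y_demo)
print(round(warning_leverage(27, 3), 3))  # 0.444
```

Under these assumptions, a three-descriptor model trained on 27 compounds has a warning leverage of 12/27, or about 0.44. Test-set compounds can be checked against the same threshold by passing their descriptor rows as `X_query`.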
3.4.2. Preparation of the Complex Structure
The complex between the enzyme and compound 12 was built using the semi-closed
conformation of hDPP III obtained earlier [8] by MD simulations of the structure available
in the Protein Data Bank (PDB code: 3FVY), since it has been shown that this is
the preferred enzyme form in aqueous solution [50]. Before the docking procedure,
the protonation of histidines was checked according to their ability to form hydrogen
bonds with neighboring amino acid residues. All Glu and Asp residues were negatively
charged (−1) and all Arg and Lys residues were positively charged (+1), as expected at
physiological conditions. AutoDock Vina 1.1.2 [51] was used to search for the best pose of
the ligand in the enzyme active site. The docking site was defined as a cubic grid box
with dimensions 75 × 75 × 75 Å$^3$ and the center placed on the Zn$^{2+}$ ion. The docking
simulation was performed with the standard 0.375 Å resolution, and 20 conformations were
generated. The complex with the best AutoDock Vina docking score was chosen for the
productive MD simulations. Parameterization of the complex structure was performed with the
AMBERTools16 modules antechamber and tleap, using the General Amber Force Field (GAFF) [52]
and ff14SB [53] force fields to parameterize the ligand and the protein, respectively. For the zinc
cation, Zn$^{2+}$, new hybrid bonded-nonbonded parameters were used from our previous