M. S. Bhatia et al. / Bioorg. Med. Chem. 17 (2009) 1654–1662
bond donor count (H-don), Chi3cluster (c3c), chiV3Cluster (cV3c),
dipole moment (DM), QMDipoleX (QMDx), QMDipoleY (QMDy),
QMDipoleZ (QMDz), information-based descriptor (Idaverage),
Quadrupole1 (Q1), Quadrupole2 (Q2), heat of formation (HF),
highest occupied molecular orbital (HOMO), lowest unoccupied
molecular orbital (LUMO), ionization potential (IP), molar
refractivity (SMR), partition coefficient (slogP), moment of inertia
along the x-axis (MIx), moment of inertia along the y-axis (MIy),
moment of inertia along the z-axis (MIz), average hydrophobicity by
the Kellogg method (SKaverage), average hydrophobicity by the Audry
method (SAaverage), X component of dipole (Xdipole), Y component of
dipole (Ydipole), Z component of dipole (Zdipole), radius of gyration
(RGYR), distance topological (DistTop), Kappa1 (K1), Kappa2 (K2),
quantum mechanics dipole moment (QMDM).
and these methods are implemented in either an agglomerative
(bottom–up) or divisive (top–down) procedure. Partitional clustering,
on the other hand, assumes that the objects have a nonhierarchical
character.12–16 The most popular partitional clustering algorithms
are the k-means cluster algorithms (k-MCAs) and the Jarvis–Patrick
algorithm (also known as the k-nearest neighbor cluster algorithm;
k-NNCA). The k-means clustering algorithms use an interchange (or
switching) method to divide n data points into k groups (clusters)
so that the sum of distances/dissimilarities between the objects
within the same cluster is minimized. The k-means approach requires
that k (the number of clusters) be known before clustering. The
Jarvis–Patrick method requires that the user specify the number of
nearest neighbors to consider, as well as the number of neighbors
that two objects must have in common to be merged. The Jarvis–Patrick
method is a deterministic algorithm; it does not require iterations
for computations.12–16 To design the training and test series, as well
as to demonstrate the structural diversity of the present database,
we carried out one of these cluster analyses (k-NNCA) for the
anticoagulant series. The Vlife graph package from the QSAR Module
was used to develop these CAs. In this study, we used the ‘average
linkage’ metric as the method to merge objects into clusters.
The average linkage distance between two clusters is defined as
the average distance (arithmetic mean of the squared Euclidean
distances) between pairs of objects, one in each cluster. Average
linkage tends to join clusters with small variances and produces new
clusters with roughly the same variance.
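As a minimal sketch of the average linkage criterion described above, the following pure-Python example computes the mean squared Euclidean distance between two clusters and uses it to merge the closest pair of clusters repeatedly. The function names and the stopping rule (a target number of clusters) are assumptions made for illustration; this is not the Vlife implementation.

```python
from itertools import product


def squared_euclidean(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_linkage(cluster_a, cluster_b):
    """Mean squared Euclidean distance over all pairs, one object per cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(squared_euclidean(p, q) for p, q in pairs) / len(pairs)


def agglomerate(points, n_clusters):
    """Bottom-up merging: start with singletons, join the closest pair each step.

    Stopping at `n_clusters` is an assumption of this sketch.
    """
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i and drop cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For example, `agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` groups the two points near the origin separately from the two points near (10, 10).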
5.7.3. Analysis of principal components
To compare the MDs computed in this work, we performed a factorial
analysis using the principal components method. The theoretical
aspects of this statistical technique have been extensively discussed
in the literature, including many chemical applications.5–11 The main
uses of factorial analytical techniques are: (1) to reduce the number
of variables, and (2) to detect structure in the relationships between
variables, namely, to classify variables.10,11 In this approach,
factors (or ‘new’ variables) are obtained from the original variables,
the molecular descriptors. Thus, these factors capture the ‘essence’
of these MDs, because they are linear combinations of the original
items.
Because each consecutive factor is defined to maximize the variability
not captured by the preceding factors, consecutive factors are
independent of each other. Put another way, consecutive factors are
uncorrelated, or orthogonal, to each other. The first factor obtained
is generally more highly correlated with the variables than the other
factors. This is to be expected, because the factors are extracted
successively and account for less and less of the overall variance.
The factor analysis was carried out using ‘varimax normalized’ as the
rotational strategy to obtain the factorial loadings from the
principal component analysis. The goal of this rotational procedure is
to obtain a clearer pattern of loadings, that is, factors that are
clearly marked by high loadings for some variables and low loadings
for others. The ‘varimax normalized’ rotation is the method most
commonly meant by ‘varimax’ rotation.18 This rotational strategy is
aimed at maximizing the variances of the squared normalized factorial
loadings (raw factorial loadings divided by the square roots of the
respective communalities) across the variables for each factor. This
strategy makes the structure of the factorial pattern as simple as
possible, permitting a clearer interpretation of the factors without
loss of orthogonality between them. Finally, some of the most
important conclusions that can be drawn from a factor analysis, and
that will be of great use in the present article, are the following:
(1) variables with a high loading in the same factor are interrelated,
and the more so the higher the loadings; (2) no correlation exists
between variables having nonzero loadings only in different factors;
and (3) only variables with high loadings in different factors may be
combined in a regression equation, to eliminate collinearities. These
are the principal ideas that permit the interpretation of the
factorial structure obtained using the factorial analysis as a
classification method.
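The normalized varimax criterion described above can be illustrated for the simplest case of exactly two factors, where the rotation angle has a closed form. The sketch below is restricted to that two-factor case and uses a hypothetical function name; the general multi-factor rotation used by statistical packages applies such planar rotations iteratively, and nothing here is taken from the Vlife software.

```python
import math


def varimax_two_factors(loadings):
    """Normalized varimax rotation of a two-factor loading matrix.

    `loadings` is a list of (a1, a2) rows, one row per variable.
    Returns the rotated loadings; the factors remain orthogonal because
    the transformation is a rigid rotation.
    """
    # Normalize each row by the square root of its communality (a1^2 + a2^2).
    rows = []
    for a, b in loadings:
        h = math.hypot(a, b)
        if h > 0.0:
            rows.append((a / h, b / h))
    p = len(rows)
    if p == 0:
        return list(loadings)

    u = [x * x - y * y for x, y in rows]
    v = [2.0 * x * y for x, y in rows]
    sum_u, sum_v = sum(u), sum(v)
    c_term = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    d_term = 2.0 * sum(ui * vi for ui, vi in zip(u, v))

    # Angle that maximizes the variance of the squared normalized loadings.
    phi = 0.25 * math.atan2(
        d_term - 2.0 * sum_u * sum_v / p,
        c_term - (sum_u * sum_u - sum_v * sum_v) / p,
    )
    cos_phi, sin_phi = math.cos(phi), math.sin(phi)

    # Row scaling commutes with the rotation, so rotating the raw loadings
    # is the same as rotating the normalized rows and rescaling afterwards.
    return [(a * cos_phi + b * sin_phi, -a * sin_phi + b * cos_phi)
            for a, b in loadings]
```

Applied to a loading pattern in which every variable loads on both factors, the rotation returns a pattern in which each variable is marked by a high loading on one factor and a low loading on the other, whenever such a simple structure exists.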
5.7.5. LDA and classification-based QSAR model
The discriminant functions were obtained using LDA,17 as
implemented in the Vlife model building Wizard; the default
parameters of this program were used in the development of the model.
Forward stepwise was fixed as the strategy for variable selection.
The principle of maximal parsimony (Occam’s razor) was taken
into account as the strategy for model selection. In its original
form, Occam’s razor states that ‘Entities should not be multiplied
beyond necessity’. In this case, simplicity is loosely equated with
the number of parameters in the model. If we understand the
predictive error to be the error rate for unseen examples, Occam’s
razor can be stated for the selection of QSAR models as follows
(‘QSAR Occam’s Razor’): given two QSAR models with the same
predictive error, the simpler one should be preferred because
simplicity is desirable in itself.28 Accordingly, we selected the
model with the highest statistical significance but with as
few parameters as possible.
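The ‘QSAR Occam’s Razor’ rule above can be sketched as a small selection function. The dictionary keys `n_params` and `error`, the function name, and the `tolerance` argument (how close two error rates must be to count as the same) are assumptions introduced for illustration only; they do not correspond to any option of the Vlife software.

```python
def select_parsimonious(models, tolerance=0.0):
    """Pick the model with the fewest parameters among the best-performing ones.

    `models` is a list of dicts with hypothetical keys:
      "n_params": number of descriptors in the model
      "error":    predictive error rate on unseen examples
    Models whose error is within `tolerance` of the lowest error are
    treated as having the same predictive error.
    """
    best_error = min(m["error"] for m in models)
    candidates = [m for m in models if m["error"] - best_error <= tolerance]
    # Fewest parameters first; ties are broken by the lower error.
    return min(candidates, key=lambda m: (m["n_params"], m["error"]))
```

With `tolerance=0.0` the function only breaks exact ties in favor of the simpler model; a larger tolerance trades a small loss in predictive error for fewer parameters.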
5.7.6. Validation of the obtained model
The statistical robustness and predictive power of the obtained
model were assessed using a prediction (test) set.29 In addition, a
leave-group-out (LGO) cross-validation (CV) strategy was carried
out. A compound is eliminated from the training set and its
biological activity is predicted on the basis of the k-NN principle,
that is, as the weighted average activity of the k most similar
molecules. The similarities are evaluated as Euclidean distances
between compounds using only the subset of descriptors that
corresponds to the current model. This is repeated until every
compound in the training set has been eliminated and its activity
predicted once. In this way, every observation was predicted once
(in its group of left-out observations).
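The elimination-and-prediction loop described above can be sketched as follows, leaving out one compound at a time. The text does not state how the k neighbors are weighted, so inverse-distance weights are assumed here; the function names are likewise hypothetical, and the descriptor vectors are expected to contain only the descriptors of the current model.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, descriptors, activities, k):
    """Weighted average activity of the k nearest compounds.

    Weights are 1/distance (an assumption of this sketch). If one or more
    neighbors coincide with the query, their mean activity is returned.
    """
    ranked = sorted(
        (euclidean(query, d), y) for d, y in zip(descriptors, activities)
    )[:k]
    exact = [y for dist, y in ranked if dist == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / dist for dist, _ in ranked]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, ranked))
    return weighted_sum / sum(weights)


def leave_one_out(descriptors, activities, k):
    """Predict each training compound once from all the remaining compounds."""
    predictions = []
    for i in range(len(descriptors)):
        rest_x = descriptors[:i] + descriptors[i + 1:]
        rest_y = activities[:i] + activities[i + 1:]
        predictions.append(knn_predict(descriptors[i], rest_x, rest_y, k))
    return predictions
```

Comparing the returned predictions with the observed activities gives a cross-validated estimate of how well the selected descriptors predict compounds that were not used to make the prediction.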
5.7.4. Cluster analysis
Cluster analysis (CA) encompasses a number of different
classification algorithms and allows the observed data to be organized
into meaningful structures. Many CA algorithms have been invented,
and they belong to two categories: hierarchical clustering
and partitional (nonhierarchical) clustering. Hierarchical clustering
rearranges objects in a binary tree structure (joining clustering),
Acknowledgement
The authors are thankful to Dr. H.N. More, Principal, Bharati
Vidyapeeth College of Pharmacy, Kolhapur for providing facilities
to carry out the research work.