R. Huang et al. / Genomics 87 (2006) 315–328
327
at least five observations, the sampling distribution of H is a very close
approximation of the m2 distribution for k ꢃ 1 degrees of freedom. It is actually
a fairly close approximation even when one or more of the samples include as
few as three observations. In the present study, only pathways that have at least
three gene expression data vectors available are included in the calculations.
Significance levels ( p values) are obtained using the m2 distribution with 1
degree of freedom (two-sample populations, intra- and interpathway correlation
coefficients, are compared here; therefore, k = 2 and k ꢃ 1 = 1).
single-linkage hierarchical clustering of the 111 data vectors was performed
using MATLAB, which segregated the pathways into 15 groups. A dendrogram
showing the neighboring pathways was generated.
Acknowledgments
The authors thank the members of the STB staff, especially
Drs. Robert Shoemaker and Susan Mertins for valuable
contributions during the preparation of the manuscript. This
project has been funded in whole or in part with federal funds
from the National Cancer Institute, National Institutes of
Health, under Contract NO1-CO-12400. The content of this
publication does not necessarily reflect the views or policies of
the Department of Health and Human Services, nor does
mention of trade names, commercial products, or organizations
imply endorsement by the U.S. Government.
A large H score (H > 3.84) indicates that a statistically significant ( p <
0.05) difference exists between the intra- and the interpathway r populations. A
small r is assigned a lower rank and a large r a higher rank. If the average rank
of the interpathway r values is higher than that of the intrapathway r values,
then a negative sign is applied to the H score. Therefore, a large positive H
score indicates a high level of gene expression coherence within the pathway
compared to expression of genes not linked by a known pathway. Either
absolute or real r values can be used when calculating the H scores, and slightly
different results will be obtained, depending on whether the intrapathway r
values are dominated by positive or negative values. Absolute r values are used
in all calculations unless otherwise specified, since both significant positive and
negative correlations between two genes are assumed to be indicative of
coordinated expression.
Appendix A. Supplementary data
Randomization procedures are used to check the probability of getting a
large positive H score by chance compared to the p values obtained directly
from the m2 distribution. Random pathways of sizes approximately 5, 10, 15,
25, 50, and 100 are built by randomly picking and assigning genes in the data
set to each pathway, and their H scores are calculated. This procedure is
repeated 1000 times and the probabilities of getting H scores at various levels
for each pathway size are calculated. Further increase in the number of
randomizations does not appear to affect the outcome. The probabilities
obtained this way agree very well with the p values obtained directly using the
m2 distribution for pathway sizes 10 to 25 (number of genes in each pathway;
the size of the pathways we are working with is 10–25), that is, an H score of
>3.84 is required to get p < 0.05. However, the H score required to reach a
certain significance level (e.g., p < 0.05) shows a slight linear dependency on
pathway size (i.e., sample size). For larger pathways, higher H scores are
needed to get a significant p value. For this reason, we have applied the
randomization procedure to obtain the probability of observing an extreme H
score by chance for each pathway. Therefore, if a pathway in a particular
annotation system has n genes, then n genes are randomly selected from the
pool of m genes annotated by the system and the H score is calculated. This
procedure is repeated 1000 times for each pathway, and the fraction of H scores
that are more extreme than the actual H score of the pathway is then assigned as
the coherence p value for that particular pathway. If none of the 1000 random H
scores is more extreme than the actual H score, then the pathway is
significantly cohesive at p < 0.001. The number of significantly cohesive
( p < 0.05) pathways selected in this fashion, however, does not differ much
from that obtained directly from the m2 distribution because most pathways
(>90%) have fewer than 50 genes and most significant pathways (>60%) have
very large H scores (>7.5). For each gene annotation system, KEGG, BioCarta,
or GO, a distribution of the pathway H scores was calculated for each random
permutation; the mean and standard deviation (shown as error bars) of the
thousand random distributions were then plotted and are shown in Fig. 1. These
three random distributions are almost entirely coincidental.
References
[1] L. Hood, J.R. Heath, M.E. Phelps, B. Lin, Systems biology and new
technologies enable predictive and preventative medicine, Science 306
(2004) 640–643.
[2] J. Ihmels, S. Bergmann, N. Barkai, Defining transcription modules
using large-scale gene expression data, Bioinformatics 20 (2004)
1993–2003.
[3] J. Ihmels, R. Levy, N. Barkai, Principles of transcriptional control in the
metabolic network of Saccharomyces cerevisiae, Nat. Biotechnol. 22
(2004) 86–92.
[4] Z. Li, C. Chan, Inferring pathways and networks with a Bayesian
framework, FASEB J. 18 (2004) 746–748.
[5] Z. Li, C. Chan, Integrating gene expression and metabolic profiles, J. Biol.
Chem. 279 (2004) 27124–27137.
[6] H.H. Yang, Y. Hu, K.H. Buetow, M.P. Lee, A computational approach to
measuring coherence of gene expression in pathways, Genomics 84
(2004) 211–217.
[7] E.J. Williams, D.J. Bowles, Coexpression of neighboring genes in
the genome of Arabidopsis thaliana, Genome Res. 14 (2004)
1060–1067.
[8] H. Caron, et al., Evidence for two tumor suppressor loci on chromosomal
bands 1p35–36 involved in neuroblastoma: one probably imprinted,
another associated with N-myc amplification, Hum. Mol. Genet. 4 (1995)
535–539.
[9] M.J. Lercher, A.O. Urrutia, L.D. Hurst, Clustering of housekeeping genes
provides a unified model of gene order in the human genome, Nat. Genet.
31 (2002) 180–183.
We chose the H score as an indicator of pathway gene expression
cohesiveness because a significance measure exists for the H score calculated
for a pathway and, by using rank scores in place of actual correlation values, the
method is less sensitive to pathway size and gene outliers. This method is also a
more reliable measure of pathway coherence because not only gene–gene
relationships within a pathway are considered, but also comparisons are made
to check whether the intrapathway gene–gene interactions are in fact stronger
than interactions between genes not connected by some known pathway.
[10] B.A. Cohen, R.D. Mitra, J.D. Hughes, G.M. Church, A computational
analysis of whole-genome expression data reveals chromosomal domains
of gene expression, Nat. Genet. 26 (2000) 183–186.
[11] A.M. Boutanaev, A.I. Kalmykova, Y.Y. Shevelyov, D.I. Nurminsky, Large
clusters of co-expressed genes in the Drosophila genome, Nature
(London) 420 (2002) 666–669.
[12] P.T. Spellman, G.M. Rubin, Evidence for large domains of similarly
expressed genes in the Drosophila genome, J. Biol. 1 (2002) 5.
[13] M.J. Lercher, T. Blumenthal, L.D. Hurst, Coexpression of neighboring
genes in Caenorhabditis elegans is mostly due to operons and duplicate
genes, Genome Res. 13 (2003) 238–243.
Hierarchical clustering of KEGG pathways
The genes were first grouped into 50 clades based on their expression
patterns across the NCI60 using the SOM clustering procedure. The fraction of
genes that fall into each SOM clade was then calculated for every KEGG
pathway, yielding 111 data vectors each containing 50 fractions. Wards-based
[14] H. Ge, Z. Liu, G.M. Church, M. Vidal, Correlation between transcriptome
and interactome mapping data from Saccharomyces cerevisiae, Nat.
Genet. 29 (2001) 482–486.