ANALYSIS OF GENE ARRAY EXPERIMENTS
per 10,000 would by chance produce false-positive findings
even when using a fourfold change as the criterion for acceptance;
however, this number falls to near zero when n = 5
mice per group. Calculations similar to those shown in Ta-
bles 1, 2, and 3 can be used to estimate the number of false
positives expected for any given empirical distribution of
CVs. We recommend that those groups wishing to report
gene array results without formal statistical evaluation of
significance should accompany their reports of two- and
threefold changes with a comparison table showing the num-
bers of false-positive results to be expected from their exper-
imental design and observed distribution of CVs.
criterion a p value of .05/1000 = .00005. Such a criterion is
very conservative in the sense that it tends to produce large
numbers of false negative conclusions; it tends, in other
words, to make it hard to accept as proven hypotheses that
are in fact true. If an experiment testing 1000 genes produces
p values < .00005 for, say, eight genes, one could confidently
conclude that all eight genes are likely to distinguish
old from young mice; there would be only 1 chance in 20
that any of the eight effects is due to chance alone. Producing
such a low p value requires either very large numbers
of animals or very small interanimal SDs—much smaller
than are seen in practical cases. (Evidence that the experi-
mental system in question gives very reproducible values
for replicate aliquots of the same sample is not germane; the
variation in weight among a set of laboratory members, for
example, is not diminished by weighing them on a scale ac-
curate at the microgram level.) If a survey of 10,000 genes
shows that 20 of them reach p(t) = .001, it is likely that
some of these 20 will prove reproducible in subsequent
tests, but it is not possible to know which ones without fur-
ther experimental data.
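
The arithmetic behind the Bonferroni criterion can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original analysis: the function names are hypothetical, and the expected count of chance hits assumes that no gene has a real effect and that each test behaves as its nominal p value implies.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p value criterion that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


def expected_chance_hits(p_cutoff: float, n_tests: int) -> float:
    """Expected number of tests reaching p_cutoff if no gene has a real effect."""
    return p_cutoff * n_tests


# 1000 genes at a familywise alpha of .05: the .00005 criterion used in the text.
print(bonferroni_threshold(0.05, 1000))
# 10,000 genes at the same familywise alpha.
print(bonferroni_threshold(0.05, 10_000))
# A 10,000-gene survey screened at p = .001: hits expected from chance alone,
# to be compared with the 20 genes observed in the example.
print(expected_chance_hits(0.001, 10_000))
```

Under these assumptions, roughly half of the 20 genes in the example could be chance findings, which is why the list cannot be trusted gene by gene without further data.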
One way of dealing with this problem is to use a two-
stage experimental design. The first stage is used for hy-
pothesis generation: all genes are tested and ranked in order
of statistical probability. In a typical case, few if any of the
genes will show a sufficiently large age effect, with sufficiently
low interanimal variance, to meet the Bonferroni criterion
(p = .000005 for a set of 10,000 genes), but some are
likely to provide suggestive evidence of a real effect, say
p < .001. The second stage, then, involves testing a separate
set of animals, using either the array method or some other
convenient test (RT-PCR or RNase protection assays, for
example) for each of the genes that showed the most extreme
probabilities in the initial survey. If, for example, the
initial screen generates a list of 25 genes where p < .001,
the second, hypothesis-testing phase of the study can employ
a value of p = .05/25 = .002 as its criterion for hypothesis
confirmation; any genes that reach this level in the
second stage can be accepted as age-sensitive, at least in this
organ, genotype, and age range.
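
The two-stage logic can be expressed as a short Python sketch. Everything named here is an assumption made for illustration: the function `two_stage_screen`, the gene labels, and the p values are hypothetical, and the stage-two p values are presumed to come from a separate set of animals as described above.

```python
def two_stage_screen(stage1_p, stage2_p, screen_cutoff=0.001, alpha=0.05):
    """Return (confirmed genes, stage-two criterion).

    stage1_p, stage2_p: dicts mapping gene name -> p value (hypothetical inputs).
    Stage one selects candidates with p < screen_cutoff; stage two applies a
    Bonferroni criterion based only on the number of candidates.
    """
    candidates = [gene for gene, p in stage1_p.items() if p < screen_cutoff]
    if not candidates:
        return [], None
    criterion = alpha / len(candidates)
    # A candidate with no stage-two measurement is treated as unconfirmed.
    confirmed = [g for g in candidates if stage2_p.get(g, 1.0) <= criterion]
    return confirmed, criterion


# Made-up p values, for illustration only.
stage1 = {"geneA": 0.0002, "geneB": 0.0008, "geneC": 0.03, "geneD": 0.0004}
stage2 = {"geneA": 0.001, "geneB": 0.04, "geneD": 0.012}
print(two_stage_screen(stage1, stage2))
```

Because the second stage tests only the short candidate list, its criterion is far less stringent than one corrected for every gene on the array.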
Criteria Based Upon Formal Significance Testing
An alternate approach is to base conclusions on formal
significance testing using a conventional statistical criterion,
an idea that is common outside the realm of gene-expression
screening but has yet to make much headway among users
of this cutting-edge technology. One plausible starting point
would be to compute the Student’s t test statistic for each
gene in the set of interest as an index of how likely it would
be to obtain the observed distribution of gene expression
values by chance alone. Purists would object that it is not
possible, for n < 5 or so, to check the assumptions on which
the t test is based (normality and equality of variance), but
even they may admit that a statistical test that includes in-
formation about interanimal variation is an improvement on
ratio-based tests that ignore variance entirely. Genes with
low interanimal variation will yield high values (i.e., low
probabilities) of the t statistic given modest age or genotype
effects (two- to fourfold, for example) and deserve more
confidence than those in which large intersubject variation
produces a nonsignificant p(t). Some laboratories specializ-
ing in array-based screening are beginning (6) to restrict
their conclusions to genes where p(t) < .05, the conventional
criterion for rejection of the null hypothesis of no
effect.
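
To illustrate how interanimal variation enters the t statistic, the following Python sketch computes the pooled-variance (Student's) two-sample statistic for two hypothetical genes that show the same twofold difference in mean expression. The expression values are invented for illustration, and the function name `student_t` is an assumption; converting the statistic to a p value would additionally require the t distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_b) - mean(group_a)) / standard_error, df


# Hypothetical expression values: both genes have group means of 100 and 200,
# but the second gene has much larger interanimal variation.
young_tight, old_tight = [95, 100, 105, 98, 102], [190, 200, 210, 196, 204]
young_noisy, old_noisy = [40, 100, 160, 70, 130], [80, 200, 320, 140, 260]

print(student_t(young_tight, old_tight))
print(student_t(young_noisy, old_noisy))
```

A ratio-based criterion would treat these two genes identically, whereas the t statistic is far larger for the gene with low interanimal variation.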
A key problem with a t-test–based approach in the context
of gene expression screening is that it ignores multiple com-
parison artifacts. Consider a hypothetical situation in which
a postdoctoral scientist decides to measure expression levels
of 10,000 genes in each of 20 young and 20 old mice and to
make her biological interpretations on the basis of those
genes where the age effect is large and consistent enough to
reach p(t) < .05. Alas, unbeknownst to this researcher, a disgruntled
technician has switched the identification codes on
all the mice at random, so that the nominally “young” group
actually contains an equal number of young and old animals.
Among 10,000 genes, however, 1 in every 20 will, entirely
by chance, reach p(t) = .05; the postdoc, not knowing of the
deception, is pleased to find 500 genes that show “signifi-
cant” age effects, and she makes her interpretation and con-
ducts years of follow-up analyses on the basis of these en-
tirely spurious and unreproducible findings. The problem,
well described in most elementary statistics texts, is that a
significance criterion of .05 does not protect against false-positive
conclusions in a large series of tests.
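
The scale of the problem in the scrambled-label scenario can be checked with a small Python sketch. It rests on two stated assumptions: p values are uniformly distributed between 0 and 1 when there is no real effect, so each test is modeled as a uniform random draw rather than an actual t test, and the tests are treated as independent. The function names are hypothetical.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of 'significant' tests when no gene has a real effect."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha):
    """Chance of one or more false positives among independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def simulate_null_screen(n_tests, alpha, seed=0):
    """Count 'significant' genes in one simulated screen with no real effects."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


print(expected_false_positives(10_000, 0.05))  # expected count for the postdoc's screen
print(prob_at_least_one(10_000, 0.05))         # chance that at least one gene is flagged
print(simulate_null_screen(10_000, 0.05))      # count from one simulated screen
```

The simulated count varies from run to run with the seed, but it stays close to the expected value of 1 in 20 of the genes tested.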
This method, like any method using a small number of
animals to examine traits with high variance, is likely to
suffer from a high false negative rate: those genes that show
above-average interanimal variance will not produce signif-
icant p values at any stage of the analysis in tests that use
only 5 to 10 mice per group. Investigators who have in-
vested considerable effort in large-scale gene scanning sur-
veys may therefore wish to make public—either in a formal
report or in an associated electronic archive—lists of genes
that show relatively large effects (say two- or threefold
changes) even if these do not approach statistical signifi-
cance; genes that show large effects, even with high interan-
imal variation, may still deserve further attention if the pat-
terns of expression suggest or refute specific biological
theories of interest.
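
The false negative problem can be made concrete with a rough power calculation. The Python sketch below is only an approximation under stated assumptions: it uses a normal approximation to a two-sided, two-group comparison, it expresses the true effect in units of the interanimal SD, and it ignores the uncertainty in estimating that SD, so it overstates power for groups as small as 5 to 10 animals. The function name and the example effect size are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_in_sd, n_per_group, alpha):
    """Approximate power of a two-sided comparison of two equal-sized groups.

    effect_in_sd: true difference between group means divided by the
    interanimal SD (assumed equal in both groups).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_in_sd * sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical gene whose true age effect equals one interanimal SD.
for n in (5, 10):
    for alpha in (0.05, 0.001):
        print(n, alpha, round(approx_power(1.0, n, alpha), 3))
```

Even by this optimistic approximation, a gene whose true effect is no larger than its interanimal SD is more likely to be missed than detected with 5 mice per group, and stricter criteria lower the detection rate further.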
If the cost of the animals (or human samples or cell lines)
is relatively small compared with the overall cost of the testing
program, it may be useful to carry out the first-stage
survey using pools instead of individuals. If, for example,
a group of 24 young mice can be tested as six pools
The Bonferroni procedure is the accepted way to adjust
significance criteria in such a situation. When testing 1000
hypotheses simultaneously, for example, one would use as