WEINSTEIN ET AL.
averaging, that is, by running replicate arrays in which the colors are reversed (4). After normalization, a number of choices must be made before higher-level analysis. Should the data be log transformed to obtain a more nearly normal distribution and homoskedasticity? Must they then be thresholded? Should the mean over samples for a given gene be subtracted from the expression levels? The mean over genes for a given sample? Should the levels be divided by a measure of the dispersion such as the standard deviation in one or (by an iterative process) both directions? Should the continuous values be binned, binarized, or turned into ranks for analysis? The answers to these and other such questions often depend on characteristics of the data or the nature of the question being asked. For example, if the important information resides in relative, rather than absolute, gene expression values across a database, one will likely subtract the mean or median across samples.
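The preprocessing choices listed above can be made concrete with a minimal sketch. It assumes a matrix with one row per gene and one column per sample, stored as nested Python lists; the function names and the default threshold (`floor=1.0`) are illustrative assumptions, not part of any published pipeline.

```python
import math
from statistics import mean, stdev


def log_transform(matrix, floor=1.0):
    """Threshold each value at `floor` (an assumed cutoff), then take log2."""
    return [[math.log2(max(v, floor)) for v in row] for row in matrix]


def center_rows(matrix):
    """Subtract each gene's mean over samples (rows are genes)."""
    out = []
    for row in matrix:
        m = mean(row)
        out.append([v - m for v in row])
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def center_columns(matrix):
    """Subtract each sample's mean over genes (columns are samples)."""
    return transpose(center_rows(transpose(matrix)))


def scale_rows(matrix):
    """Divide each gene's values by their standard deviation over samples.

    Rows with zero (or undefined) dispersion are returned unchanged.
    """
    out = []
    for row in matrix:
        sd = stdev(row) if len(row) > 1 else 0.0
        out.append([v / sd for v in row] if sd > 0 else list(row))
    return out


def rank_rows(matrix):
    """Replace each gene's values by ranks over samples (0 = lowest).

    Ties are broken by column position, a simplification for this sketch.
    """
    out = []
    for row in matrix:
        order = sorted(range(len(row)), key=lambda j: row[j])
        ranks = [0] * len(row)
        for r, j in enumerate(order):
            ranks[j] = r
        out.append(ranks)
    return out
```

For the case described in the text, where relative rather than absolute expression matters, one would apply `center_rows(log_transform(data))`. Replacing `mean` with `statistics.median` in `center_rows` gives the median-centered variant.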
Task #5: To analyze and visualize high-dimensional data. The simplest experimental design is binary, for example, comparison of cancer with normal cells or malignant with non-malignant. More complicated is the time course, for example before and during a treatment. More demanding still is the large database of samples to be analyzed for patterns. The latter two types of data are often presented in the form of what we term clustered image maps, and others have called heat maps. We introduced clustered image maps (CIMs) for pharmacological, genomic, and proteomic studies in the mid-1990s (5–7). Our collaborators later developed a red-black-green color scheme for CIMs (8, 9). Figure 1 shows a slightly more complex CIM (10) that relates patterns of gene expression to patterns of pharmacologic potency in the 60 human cancer cell lines used in the National Cancer Institute’s Drug Discovery Program (11, 12). A flexible program for [...] nci.nih.gov

Depending on the questions to be asked, high-dimensional data sets may be analyzed by supervised or unsupervised methods. The former include, for example, techniques based on regression, discrimination, or prediction; the latter on techniques such as clustering (5, 6, 9, 10, 13), principal components analysis, or multidimensional scaling. There is no right method of analysis. Demands of the data and the scientific questions asked will condition the choice.

Task #6: To search the biomedical literature and public databases for information on genes or gene-gene relationships. Most gene expression microarray experiments produce long lists of genes with possible significance, and the problem is to distinguish causally interesting relationships from epiphenomenal ones and from statistical coincidence. For that purpose, outside information is generally necessary. Microarray studies are a form of omic research (14–16), but interpretation of the data from them generally requires synergy with classical hypothesis-driven studies of one gene, one gene product, or one process at a time. To facilitate searches of the literature in this context, we developed the program MedMiner (17), which is freely available (along with databases and [...] uses a combination of GeneCards from the Weizmann Institute, PubMed from the National Library of Medicine, semantic analysis, syntactic analysis, and keywords to find and organize key sentences from abstracts on genes, gene-gene relationships, and gene-drug relationships. MedMiner can speed up by 5- to 10-fold the rate at which the voluminous literature on important genes is organized and interpreted. A version called EDGAR (Extraction of Data on Genes And Relations) based on deeper semantic analysis is under development (18).

Task #7: To integrate the expression data with other types of information. Very often, the gene expression data are most richly understood and most valuable when related to other types of information at the protein, DNA, functional, or pharmacologic level. Figure 1 provides an example of the pharmacologic connection (10).
Task #8: To design the study carefully (in terms of controls, replicates, internal standards, and design points). This step should come first, of course. In microarray studies, it is often not feasible to go back afterward and fill in the gaps in an imperfectly designed or executed experimental series. Because arrays are expensive, the tendency is to skimp on replicates and controls, but that is almost always a mistake. Some of the best and most often-used databases placed in the public domain to date suffer from these insufficiencies. Even if it is not practical to use sufficient replicates for all samples, selected replicates (and replicated genes on each array) pay major dividends.
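As an illustration of what replicates make possible, the following minimal sketch combines a color-reversed (dye-swap) pair of arrays and summarizes replicated measurements of a gene. It assumes the inputs are log2 ratios (red/green) and that swapping the labels reverses the sign of the true log ratio while leaving any dye-specific bias unchanged. The function names are hypothetical and do not correspond to any particular software package.

```python
from statistics import mean, stdev


def combine_dye_swap(forward, swapped):
    """Combine per-gene log2 ratios from a dye-swap pair of arrays.

    forward: log2(red/green) with the test sample labeled red.
    swapped: log2(red/green) with the labels reversed.
    Half the difference estimates the true log ratio; half the sum
    estimates the dye-specific bias (under the assumptions stated above).
    """
    estimate = [(f - s) / 2 for f, s in zip(forward, swapped)]
    dye_bias = [(f + s) / 2 for f, s in zip(forward, swapped)]
    return estimate, dye_bias


def replicate_summary(replicates):
    """Mean and standard deviation of replicated log2 ratios per gene.

    replicates: dict mapping gene name -> list of replicate values.
    The standard deviation is None when only one value is available.
    """
    return {
        gene: (mean(values), stdev(values) if len(values) > 1 else None)
        for gene, values in replicates.items()
    }
```

A gene with a single measurement yields no dispersion estimate at all, which is one way of seeing why even a few selected replicates are worth their cost.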
This whirlwind summary of the tasks involved in analysis of microarray gene expression data has by no means touched on all of the important ingredients of the problem, let alone presented them in satisfactory detail. More important than the details of method, however, are common sense and an appreciation of basic statistical principles. Artificial intelligence may one day produce software that can substitute for the human judgment and expertise currently required for gene expression analysis. But such software would look nothing like what is now available.
LITERATURE CITED
1. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring
of gene expression patterns with a complementary DNA microarray.
Science 1995;270:467–470.
2. Chen Y, Dougherty ER, Bittner ML. Ratio-based decisions and the
quantitative analysis of cDNA microarray images. J Biomed Optics
1997;2:364–374.
3. Ermolaeva O, Rastogi M, Pruitt KD, Schuler GD, Bittner ML, Chen Y,
Simon R, Meltzer P, Trent JM, Boguski MS. Data management and
analysis for gene expression arrays. Nat Genet 1998;20:19–23.
4. Zhou Y, Gwadry FG, Reinhold WC, Miller L, Smith LH, Scherf U, Liu
E, Kohn KW, Pommier Y, Weinstein JN. Transcriptional regulation of
mitotic genes by camptothecin-induced DNA damage: Microarray
analysis of dose- and time-dependent effects. Cancer Res Submitted.
5. Weinstein JN, Myers TG, Buolamwini J, Raghavan K, van Osdol W,
Licht J, Viswanadhan VN, Kohn KW, Rubinstein LV, Koutsoukos AD,
Zaharevitz D, Grever MR, Monks A, Scudiero DA, Chabner BA,
Anderson NL, Paull KD. Predictive statistics and artificial intelligence in the
U.S. National Cancer Institute’s Drug Discovery Program for Cancer
and AIDS. Stem Cells 1994;12:13–22.
6. Weinstein JN, Myers TG, O’Connor PM, Friend SH, Fornace AJ, Kohn
KW, Fojo T, Bates SE, Rubinstein LV, Anderson NL, Buolamwini JK,