OJELUND, BROWN, MADSEN, AND THYREGOD
For 5-fold cross-validation, the observations are first divided into five equal-sized groups. Denoting these groups by L^(1), ..., L^(5) and using an obvious notation, define

    L^{(-v)} = L - L^{(v)}, \qquad v = 1, \ldots, 5,

where L is the entire dataset. Now use the data L^(-v) to estimate the parameters and L^(v) to validate. Repeating this for v = 1, ..., 5, the mean squared error of prediction (MSEP) becomes

    \mathrm{MSEP} = \frac{1}{n} \sum_{v=1}^{5} \sum_{(y_i, x_i) \in L^{(v)}} \bigl( y_i - x_i' \hat{\beta}^{(-v)} \bigr)^2,

where β̂^(-v) is the estimate found using the data L^(-v). The number of variables q is estimated by minimizing the MSEP value. In the following section, leave-one-out cross-validation is used to determine the shrinkage factor of ridge regression. Leave-one-out cross-validation is the same as n-fold cross-validation, where n is the number of observations.
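The cross-validation scheme above can be sketched in a few lines of code. The following is an illustrative numpy sketch, not code from the article; `fit` stands in for any estimator that returns a coefficient vector.

```python
import numpy as np

def msep_kfold(X, y, fit, k=5, rng=None):
    """Estimate MSEP by k-fold cross-validation.

    The data are split into k roughly equal groups L^(v); each group is
    held out in turn, the model is fit on L^(-v), and the held-out
    squared prediction errors are accumulated.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    idx = rng.permutation(n)
    sse = 0.0
    for held_out in np.array_split(idx, k):
        train = np.setdiff1d(idx, held_out)
        beta = fit(X[train], y[train])               # estimate on L^(-v)
        resid = y[held_out] - X[held_out] @ beta     # validate on L^(v)
        sse += np.sum(resid ** 2)
    return sse / n

# toy check with an OLS fitter on simulated data
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(40)
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
print(msep_kfold(X, y, ols, k=5, rng=1))
```

Setting k = n recovers leave-one-out cross-validation, as noted above.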
3. MONTE CARLO SIMULATION STUDY

In this section, prediction with the mean subset is compared to the best subset, ridge regression, garrote, and lasso methods through Monte Carlo simulations. The simulation study is constructed not to show the mean subset in a favorable light, but rather to demonstrate when the different shrinkage methods may be suitable. The competing shrinkage methods are defined as follows. The ridge regression estimate is obtained from

    \hat{\beta}_R = (X'X + k_R I)^{-1} X' y,    (9)

where k_R is the shrinkage factor.
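The ridge estimate in (9) has a closed form requiring only one linear solve. A minimal numpy sketch (illustrative, not the authors' code):

```python
import numpy as np

def ridge(X, y, k_R):
    """Ridge estimate from (9): (X'X + k_R I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k_R * np.eye(p), X.T @ y)

# a larger shrinkage factor pulls the coefficient vector toward zero
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
y = X.sum(axis=1) + rng.standard_normal(40)
print(np.linalg.norm(ridge(X, y, 0.1)),
      np.linalg.norm(ridge(X, y, 100.0)))
```

The norm of the ridge solution is nonincreasing in k_R, which is the sense in which k_R controls shrinkage.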
The garrote starts with the ordinary least squares (OLS) estimates and shrinks them by nonnegative factors whose sum is constrained. For a given shrinkage factor t ≥ 0, the garrote minimizes

    \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} c_j \hat{\beta}_j^{\mathrm{OLS}} x_{ij} \Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} c_j \le t, \; c_j \ge 0.    (10)

The lasso estimate, β̂_L, is defined by

    \hat{\beta}_L = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} \bigl( y_i - x_i' \beta \bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,    (11)

where t ≥ 0 is the shrinkage factor. In all of these shrinkage methods, the shrinkage factors are estimated by cross-validation.
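The constrained problem in (11) is equivalent to a penalized (Lagrangian) form, which a short coordinate-descent loop can solve via soft-thresholding. This is an illustrative sketch under that reparameterization, not the estimation code used in the study:

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for the Lagrangian form of (11):
    minimize sum_i (y_i - x_i'beta)^2 + lam * sum_j |beta_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r
            beta[j] = soft_threshold(z, lam / 2) / col_ss[j]
    return beta

# sparse truth: a large penalty sets the null coefficients exactly to zero
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + 0.1 * rng.standard_normal(40)
print(lasso_cd(X, y, lam=20.0))
```

The soft-threshold step is what distinguishes the lasso from ridge regression: coefficients can be shrunk all the way to zero, so the lasso performs variable selection and shrinkage simultaneously.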
The setup of the example is largely adopted from Breiman (1996), and the design makes it possible to investigate the influence of cross-validation on the prediction performance. The explanatory variables are sampled from a zero-mean, 20-variable multivariate normal distribution with covariance matrix Σ_{ij} = ρ^{|i−j|}. Three different values of ρ are tested (0, .45, and .9), which spans uncorrelated to highly correlated variables. For each correlation structure, five different coefficient vectors are used to generate the dependent data. The nonzero coefficients are in two clusters of adjacent variables, with the clusters centered at variables 5 and 15. The initial coefficient values for the variables clustered around variable 5 are given by

    \beta_{5+j} = (h - |j|)^2, \qquad |j| \le h,

where h is a fixed integer controlling the cluster width. The cluster at variable 15 is generated in the same way. The influence of noise is studied by testing three levels of signal-to-noise (SN) ratio, SN ∈ {1, 5, 9}. To obtain the desired SN ratio, the coefficients are scaled so that β'X'Xβ/n = SN. The vector of dependent variables y is calculated from Xβ + ε, where the number of observations is n = 40 and ε is sampled from N(0, I). In this study, the explanatory variables and the response variable were centered before estimation.

The performance of the shrinkage methods is measured by calculating the mean model error (ME), defined as

    \mathrm{ME} = (\hat{\beta} - \beta)' \Sigma (\hat{\beta} - \beta) + (\bar{y} - \bar{x}' \hat{\beta})^2,    (12)

where ȳ and x̄ are the sample means of the response variable and the explanatory variables in each simulated dataset.
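The data-generating design just described can be written out compactly. The sketch below follows the stated design; the seed and function name are illustrative choices, not the authors' code:

```python
import numpy as np

def simulate(h=2, rho=0.45, sn=5.0, n=40, p=20, seed=None):
    """One simulated dataset: covariance Sigma_ij = rho^|i-j|,
    coefficient clusters at variables 5 and 15 (1-based), and the
    coefficients scaled so that beta'X'X beta / n = SN."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    cov = rho ** np.abs(np.subtract.outer(idx, idx))   # Sigma_ij = rho^|i-j|
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    beta = np.zeros(p)
    for center in (5, 15):
        for j in range(-h, h + 1):                     # beta_{c+j} = (h-|j|)^2
            beta[center - 1 + j] = (h - abs(j)) ** 2
    beta *= np.sqrt(sn / (beta @ X.T @ X @ beta / n))  # enforce the SN ratio
    y = X @ beta + rng.standard_normal(n)              # eps ~ N(0, I)
    return X, y, beta

X, y, beta = simulate(seed=1)
print(beta @ X.T @ X @ beta / len(y))   # equals the SN ratio by construction
```

Note that the boundary coefficients of each cluster, at |j| = h, are exactly zero, so h controls how many adjacent variables are truly active.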
Figures 2–4 show the average MEs for the methods. In the left-side graphs in each figure, the shrinkage factors are estimated using cross-validation. The Monte Carlo simulation was repeated 2,000 times for each combination of h and ρ, and the estimated standard errors of the points in the graphs are less than .02. Fivefold cross-validation was used for the mean subset, lasso, garrote, and best subset, and leave-one-out cross-validation was used for ridge regression. This method of estimating the shrinkage factors was suggested by Breiman (1996). Whereas the leave-one-out estimate has lower bias, it is degraded by its higher variance. Hence, leave-one-out cross-validation may be used for stable methods like ridge regression, whereas 5-fold or 10-fold cross-validation, with its higher bias, is suggested for unstable methods like best subset selection. In the right-side graphs in Figures 2–4, the true data-generating model is assumed known (referred to as the crystal ball), and the value of the shrinkage factor is selected such that the ME in (12) is minimized.

The graphs show that mean subset and ridge regression are complementary to one another. In cases with only a few nonzero coefficients, mean subset and best subset give good prediction, but in cases with many nonzero coefficients, ridge regression works best. When the shrinkage factor is estimated using cross-validation, mean subset consistently gives better prediction than best subset. The difference increases with an increasing number of nonzero coefficients and a decreasing SN ratio. The graphs also reveal that the main reason for this difference in prediction performance is the instability of the best subset method. This is demonstrated by the degradation in prediction performance when q is estimated using cross-validation instead of the crystal ball. The graphs also show that the lasso is preferred over the garrote and that the difference between these two methods increases with increasing collinearity between the explanatory variables. Furthermore, the lasso is better than or as good as mean subset, except when the underlying model is small and the SN ratio is high.

4. PREDICTION OF MOISTURE AND PROTEIN CONTENT IN WHEAT

In this real data example, the objective is to predict the amount of moisture and protein in wheat using near-infrared
TECHNOMETRICS, NOVEMBER 2002, VOL. 44, NO. 4