OJELUND, BROWN, MADSEN, AND THYREGOD
For 5-fold cross-validation, the observations are first divided into five equal-sized groups. Denoting these groups by L^(1), ..., L^(5) and using an obvious notation, define

    L^{(-v)} = L - L^{(v)}, \qquad v = 1, \ldots, 5,

where L is the entire dataset. Now use the data L^(-v) to estimate the parameters and L^(v) to validate. Repeating this for v = 1, ..., 5, the mean squared error of prediction (MSEP) becomes

    \mathrm{MSEP} = \frac{1}{n} \sum_{v=1}^{5} \sum_{(y_i, x_i) \in L^{(v)}} \bigl( y_i - x_i' \hat{\beta}^{(-v)} \bigr)^2,

where β̂^(-v) is the estimate found using the data L^(-v). The number of variables q is estimated by minimizing the MSEP value. In the following section, leave-one-out cross-validation is used to determine the shrinkage factor of ridge regression. Leave-one-out cross-validation is the same as n-fold cross-validation, where n is the number of observations.
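The cross-validation scheme above can be sketched in a few lines of code. The following is an illustrative numpy sketch, not code from the article; `fit` stands in for any estimator that returns a coefficient vector.

```python
import numpy as np

def msep_kfold(X, y, fit, k=5, rng=None):
    """Estimate MSEP by k-fold cross-validation.

    The data are split into k roughly equal groups L^(v); each group is
    held out in turn, the model is fit on L^(-v), and the held-out
    squared prediction errors are accumulated.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    idx = rng.permutation(n)
    sse = 0.0
    for held_out in np.array_split(idx, k):
        train = np.setdiff1d(idx, held_out)
        beta = fit(X[train], y[train])               # estimate on L^(-v)
        resid = y[held_out] - X[held_out] @ beta     # validate on L^(v)
        sse += np.sum(resid ** 2)
    return sse / n

# toy check with an OLS fitter on simulated data
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(40)
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
print(msep_kfold(X, y, ols, k=5, rng=1))
```

Setting k = n recovers leave-one-out cross-validation, as noted above.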
3. MONTE CARLO SIMULATION STUDY

In this section, prediction with the mean subset is compared to the best subset, ridge regression, garrote, and lasso methods through Monte Carlo simulations. The simulation study is constructed not to show the mean subset in a favorable light, but rather to demonstrate when the different shrinkage methods may be suitable. The competing shrinkage methods are defined as follows. The ridge regression estimate is obtained from

    \hat{\beta}_R = (X'X + k_R I)^{-1} X' y,    (9)

where k_R is the shrinkage factor.
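The ridge estimate in (9) has a closed form requiring only one linear solve. A minimal numpy sketch (illustrative, not the authors' code):

```python
import numpy as np

def ridge(X, y, k_R):
    """Ridge estimate from (9): (X'X + k_R I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k_R * np.eye(p), X.T @ y)

# a larger shrinkage factor pulls the coefficient vector toward zero
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
y = X.sum(axis=1) + rng.standard_normal(40)
print(np.linalg.norm(ridge(X, y, 0.1)),
      np.linalg.norm(ridge(X, y, 100.0)))
```

The norm of the ridge solution is nonincreasing in k_R, which is the sense in which k_R controls shrinkage.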
The garrote starts with the ordinary least squares (OLS) estimates and shrinks them by nonnegative factors whose sum is constrained. For a given shrinkage factor t ≥ 0, the garrote minimizes

    \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} c_j \hat{\beta}_j^{\mathrm{OLS}} x_{ij} \Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} c_j \le t, \; c_j \ge 0.    (10)

The lasso estimate, β̂_L, is defined by

    \hat{\beta}_L = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} \bigl( y_i - x_i' \beta \bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,    (11)

where t ≥ 0 is the shrinkage factor. In all of these shrinkage methods, the shrinkage factors are estimated by cross-validation.
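The constrained problem in (11) is equivalent to a penalized (Lagrangian) form, which a short coordinate-descent loop can solve via soft-thresholding. This is an illustrative sketch under that reparameterization, not the estimation code used in the study:

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for the Lagrangian form of (11):
    minimize sum_i (y_i - x_i'beta)^2 + lam * sum_j |beta_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r
            beta[j] = soft_threshold(z, lam / 2) / col_ss[j]
    return beta

# sparse truth: a large penalty sets the null coefficients exactly to zero
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + 0.1 * rng.standard_normal(40)
print(lasso_cd(X, y, lam=20.0))
```

The soft-threshold step is what distinguishes the lasso from ridge regression: coefficients can be shrunk all the way to zero, so the lasso performs variable selection and shrinkage simultaneously.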
The setup of the example is largely adopted from Breiman (1996), and the design makes it possible to investigate the influence of cross-validation on the prediction performance. The explanatory variables are sampled from a zero-mean, 20-variable multivariate normal distribution with covariance matrix Σ_{ij} = ρ^{|i−j|}. Three different values of ρ are tested (0, .45, and .9), which spans uncorrelated to highly correlated variables. For each correlation structure, five different coefficient vectors are used to generate the dependent data. The nonzero coefficients are in two clusters of adjacent variables, with the clusters centered at variables 5 and 15. The initial coefficient values for the variables clustered around variable 5 are given by

    \beta_{5+j} = (h - |j|)^2, \qquad |j| \le h,

where h is a fixed integer controlling the cluster width. The cluster at variable 15 is generated in the same way. The influence of noise is studied by testing three levels of signal-to-noise (SN) ratio, SN ∈ {1, 5, 9}. To obtain the desired SN ratio, the coefficients are scaled so that β'X'Xβ/n = SN. The vector of dependent variables y is calculated from Xβ + ε, where the number of observations is n = 40 and ε is sampled from N(0, I). In this study, the explanatory variables and the response variable were centered before estimation.

The performance of the shrinkage methods is measured by calculating the mean model error (ME), defined as

    \mathrm{ME} = (\hat{\beta} - \beta)' \Sigma (\hat{\beta} - \beta) + (\bar{y} - \bar{x}' \hat{\beta})^2,    (12)

where ȳ and x̄ are the sample means of the response variable and the explanatory variables in each simulated dataset.
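The data-generating design just described can be written out compactly. The sketch below follows the stated design; the seed and function name are illustrative choices, not the authors' code:

```python
import numpy as np

def simulate(h=2, rho=0.45, sn=5.0, n=40, p=20, seed=None):
    """One simulated dataset: covariance Sigma_ij = rho^|i-j|,
    coefficient clusters at variables 5 and 15 (1-based), and the
    coefficients scaled so that beta'X'X beta / n = SN."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    cov = rho ** np.abs(np.subtract.outer(idx, idx))   # Sigma_ij = rho^|i-j|
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    beta = np.zeros(p)
    for center in (5, 15):
        for j in range(-h, h + 1):                     # beta_{c+j} = (h-|j|)^2
            beta[center - 1 + j] = (h - abs(j)) ** 2
    beta *= np.sqrt(sn / (beta @ X.T @ X @ beta / n))  # enforce the SN ratio
    y = X @ beta + rng.standard_normal(n)              # eps ~ N(0, I)
    return X, y, beta

X, y, beta = simulate(seed=1)
print(beta @ X.T @ X @ beta / len(y))   # equals the SN ratio by construction
```

Note that the boundary coefficients of each cluster, at |j| = h, are exactly zero, so h controls how many adjacent variables are truly active.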
Figures 2–4 show the average MEs for the methods. In the left-side graphs in each figure, the shrinkage factors are estimated using cross-validation. The Monte Carlo simulation was repeated 2,000 times for each combination of h and ρ, and the estimated standard errors of the points in the graphs are less than .02. Fivefold cross-validation was used for the mean subset, lasso, garrote, and best subset, and leave-one-out cross-validation was used for ridge regression. This method of estimating the shrinkage factors was suggested by Breiman (1996). Whereas the leave-one-out estimate has lower bias, it is degraded by its higher variance. Hence, leave-one-out cross-validation may be used for stable methods like ridge regression, whereas 5-fold or 10-fold cross-validation, with its higher bias, is suggested for unstable methods like best subset selection. In the right-side graphs in Figures 2–4, the true data-generating model is assumed known (referred to as the crystal ball), and the value of the shrinkage factor is selected such that the ME in (12) is minimized.

The graphs show that mean subset and ridge regression are complementary to one another. In cases with only a few nonzero coefficients, mean subset and best subset give good prediction, but in cases with many nonzero coefficients, ridge regression works best. When the shrinkage factor is estimated using cross-validation, mean subset consistently gives better prediction than best subset. The difference increases with an increasing number of nonzero coefficients and a decreasing SN ratio. The graphs also reveal that the main reason for this difference in prediction performance is the instability of the best subset method. This is demonstrated by the degradation in prediction performance when q is estimated using cross-validation instead of the crystal ball. The graphs also show that the lasso is preferred over the garrote and that the difference between these two methods increases with increasing collinearity between the explanatory variables. Furthermore, the lasso is better than or as good as mean subset, except when the underlying model is small and the SN ratio is high.

4. PREDICTION OF MOISTURE AND PROTEIN CONTENT IN WHEAT

In this real data example, the objective is to predict the amount of moisture and protein in wheat using near-infrared
TECHNOMETRICS, NOVEMBER 2002, VOL. 44, NO. 4