
Publisher: Taylor & Francis


Technometrics


Robust Estimates of Location and Dispersion for

High-Dimensional Datasets

Ricardo A Maronna^a& Ruben H Zamar^b

^aMathematics Department of the Faculty of Exact Sciences, Universidad

Nacional La Plata and Principal Researcher at C.I.C.P.B.A Argentina

^bDepartment of Statistics, University of British Columbia, Canada

Published online: 01 Jan 2012.

To cite this article: Ricardo A Maronna & Ruben H Zamar (2002) Robust Estimates of Location and Dispersion for

High-Dimensional Datasets, Technometrics, 44:4, 307-317, DOI: 10.1198/004017002188618509

To link to this article: http://dx.doi.org/10.1198/004017002188618509


Robust Estimates of Location and Dispersion for High-Dimensional Datasets

Ricardo A. Maronna
Mathematics Department of the Faculty of Exact Sciences
Universidad Nacional La Plata and Principal Researcher at C.I.C.P.B.A
Argentina
(rmaronna@mail.retina.ar)

Ruben H. Zamar
Department of Statistics
University of British Columbia
Canada
(ruben@stat.ubc.ca)

The computing times of high-breakdown point estimates of multivariate location and scatter increase

rapidly with the number of variables, which makes them impractical for high-dimensional datasets,

such as those used in data mining. We propose an estimator of location and scatter based on a modified

version of the Gnanadesikan–Kettenring robust covariance estimate. We compare its behavior with

that of the Stahel–Donoho (SD) and Rousseeuw and Van Driessen’s fast MCD (FMCD) estimates. In

simulations with contaminated multivariate normal data, our estimate is almost as good as SD and

clearly better than FMCD. It is much faster than both, especially for large dimension. We give examples

with real data with dimensions between 5 and 93, in which the proposed estimate is as good as or better

than SD and FMCD at detecting outliers and other structures, with much shorter computing times.

KEY WORDS: Data mining; Minimum covariance determinant; Robust covariances; Stahel–Donoho

estimate.

TECHNOMETRICS, NOVEMBER 2002, VOL. 44, NO. 4
DOI 10.1198/004017002188618509

1. INTRODUCTION

It is well known that the sample mean and covariance matrix, which are basic elements of many multivariate procedures, are sensitive to outlying observations. There are several approaches to deal with this problem. M estimates (Maronna 1976) are relatively simple to compute, but their breakdown point (i.e., the maximum proportion of outliers that the estimate can safely tolerate) is at most $1/p$, where $p$ is the dimension of the data. Different approaches have been proposed to overcome this difficulty. Some of them are based on the minimization of a robust scale of Mahalanobis distances: the minimum volume ellipsoid (MVE) and minimum covariance determinant (MCD) estimates (Rousseeuw 1984, 1985), S estimates (Davies 1987), and $\tau$ estimates (Lopuhaä 1991). Others are based on projections: the Stahel–Donoho estimate (SDE) proposed by Stahel (1982) and Donoho (1981) and studied by Maronna and Yohai (1995); P estimates (Maronna, Stahel, and Yohai 1992); and a recent proposal by Peña and Prieto (2001).

All of these estimates have a high breakdown point for all $p$; in fact, if conveniently tuned, they may attain the maximum breakdown point for affine-equivariant estimates (Davies 1987). However, their computation requires a heavy effort. Exact computation of the MCD may be performed through heuristic procedures (Agulló 1996), but nevertheless remains feasible only for small datasets. Feasible sets (Hawkins 1994) ensure attaining the solution with probability 1, but are very time-consuming for large $p$.

Approximate computing is usually based on taking a number $N_s$ of subsamples (generally of size $p + 1$) to obtain an initial set of solutions, which are the starting point for the search for a (hopefully global) extremum. Ruppert (1992) developed a heuristic procedure for S estimates. To ensure a given breakdown point, the value of $N_s$ must increase exponentially with $p$. A sufficiently high value of $N_s$ is also necessary to ensure stability of the result. In general, all of these methods are feasible for moderate $p$, but computing them for large $p$ in a reasonable time requires using values of $N_s$ that imply giving up a high breakdown point. Woodruff and Rocke (1993, 1994) proposed procedures to deal with this problem. Recently, Rousseeuw and Van Driessen (1999) proposed the "fast MCD" (FMCD), a procedure much more effective than naive subsampling for minimizing the objective function of the MCD, which seems capable of yielding "good" solutions without requiring huge values of $N_s$. But FMCD still requires substantial running times for large $p$. Recently, Peña and Prieto (2001) proposed a fast algorithm based on the kurtosis of projections, which does not require subsampling.

Much faster estimates can be computed if one drops the requirements of positive definiteness and affine equivariance. Early proposals of robust procedures are of this type (see Bickel 1964, Sen and Puri 1971). A straightforward approach for multivariate location is to simply apply a robust location estimate to each individual variable. In the case of multivariate scatter, one can similarly apply a robust covariance or correlation estimate to each pair of variables. Estimates of this type are called "coordinatewise" and "pairwise."

There are many proposals for robust univariate location estimates (see, e.g., Hampel, Ronchetti, Rousseeuw, and Stahel 1986), and also several proposals for the robust estimation of covariance or correlation of a pair of variables. The simplest methods are based on (a) ranks, such as Spearman's $\rho$ and Kendall's $\tau$ (Abdullah 1990); (b) winsorization of the data, such as the quadrant correlation and the "Huberized" covariance estimates (Huber 1981, p. 204); and (c) robustification of

the relationship between variances and covariances, initially proposed by Gnanadesikan and Kettenring (1972) and studied by Devlin, Gnanadesikan, and Kettenring (1981).

Unfortunately, the resulting multivariate location and scatter matrix estimates are not affine equivariant, and the scatter matrix is not guaranteed to be positive definite. Rousseeuw and Molenberghs (1993) proposed several methods to deal with the problem of negative eigenvalues. Note that although the scatter matrices obtained by approaches (a) and (b) are positive definite, they require a correction to make them consistent for normal data, and the correction destroys their positive definiteness.

In this article we present a general method to obtain positive-definite and approximately affine-equivariant robust scatter matrices starting from any pairwise robust scatter matrix. We apply our method to estimates obtained by the aforementioned method (c) to define multivariate location and scatter estimates that are shown to be as good as the equivariant ones reviewed before, while requiring much less computing effort. Although our estimates are not affine equivariant, they are shown to perform well even under very high collinearity. We give some numerical evidence indicating that the lack of equivariance is not a serious concern in our estimates.

We define the estimate in Section 2. In Section 3 we show the results of a simulation study comparing it to the SDE and FMCD under contaminated normal distributions. In Section 4 we treat some high-dimensional real datasets. In Section 5 we deal with the lack of equivariance of the estimates. In Section 6 we compare the computing times of the different estimates, and finally, in Section 7 we discuss the results.

2. THE ESTIMATE

The estimate defined by Gnanadesikan and Kettenring (1972) is based on the identity

\[ \mathrm{cov}(X, Y) = \tfrac{1}{4}\left( \sigma(X + Y)^2 - \sigma(X - Y)^2 \right), \tag{1} \]

where $\sigma$ is the standard deviation and $X, Y$ is a pair of random variables. These authors proposed to define a "robust covariance matrix" by using a robust scale as $\sigma$; they used a trimmed standard deviation. The resulting matrix is symmetric, but not necessarily positive semidefinite, and is not affine-equivariant either. Genton and Ma (1999) calculated its influence function and asymptotic efficiency.

Recall that if $V$ is the covariance matrix of the $p$-dimensional random vector $x$ and $\sigma$ denotes the standard deviation, then

\[ \sigma(a'x)^2 = a'Va \tag{2} \]

for all $a \in R^p$. The Gnanadesikan–Kettenring estimate forces (2) for a robust scale $\sigma$ and a small set of directions $a$. The P estimates of Maronna et al. (1992) attempt to fulfill (2) approximately for all directions.

To overcome the lack of positive semidefiniteness, we propose a modification that forces (2) for a set of "principal directions" and is based on the observation that the eigenvalues of the covariance matrix are the variances along the directions given by the respective eigenvectors. Let $x_1, \dots, x_n \in R^p$ be a dataset. As a general notation, call $X = [x_{ij}]$ the $n \times p$ matrix with rows $x_i'$ $(i = 1, \dots, n)$ and columns $X_j$ $(j = 1, \dots, p)$. Let $\sigma(\cdot)$ and $\mu(\cdot)$ be robust univariate dispersion and location statistics, and let $v(\cdot, \cdot)$ be a robust estimate of the covariance of two random variables. We define a scatter matrix $V(X)$ and a location vector $t(X)$ as follows:

1. Let $D = \mathrm{diag}(\sigma(X_1), \dots, \sigma(X_p))$ and $y_i = D^{-1}x_i$, $i = 1, \dots, n$.

2. Compute the "correlation matrix" $U = [U_{jk}]$, applying $v$ to the columns of $Y$, that is, $U_{jj} = 1$ and $U_{jk} = v(Y_j, Y_k)$, $j \ne k$.

3. Compute the eigenvalues $\lambda_j$ and eigenvectors $e_j$ of $U$ $(j = 1, \dots, p)$, and call $E$ the matrix whose columns are the $e_j$'s, so that $U = E \Lambda E'$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$.

4. Let

\[ A = DE \quad \text{and} \quad z_i = E'y_i = A^{-1}x_i, \tag{3} \]

so that $x_i = Az_i$, and define

\[ V(X) = A \Gamma A' \quad \text{and} \quad t(X) = A\nu, \tag{4} \]

where $\Gamma = \mathrm{diag}(\sigma(Z_1)^2, \dots, \sigma(Z_p)^2)$ and $\nu = (\mu(Z_1), \dots, \mu(Z_p))'$.

The first step makes the estimate scale-equivariant. The other steps are a kind of "principal components," replacing the $\lambda_j$'s (which may be negative) by the "robust variances" of the corresponding directions. Another way to view the estimate is to consider that if $U$ approximates the covariance matrix of $Y$, then $Z_1, \dots, Z_p$ should be approximately uncorrelated and hence should have a diagonal covariance matrix (i.e., $\Gamma$). Likewise, it is better to apply a coordinatewise location estimate to the (approximately uncorrelated) $Z_j$'s, and then transform back to the $X$ coordinates, than to apply it directly to the $X_j$'s.

We take as $v$ the Gnanadesikan–Kettenring estimator defined in (1), which in step 2 yields

\[ U_{jk} = \tfrac{1}{4}\left( \sigma(Y_j + Y_k)^2 - \sigma(Y_j - Y_k)^2 \right), \quad j \ne k. \]

The resulting estimate is called an orthogonalized Gnanadesikan–Kettenring (OGK) estimate.

The procedure can be iterated, computing $V$ and $t$ for $Z$ obtained in step 4, and then expressing them in the original coordinate system, that is,

\[ V_{(2)}(X) = A\,V(Z)\,A' \quad \text{and} \quad t_{(2)}(X) = A\,t(Z), \tag{5} \]

with $Z$ and $A$ defined in (3). Further iterations are defined likewise.

The definition can be extended to include zero scales. If $\sigma(X_j) = 0$, then define $Y_j = 0$ in step 1.

The estimate can be improved on by a reweighting step. Denote in general the Mahalanobis distances by

\[ d_i = d(x_i) = (x_i - t)'V^{-1}(x_i - t), \tag{6} \]

with $t = t(X)$ and $V = V(X)$. Let $W$ be a weight function, and define $t_w$ and $V_w$ as the weighted mean and covariance matrix, where each $x_i$ has weight $w_i = W(d_i)$, that is,

\[ t_w = \frac{\sum_i w_i x_i}{\sum_i w_i} \quad \text{and} \quad V_w = \frac{\sum_i w_i (x_i - t_w)(x_i - t_w)'}{\sum_i w_i}. \tag{7} \]
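Steps 1 to 4 and the iteration (5) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names (`tau_loc_scale`, `gk_cov`, `ogk`, `ogk_iterated`) are hypothetical, the univariate $\mu$ and $\sigma$ follow our reading of the weighted mean and $\tau$ scale that Section 3 defines in (11) with $c_1 = 4.5$ and $c_2 = 3$, and the code assumes that no column has zero scale (the zero-scale extension described above is not handled).

```python
import numpy as np

def tau_loc_scale(x, c1=4.5, c2=3.0):
    """Weighted mean and tau scale of a univariate sample (our reading of (11))."""
    med = np.median(x)
    s0 = np.median(np.abs(x - med))                      # MAD about the median
    w = np.clip(1.0 - ((x - med) / (c1 * s0)) ** 2, 0.0, None) ** 2   # W_{c1}
    mu = np.sum(w * x) / np.sum(w)
    rho = np.minimum(((x - mu) / s0) ** 2, c2 ** 2)      # rho_{c2}
    return mu, s0 * np.sqrt(np.mean(rho))

def gk_cov(x, y):
    """Gnanadesikan-Kettenring pairwise covariance, identity (1)."""
    s_plus = tau_loc_scale(x + y)[1]
    s_minus = tau_loc_scale(x - y)[1]
    return 0.25 * (s_plus ** 2 - s_minus ** 2)

def ogk(X):
    """One pass of steps 1-4. Returns (t, V, Z, A)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    # Step 1: divide each column by its robust scale (assumed nonzero).
    d = np.array([tau_loc_scale(X[:, j])[1] for j in range(p)])
    Y = X / d
    # Step 2: pairwise robust "correlation matrix" U.
    U = np.eye(p)
    for j in range(p):
        for k in range(j):
            U[j, k] = U[k, j] = gk_cov(Y[:, j], Y[:, k])
    # Step 3: eigenvectors of U; the eigenvalues are not used further.
    _, E = np.linalg.eigh(U)
    # Step 4: A = D E, and the rows of Z are z_i' = y_i' E.
    A = d[:, None] * E
    Z = Y @ E
    loc_scale = [tau_loc_scale(Z[:, j]) for j in range(p)]
    nu = np.array([m for m, _ in loc_scale])
    gamma = np.array([s ** 2 for _, s in loc_scale])
    V = (A * gamma) @ A.T          # A Gamma A'
    t = A @ nu
    return t, V, Z, A

def ogk_iterated(X):
    """Two iterations, as in (5): apply the estimate to Z and map back with A."""
    _, _, Z, A = ogk(X)
    t_z, V_z, _, _ = ogk(Z)
    return A @ t_z, A @ V_z @ A.T
```

Because $V = A \Gamma A'$ with $\Gamma$ a diagonal matrix of squared scales, the result is positive definite whenever all scales are positive, whatever the signs of the eigenvalues of $U$.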


The simplest $W$ is "hard rejection," with $W(d) = I(d \le d_0)$, where $I(\cdot)$ is the indicator function. We take

\[ d_0 = \frac{\chi^2_p(\beta)\,\mathrm{med}(d_1, \dots, d_n)}{\chi^2_p(.5)}, \tag{8} \]

where $\chi^2_p(\beta)$ is the $\beta$-quantile of the chi-squared distribution with $p$ degrees of freedom, and "med" denotes the median. Note that to compute (6) from (4), no matrix inversion is required, because

\[ d_i = \sum_{j=1}^{p} \left( \frac{z_{ij} - \mu(Z_j)}{\sigma(Z_j)} \right)^2. \]

As a general notation, OGK$_{(l)}$ henceforth denotes the OGK estimate with $l$ iterations, so that OGK$_{(1)}$ corresponds to the initial estimate (4); OGK$_{(l)}(\beta)$ denotes the reweighted version (7)–(8), and OGK remains the generic name of the family of estimates.

2.1 Properties

It follows from the definition that $t$ is shift-equivariant. It is easy to prove that if $\sigma$ and $\mu$ are consistent, then $t$ and $V$ in (4) are consistent for the location and shape of elliptical distributions. This is described more precisely in the following proposition.

Proposition 1. Let $x_1, \dots, x_n, \dots$ be iid with $x_i = Bu_i + t_0$, where $u_i$ has a spherical distribution. Put for $a \in R^p$: $X_a = \{a'x_1, \dots, a'x_n\}$. Assume that for all $a$, the limits in probability of $\mu(X_a)$ and of $\sigma(X_a)$ exist. Then when $n \to \infty$, $t$ converges in probability to $t_0$ and $V$ to $cBB'$, where $c$ is a scalar.

It is also easy to show that if the breakdown points of $\mu$ and $\sigma$ (for both implosion and explosion) are not less than $\varepsilon$, then so is the breakdown point of $(t, V)$ if the data are not collinear. More precisely, let $X = \{x_1, \dots, x_n\}$ be a univariate sample. For an integer $m \le n$, define the "contamination neighborhood" $\mathcal{X}_m$ of $X$ as the set of samples of size $n$ having $n - m$ elements in common with $X$, that is,

\[ \mathcal{X}_m = \{ \tilde X : \#(\tilde X) = n,\ \#(\tilde X \cap X) = n - m \}, \tag{9} \]

where $\#(\cdot)$ denotes the cardinality. Then the contamination breakdown point of $\mu$ at $X$ is

\[ \varepsilon^*(\mu, X) = \frac{1}{n} \max\Big\{ m : \sup_{\tilde X \in \mathcal{X}_m} |\mu(\tilde X)| < \infty \Big\}, \]

and the explosion and implosion breakdown points of $\sigma$ are

\[ \varepsilon^*_{+}(\sigma, X) = \frac{1}{n} \max\Big\{ m : \sup_{\tilde X \in \mathcal{X}_m} \sigma(\tilde X) < \infty \Big\} \]

and

\[ \varepsilon^*_{-}(\sigma, X) = \frac{1}{n} \max\Big\{ m : \inf_{\tilde X \in \mathcal{X}_m} \sigma(\tilde X) > 0 \Big\}. \]

Now let $X$ be a sample of size $n$ in $R^p$. The breakdown points of $t$ and $V$ at $X$ are

\[ \varepsilon^*(t, X) = \frac{1}{n} \max\Big\{ m : \sup_{\tilde X \in \mathcal{X}_m} \| t(\tilde X) \| < \infty \Big\} \]

and

\[ \varepsilon^*(V, X) = \frac{1}{n} \max\Big\{ m : 0 < \inf_{\tilde X \in \mathcal{X}_m} \lambda_1(V(\tilde X)) \le \sup_{\tilde X \in \mathcal{X}_m} \lambda_p(V(\tilde X)) < \infty \Big\}, \]

where $\lambda_1(V)$ and $\lambda_p(V)$ are the smallest and largest eigenvalues of $V$ and $\mathcal{X}_m$ is defined as in (9), but with the multivariate sample in place of the univariate one. Then for $t$ and $V$ in (4), we have the following.

Proposition 2. Assume that the data are not collinear, in the sense that $\sup\{ \#\{i : a'x_i = c\} : a \in R^p,\ a \ne 0,\ c \in R \}$ is suitably bounded. Let $\mu$ and $\sigma$ satisfy $\varepsilon^*(\mu, X) \ge \varepsilon$ and $\varepsilon^*_{\pm}(\sigma, X) \ge \varepsilon$ for all (univariate) $X$ whose number of tied values $\#\{i : x_i = c\}$ is correspondingly bounded for all $c \in R$. Then $\varepsilon^*(t, X) \ge \varepsilon$ and $\varepsilon^*(V, X) \ge \varepsilon$.

The proofs of Propositions 1 and 2 are straightforward and are not given here. Ma and Genton (2001, sec. 4.1) dealt only with the breakdown points of individual covariances computed through (1).

It should be noted that having a high breakdown point is not always an important merit for a nonequivariant estimate. For example, the "robust covariance matrix" defined as $\mathrm{diag}(\mathrm{MAD}(X_1)^2, \dots, \mathrm{MAD}(X_p)^2)$, where MAD stands for median absolute deviation, has breakdown .5!

The maximum bias under pointwise contamination has been computed for the MVE and the SDE (Yohai and Maronna 1990; Maronna and Yohai 1995). The lack of equivariance of the OGK makes the study of its bias extremely difficult.

3. SIMULATION

We have run a simulation comparing the SDE, FMCD, and OGK estimates. To evaluate their statistical behavior, we need situations in which the "true values" are known; we have chosen the contaminated multivariate normal model. Because exploring a full neighborhood is infeasible, we focus on point mass contamination; that is, for a sample $\{x_1, \dots, x_n\}$, the first $n - m$ elements are iid multivariate normal, and the remaining $m$ are equal to a fixed vector.

We used the SDE with "Huber weight function" following Maronna and Yohai (1995, p. 334), with threshold

\[ b = \sqrt{\chi^2_p(\beta)} \quad \text{with } \beta = .50. \tag{10} \]

The FMCD was computed through the algorithm of Rousseeuw and Van Driessen (1999), followed by a step of hard rejection with $\beta = .975$.

We used the estimates OGK$_{(l)}$ with $l = 1, 2$ and their reweighted versions OGK$_{(l)}(\beta)$ with $\beta = .90$, $.95$, and $.975$. Because $\beta = .9$ generally yielded the best results, only this case is shown, but $\beta = .95$ was almost as good. Exploratory simulations showed that iterations beyond the second did not lead to improvement. Numerical experiments do not show any convergence when iterating a large number of times.

Because we need robust and efficient scale and location estimates, we chose for $\sigma$ the "$\tau$ scale" of Yohai and


Zamar (1988), which is a truncated standard deviation, and a weighted mean for $\mu$. Define the functions

\[ W_c(x) = \left( 1 - \left( \frac{x}{c} \right)^2 \right)^2 I(|x| \le c) \quad \text{and} \quad \rho_c(x) = \min(x^2, c^2). \]

Let $X = \{x_1, \dots, x_n\}$ be a univariate sample and put

\[ \sigma_0 = \mathrm{MAD}(X) = \mathrm{med}(|X - \mathrm{med}(X)|) \quad \text{and} \quad w_i = W_{c_1}\!\left( \frac{x_i - \mathrm{med}(X)}{\sigma_0} \right). \]

Then the location and scale statistics are defined as

\[ \mu(X) = \frac{\sum_i x_i w_i}{\sum_i w_i} \quad \text{and} \quad \sigma(X)^2 = \frac{\sigma_0^2}{n} \sum_i \rho_{c_2}\!\left( \frac{x_i - \mu(X)}{\sigma_0} \right). \tag{11} \]

To combine robustness and efficiency, we took $c_1 = 4.5$ and $c_2 = 3$, which yield approximately 80% efficient univariate location and scale for both normal and Cauchy data. Simply using the median and the MAD clearly worsened the simulation results, especially for collinear data. Ma and Genton (2001) advocated using the scale estimate $Q_n$ proposed by Croux and Rousseeuw (1992) and Rousseeuw and Croux (1993), but we prefer (11) for reasons of speed. In the pure normal situation, the results for the sample mean and covariance are also shown.

Unfortunately, the procedure proposed by Peña and Prieto (2001) was not available to us when the simulation study was conducted. A comparison with this method would be of interest.

The sampling situations were $p$-variate normal $\varepsilon$-contaminated distributions, with $p$ taking the values 5 and 10, and $n = 10p$. In view of the lack of equivariance of the OGK estimate, its behavior may depend on the covariance structure; hence we generated correlated data as follows. Let $m = [n\varepsilon]$ (where $[\cdot]$ denotes the integer part); generate $y_i$ as $p$-variate normals $N_p(0, I)$ for $i = 1, \dots, n - m$, and as $N_p(y_0, \delta^2 I)$ for $i > n - m$, for some $y_0$ and $\delta$; we chose $\delta = .1$. The choice of a normal distribution with a small dispersion, rather than exact point-mass contamination, is due to the fact that exactly repeated points may cause problems with the subsampling algorithms used to compute the SDE and FMCD.

Put $x_i = Ry_i$, where $R$ is the matrix with

\[ R_{jj} = 1 \quad \text{and} \quad R_{jk} = \rho \ \text{ for } j \ne k. \tag{12} \]

Then for $\varepsilon = 0$, $X$ has covariance matrix $R^2$, and the multiple correlation $\rho_{\mathrm{mult}}$ between any coordinate of $X$ and all of the others is easily calculated as a function of $\rho$. We chose $\rho$ so that $\rho_{\mathrm{mult}}$ took on chosen values. If $\rho_{\mathrm{mult}}$ is high, then $X$ is concentrated around the line with direction $a_1 = (1, 1, \dots, 1)'$, the eigenvector of $R$ corresponding to its largest eigenvalue.

We took $y_0 = k a_0$, where $a_0$ is a unit vector. Preliminary simulations suggested that the least favorable direction for OGK is orthogonal to $a_1$. Given $b$, take $a_0 = b - (b'a_1 / p)\,a_1$ and then normalize it to unit norm. We tried two options, one using a fixed $b$ and the other taking $b$ at random with a spherical distribution. They yielded similar results, and we report those corresponding to the first option. The value of $k$ ranged over a set of values to search for the least favorable ones for location and scatter, which appear in the table as $k_t$ and $k_V$.

Exploratory simulations were run with different values of $\rho_{\mathrm{mult}}$: 0, .5, .7, .9, and .999. For $\rho_{\mathrm{mult}} = 0$ and .5, OGK behaved surprisingly well (similarly to SDE); but as could be expected, its behavior deteriorated with increasing $\rho_{\mathrm{mult}}$. The reweighted versions were more stable. We show only the results corresponding to the least favorable case, $\rho_{\mathrm{mult}} = .999$. This is a very collinear situation: the ratio of the variances of projections orthogonal to $a_1$ to those along $a_1$ is .0003 and .0002 for $p = 5$ and 10. This collinearity is much higher than that in the simulations of Devlin et al. (1981) and Ma and Genton (2001). Of course, the value of $\rho_{\mathrm{mult}}$ does not affect the other estimates, because they are equivariant.

For each estimate, the location vector $t$ and scatter matrix $V$ were computed, and then "back-transformed," $t_1 = R^{-1}t$, $V_1 = R^{-1}VR^{-1}$, with $R$ defined in (12). They were evaluated through the distributions of the "errors" $e_t = \|t_1\|^2$ and $e_V = \log_{10}(\mathrm{cond}(V_1))$ (the decimal logarithm). Their mean and $\alpha$-quantiles were computed, with $\alpha = .10$, $.5$, $.75$, and $.90$. Only the values corresponding to $\alpha = .75$ are shown; the others yield qualitatively similar results. The condition numbers are more easily displayed in the log scale, because they range between about 3 and 20,000.

The number of subsamples corresponding to $p = 5$ and 10 was 1,000 and 2,000 for SDE and 500 and 1,000 for FMCD. This is probably much larger than needed, but we wanted to see the behavior of these estimates at their best. The number of Monte Carlo replications was 1,000 in all cases. For each combination of $n$ and $p$, the samples were the same for all estimates and all $\varepsilon$ and $k$. The results are displayed in Table 1.

Discussion. The SDE appears to be the overall best estimate for point contamination, and FMCD appears to be the worst. Among the four variants of OGK, OGK$_{(2)}$(.9) seems [...] values for scale and location drop to .89 and .25.

4. REAL DATA

We analyzed several datasets with $p$ between 5 and 93. Here we show the results for the most interesting ones. For each dataset, we computed the same estimates as in the simulation. Because the reweighted versions of OGK always showed more structure than the raw ones, only the results for OGK$_{(1)}$(.9) and OGK$_{(2)}$(.9) are displayed, and the "(.9)" is omitted for brevity in this section. The number $N_s$ of subsamples is the default 500 for FMCD and depends on $p$ for SDE. The threshold of SDE is taken as in (10) for $p \le 10$; because for larger $p$ this


Table 1. Simulation Results

D

1

p

5 n 50

p

10 n 100

˜

Estimate

e_V

055

1003

059

057

054

059

048

e_t

k_V

k_t

e_V

e_t

k_V

k_t

0

SD(.5)

FMCD

OGK

016

025

018

017

016

013

054

090

056

057

054

056

053

014

020

017

015

017

015

013

OGK(.9)

OGK₍₂

)

OGK₍₂(.9)

)

Mean-covariance

.1

.2

SD

FMCD

OGK

OGK(.9)

OGK₍₂

073

1056

2049

081

1048

087

031

067

054

032

036

038

505

205

090

2010

2052

095

1068

1009

036

2056

061

048

042

50

10

200

5

200

6

5

4

200

4

200

4

200

4

10

200

5

7

6

)

OGK₍₂(.9)

059

)

SD

FMCD

OGK

OGK(.9)

OGK₍₂

1027

301

3035

1058

2050

1099

1051

2202

6022

3063

5018

9093

17

12

200

9

200

15

3

15

200

9

15

1056

4032

3046

1070

2067

2024

2078

35

70

200

10

200

20

5

70

200

10

9

51505

35037

4066

3087

17042

)

OGK₍₂(09)

20

)

NOTE: $e_t$ and $e_V$ are the error measures for $t$ and $V$, equal to the .75 quantiles of $\|t\|^2$ and of $\log_{10}(\mathrm{cond}(V))$. $k_t$ and $k_V$ are the respective contamination locations yielding the highest errors for each estimate.

may yield excessively large values, and hence a less robust estimate, we used a capped version of this threshold.

For each estimate $(V, t)$, call $d_i$ the Mahalanobis distances as in (6), put

\[ D_i = \frac{d_i\,\chi^2_p(.5)}{\mathrm{med}(d_1, \dots, d_n)}; \]

call $D_{(i)}$ the ordered $D_i$'s; and let $f_i = \chi^2_p(i/(n + 1))$. Then for normal data, we should have $D_{(i)} \approx f_i$. For each dataset, we plotted $D_i$ versus case $i$ and $D_{(i)}$ versus $f_i$.

4.1 Bushfire Data (Campbell 1989)

This dataset, containing satellite measurements on five frequency bands corresponding to each of $n = 38$ pixels, was analyzed by Maronna and Yohai (1995). Here $N_s = 500$ for SDE. Figures 1–2 display $D_i$ versus $i$. All estimates show the same structure, but with different degrees of emphasis. Pixels 32–38 appear as clear outliers, and also 31 to a lesser extent. But OGK$_{(1)}$ gives only faint indications of 7–9, whereas the other estimates clearly point out 7–11 and give some indications for 29 and 30. N. A. Campbell (personal communication) pointed out that the pixels may be classified as "burnt," "unburnt," and "water" and that the suspect ones lie on boundary areas between the classes.

Figure 1. Bushfire Data: Distance $D_i$ Versus Index $i$ for (a) FMCD and (b) SDE.

Figure 2. Bushfire Data: Distance $D_i$ Versus Index $i$ for (a) OGK$_{(1)}$ and (b) OGK$_{(2)}$.
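The hard-rejection threshold (8) and the scaled distances $D_i$ and quantiles $f_i$ used in the plots of this section can be sketched as follows. This is an illustrative sketch with hypothetical function names, not the authors' code. To stay within the Python standard library it replaces the exact chi-squared quantile with the Wilson–Hilferty approximation, which is our substitution and not part of the article; where SciPy is available, `scipy.stats.chi2.ppf` gives the exact quantile.

```python
from statistics import NormalDist, median

def chi2_quantile(beta, p):
    """Approximate beta-quantile of chi-squared with p degrees of freedom.

    Wilson-Hilferty approximation (an assumption made here for portability).
    """
    z = NormalDist().inv_cdf(beta)
    c = 2.0 / (9.0 * p)
    return p * (1.0 - c + z * c ** 0.5) ** 3

def scaled_distances(d, p):
    """D_i = d_i * chi2_p(.5) / med(d): distances rescaled to the chi-squared median."""
    s = chi2_quantile(0.5, p) / median(d)
    return [di * s for di in d]

def qq_coordinates(d, p):
    """Ordered D_(i) and reference quantiles f_i = chi2_p(i / (n + 1))."""
    n = len(d)
    D_sorted = sorted(scaled_distances(d, p))
    f = [chi2_quantile(i / (n + 1), p) for i in range(1, n + 1)]
    return D_sorted, f

def hard_rejection_weights(d, p, beta=0.9):
    """Weights W(d_i) = I(d_i <= d_0), with d_0 as in (8)."""
    d0 = chi2_quantile(beta, p) * median(d) / chi2_quantile(0.5, p)
    return [1 if di <= d0 else 0 for di in d]
```

Note that $d_i \le d_0$ is the same condition as $D_i \le \chi^2_p(\beta)$, so the hard-rejection step and the plotted distances share one scaling.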


4.2 Engineering Data

Rousseeuw and Van Driessen kindly supplied the data used in their article: nine characteristics measured on $n = 677$ diaphragm parts for TV sets. Here $N_s = 2{,}000$ for SDE. Figures 3–4 show $D_i$ versus $i$. It is seen that all estimates identify essentially the same structure: some isolated outliers, plus points 491–565, but FMCD and OGK$_{(2)}$ do so more strongly than SDE and OGK$_{(1)}$. The plot for mean-covariance (not shown here) identifies only the isolated outliers.

4.3 Ionospheric Data

This dataset from the Johns Hopkins University Ionosphere database was taken from the "Data Repository" of Bay (1999) and has been used by Sigillito, Wing, Hutton, and Baker (1989). It consists of 351 radar measurements on 34 continuous characteristics, which are the real and imaginary parts of the complex responses corresponding to each of 17 pulse numbers. The measurements are classified as "good" radar returns (those showing evidence of some type of structure in the ionosphere) or "bad" ones. We analyze the $n = 225$ "good" ones. Variables 1, 2, and 27 were omitted from the analysis because they had MAD $= 0$, so that here $p = 31$. These are very collinear data; the condition numbers of the covariance matrix and of the reweighted OGK estimate are about 4,000 and 14,000. We took $N_s = 2{,}000$ for the SDE. Plotting $D_i$ versus $i$ shows no structure. To plot $D_{(i)}$ versus $f_i$ we found the problem of the large range of the former, which prevents us from seeing details in the lower values; hence we plotted the square roots of $D_{(i)}$ and $f_i$. In each plot the curves were slightly displaced to avoid superimposing them. Figures 5–6 show that the data structure is more complex than just "normal data with outliers." The plots for both FMCD and OGK$_{(2)}$ show an almost straight part for $\sqrt{f_i} <$ about 5.7 (the smallest 128 distances), which may describe a "central part"

Figure 3. Engineering Data: Distance $D_i$ Versus Index $i$ for (a) FMCD and (b) SDE.

Figure 4. Engineering Data: Distance $D_i$ Versus Index $i$ for (a) OGK$_{(1)}$ and (b) OGK$_{(2)}$.

Figure 5. Q-Q Plots of Ionospheric Data for OGK$_{(1)}$ and OGK$_{(2)}$.

Figure 6. Q-Q Plot of Ionospheric Data for FMCD, SDE, and cov.


Table 2. Ionospheric Data: Points With the Largest Mahalanobis Distances

Estimate

Points with largest D_i(inverse order)

D

FMCD (N_s500) 96 95 18 62 26 14 33 27 202 56 116

41

29 119 129

D

1

SDE (N_s2 000) 95 96 27 18 62 116 14 26 56 85

41

18 203

64 215

OGK(.9)

85 95 84 96 81 83 202 109 214 14

95 96 62 14 18 85 202 27 26 41

94

81

62 130

OGK₍₂(.9)

)

Mean-covariance 95 96 62 27 18 116 40 14 26 85 108

of the data, followed by an abrupt increase. The points with largest distances are given in Table 2.

For a more detailed analysis of the data, we plotted for each observation the sequence of coordinates, but first placing the odd and then the even numbered ones (the real and imaginary parts of the signal). The following features emerged:

a. 138 of the 225 observations have 1 of 4 characteristic forms. Figure 7 plots observations 4, 32, 58, and 79, which are "pure specimens"; most specimens are noisier. Forms (d) and (a) are the most and the least abundant, with 70 and 10 points. Lacking subject matter knowledge, we ignore the physical meaning of the forms.

b. 22 observations look like a mixture of form (b) with (c) or (d).

c. 39 observations look like very noisy versions of type a or b.

d. 26 do not seem to belong to any of the former; these are subjective classifications.

The points with the largest Mahalanobis distances belong to type c or d. Figure 8 shows observations 95, 96, 41, and 27, which are among the first listed in Table 2.

The rank orders of the Mahalanobis distances for all estimates (except mean-covariance) follow the same pattern:

• Most points of types c and d are well above rank 128, where the break for FMCD and OGK$_{(2)}$ occurs.

• Most points of form (a) in Figure 7 are just above the break.

• Most points of forms (b), (c), and (d) in Figure 7 and type b are below rank 128.

We can thus conclude that the break in the plots for FMCD and OGK$_{(2)}$ corresponds to a real feature of the data and not to an artifact. The other estimates give no hint of this feature. We remark that this analysis has been made only to demonstrate the behavior of the estimates, and that further analysis and subject matter knowledge are needed to really understand this dataset.

4.4 Spectral Data

This dataset was also taken from Bay (1999). It is part of the Low-Resolution Spectrometer Database in the Infra-Red Astronomy Satellite Project and contains $n = 531$ high-quality spectra measured on $p = 93$ frequency bands. We used $N_s = 3{,}000$ for SDE. The results are displayed in Figures 9 and 10.

The mean and covariances point out only points 210 and maybe 307. Increasing $N_s$ to 10,000 does not change the SDE results very much. FMCD with $N_s = 3{,}000$ yields results similar to OGK$_{(1)}$. Table 3 displays the points with the largest $D_i$'s.

Here, too, OGK$_{(1)}$, OGK$_{(2)}$, and FMCD point out a break. Of the 302 points with $\sqrt{f_i} \le 9.8$, OGK$_{(2)}$ shares 293 with

Figure 7. Ionospheric Data: "Pure Specimens." (a) Observation 4; (b) observation 32; (c) observation 58; (d) observation 79.

Figure 8. Ionospheric Data: Outliers. (a) Observation 95; (b) observation 96; (c) observation 41; (d) observation 27.


OGK$^{(1)}$, 274 with FMCD, and 291 with SDE. All estimates share 262 points. Of the 20 points with the largest $D_i$, OGK$^{(2)}$, OGK$^{(1)}$, and FMCD share 16.

For a more detailed analysis, we plotted the sequence of coordinates for each observation. Figures 11(a) and (b) are two typical forms; (c) is a point “just above the break” (with rank order 306), and (d) is an outlier.

Points above the break are clearly different from (a) and (b) like (d), or noisy versions of (a) and (b). We can again conclude that the observed break reveals a real feature of the data.

Figure 9. Q-Q Plots of LRS Data for OGK$^{(1)}$ and OGK$^{(2)}$.

Figure 10. Q-Q Plots of LRS Data for FMCD and SDE.

4.5 Other Datasets

Several other datasets from Bay (1999) were also analyzed, namely Glass ($n = 76$, $p = 7$), Wine ($n = 59$, $p = 13$), VDBC ($n = 357$, $p = 30$), Segment ($n = 330$, $p = 16$), Pima ($n = 500$, $p = 8$), and Sat ($n = 961$, $p = 36$). In all cases, OGK$^{(2)}$ and FMCD yielded similar results, both finding more structure than SDE and OGK$^{(1)}$.

5. EQUIVARIANCE

In this section we investigate the effects of the lack of equivariance of our estimates on their performance. Given $X = \{x_1, \ldots, x_n\}$ and a nonsingular $p \times p$ matrix $A$, let $X_A = \{Ax_1, \ldots, Ax_n\}$. If the estimates are equivariant, then we should have

$$t(X_A) = A\,t(X) \quad \text{and} \quad V(X_A) = A\,V(X)\,A',$$

and hence to explore equivariance we should compare $t(X)$ and $V(X)$ with

$$t_A(X) = A^{-1}\,t(X_A) \quad \text{and} \quad V_A(X) = A^{-1}\,V(X_A)\,A'^{-1}.$$

Because exploring all transformations is infeasible, we generated random matrices as $A = TD$, where $T$ is a random orthogonal matrix and $D = \mathrm{diag}(u_1, \ldots, u_p)$, where the $u_i$'s are independent and uniformly distributed in (0,1).

The simulation of Section 3 was repeated for several of the sampling situations. For each generated $X$ a random $A$ was generated as described earlier, and the performance of $t_A$ and $V_A$ was evaluated. In general, the results were very similar to those for the untransformed estimates. Table 4 shows the results for $p = 5$, $n = 50$, and $\varepsilon = .2$, choosing the “least favorable situations” $k = 200$ and 9, corresponding to OGK with one and two iterations, and with or without reweighting. The columns “V” and “t” repeat the results of Table 1, and “$t_A$” and “$V_A$” correspond to the random transformation as described earlier.

It is seen that the effect of the transformation is stronger on V than on t. As a general pattern, the reweighted estimators were “more equivariant” in the sense that their performances were much less affected by the transformations.

To investigate the effect of transformations on an individual sample, we define measures of “lack of equivariance” for location and scatter, namely

$$d_t = \|t_A(X) - t(X)\| \quad \text{and} \quad d_V = \mathrm{cond}\bigl(U^{-1}\,V_A(X)\,U'^{-1}\bigr),$$

where $U$ is any matrix such that $V(X) = UU'$. Experiments were performed with real and simulated data.
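As a rough illustration of the random transformation $A = TD$ and of the two measures $d_t$ and $d_V$ defined above, the following sketch may help. It is not the code used for the experiments: it assumes NumPy, stores observations as rows, and takes any user-supplied `estimator` returning a (location, scatter) pair. The function names, the QR-based orthogonal draw, and the Cholesky factor as the choice of $U$ are all assumptions made for illustration.

```python
import numpy as np

def random_transformation(p, rng):
    """Draw A = T D, with T random orthogonal and D = diag(u_1..u_p), u_i ~ U(0,1)."""
    # QR of a Gaussian matrix gives an orthogonal factor; the sign
    # correction makes its distribution uniform over orthogonal matrices.
    q, r = np.linalg.qr(rng.standard_normal((p, p)))
    t = q * np.sign(np.diag(r))
    u = rng.uniform(0.0, 1.0, size=p)
    return t @ np.diag(u)

def lack_of_equivariance(x, estimator, a):
    """Return (d_t, d_V) for one sample x (n x p) and one nonsingular matrix a."""
    t, v = estimator(x)
    t_xa, v_xa = estimator(x @ a.T)      # rows of x @ a.T are the points A x_i
    a_inv = np.linalg.inv(a)
    t_a = a_inv @ t_xa                   # t_A = A^{-1} t(X_A)
    v_a = a_inv @ v_xa @ a_inv.T         # V_A = A^{-1} V(X_A) A'^{-1}
    d_t = np.linalg.norm(t_a - t)
    u = np.linalg.cholesky(v)            # one admissible U with V = U U'
    u_inv = np.linalg.inv(u)
    d_v = np.linalg.cond(u_inv @ v_a @ u_inv.T)
    return d_t, d_v

def mean_cov(x):
    """Equivariant reference estimator: sample mean and covariance."""
    return x.mean(axis=0), np.cov(x, rowvar=False)

def coordinatewise(x):
    """A deliberately non-equivariant toy estimator: medians and squared MADs."""
    med = np.median(x, axis=0)
    mad = np.median(np.abs(x - med), axis=0)
    return med, np.diag(mad ** 2)
```

For an equivariant estimator such as `mean_cov`, $d_t$ is zero and $d_V$ is one up to rounding; a non-equivariant estimator gives $d_t > 0$ and $d_V > 1$ in general, and repeating the call over many draws of `random_transformation` gives the kind of quantiles the text describes.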

Table 3. LRS Data: Points With the Largest Mahalanobis Distances

Estimate

Points with largest $D_i$ (inverse order)

FMCD ($N_s = 500$)

210

173

307

173

112

281

112

90

173

307

90

307

90

2

112

2

281

245

281

193

472

193

451

67

451

67

2

67

370

271

147

370

SDE ($N_s = 3{,}000$)

OGK(.9)

90

OGK$^{(2)}$(.9)

307

2


As an example, we show the results corresponding to the ionospheric data of Section 4.3. The number of random transformations was 200. The number of iterations ranged between 1 and 4, with and without reweighting. Because the data range between $-1$ and 1, no scaling was used for $d_t$. Table 5 gives the maximum and the quantiles of $d_t$ and $d_V$ at levels .5, .7, .8, and .9. No improvement was found beyond the second iteration. Because the values of $d_V$ for the estimators without reweighting were about 100 times higher than those with reweighting, only the latter are shown in Table 5.

Table 5. Measures of Lack of Equivariance for Ionospheric Data

Measure   Estimate           .5     .7     .8     .9     Max
$d_V$     OGK$^{(1)}$(.9)    203    264    304    373    757
          OGK$^{(2)}$(.9)    239    390    465    555    888
$d_t$     OGK$^{(1)}$(.9)    .39    .42    .43    .45    .51
          OGK$^{(2)}$(.9)    .47    .53    .55    .58    .67

Table 5 reveals that here the effect of transformations is much stronger on V than on t. Figure 12 shows the plots of Mahalanobis distances corresponding to different transformations: the untransformed data (as in Fig. 5) and the transformations corresponding to the .80 and .90 quantiles and to the maximum of $d_V$. For the .80 quantile, the plot is almost indistinguishable from that of the original data, and the ordering of the $d_i$'s is essentially the same as in Table 2. For the .90 quantile, the basic features still remain, and some are still visible in the maximum case.

The following features were observed for all of the examined datasets:

• Location is much less affected than scatter.

• Reweighting makes the estimates much more equivariant.

• Further iterations do not improve on the behavior.

• Although the worst case may differ from the original data, for most transformations the results are very similar.

These results suggest that the consequences of the lack of equivariance of the estimates are not serious.

This experiment has been conducted only to demonstrate the behavior of the estimates. For this dataset, the original coordinate system is clearly the most natural one.

Figure 11. LRS Data. (a) and (b) Two “typical observations,” 262 and 104; (c) one intermediate observation, 122; and (d) one outlier, 90.

6. COMPUTING TIMES

To compare the computing times of the different estimates, we generated random samples with different values of $n$ and $p$. We ran the experiments on a PC with a 550-MHz Intel Pentium processor with 128 Mb RAM. We first ran them in Fortran, using for FMCD the code kindly supplied by Rousseeuw and van Driessen. It turned out that the running times for FMCD were at least 100 times those for OGK, which may be due to paging. Because we could not overcome this problem, we decided to run the experiment in Gauss (version 3.2.32). This should be more favorable to SDE and FMCD, because their computing effort consists mainly of vector and matrix operations, which a matrix language like Gauss performs very quickly, whereas almost half of the time for OGK is spent computing medians.

To make our method run faster, we did not use the built-in Gauss command “median,” which uses sorting.
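To illustrate the difference between a sort-based median and one obtained by selection, the sketch below contrasts the two. It is written in Python with NumPy rather than Gauss, and `np.partition` stands in for a selection routine; it is not the Numerical Recipes procedure and not the authors' code, and the function names are illustrative only.

```python
import numpy as np

def median_by_sorting(x):
    """Median via a full sort, which costs O(n log n)."""
    s = np.sort(np.asarray(x, dtype=float))
    n = s.size
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

def median_by_selection(x):
    """Median via partial partitioning, which is linear in n."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mid = n // 2
    if n % 2:
        # Only the middle order statistic has to land in its sorted position.
        return np.partition(x, mid)[mid]
    # For even n, place both middle order statistics and average them.
    part = np.partition(x, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])
```

Both functions return the same value; the second avoids ordering the whole sample, which matters when medians are computed many times, as in an estimator that spends a large share of its time on them.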

Table 4. Simulation for $p = 5$, $n = 50$, $\varepsilon = .2$ With Fixed and Random Coordinates

k

V

V_A

t

t_A

200

OGK ₁

3035

065

2050

062

3029

069

1098

066

6022

020

4082

021

3001

023

1063

024

OGK⁽₁⁾(09)

OGK⁽₂

)

OGK⁽₂⁾(09)

(

)

9

OGK ₁

1068

1058

1056

1062

1085

1062

1061

1065

097

3063

3073

3083

1078

3074

3092

3094

OGK⁽₁⁾(09)

OGK⁽₂

)

OGK⁽₂(09)

)

Figure 12. Q-Q Plots for Transformed Ionospheric Data. (a)–(d) correspond to original data, .80- and .90-quantiles, and maximum of $d_V$.


Rather, we used a selection algorithm (the procedure “select” in section 8.5 of Press et al. (1992)), which is linear in $n$.

We implemented steps 1–4 of the algorithm in section 5 of Rousseeuw and van Driessen (1999). The running times of SDE, FMCD, and OGK$^{(1)}$ were measured for 20% contaminated normal samples with $p = 20$, 40, 60, and 80 and $n = 200$, 400, and 800. The number of subsamples was $N_s = 500$ in all cases. Whereas the running times of SDE and OGK are practically independent of the dataset, this is not so for FMCD, which seems to require more time (i.e., more iterations) for contaminated data than for pure normal data.

We have not tried larger $n$'s for several reasons. First, we were concerned with the problem of large $p$ rather than large $n$. Second, when $n$ is larger than a certain $n_0$ (the default is 600), Rousseeuw and van Driessen’s FMCD algorithm applies an ingenious splitting procedure to reduce the number of evaluations. For OGK, a time-saving procedure may be as follows. When $n$ is larger than some $n_0$, take a random subsample of size $n_1$ and use it to perform steps 1, 2, and 3 of the definition in Section 2; then use the whole sample for (4) and (7); $n_1$ probably should depend on $p$. It is difficult to determine theoretically how much the statistical performance of FMCD and OGK deteriorates with this savings, so that further experiments would be necessary to determine an adequate choice of $n_0$ and $n_1$.

Table 6 gives the running times in seconds. It is seen that those for FMCD are between 22 and 46 times those for OGK$^{(1)}$. Note that the values of $N_s$ actually required by SDE are much larger than the 500 used for testing. Actually, the number of subsamples required to ensure an average of five “good” ones for $\varepsilon = .2$ and $p = 20$, 40, 60, and 80 are around 400, $4 \times 10^4$, $3 \times 10^6$, and $3 \times 10^8$. Table 7 shows the running times for the real datasets in the preceding section, in seconds.

Table 6. Times for Simulated Data in Seconds

                                  p
Estimate              n       20      40      60      80
FMCD ($N_s = 500$)    200     13.9    42.6    89.8    202.7
                      400     33.6    86.3    171.5   417.6
                      800     74.9    178.9   333.5   726.0
SDE ($N_s = 500$)     200     3.0     9.7     25.7    57.9
                      400     4.7     12.5    27.6    65.3
                      800     8.3     17.0    37.6    70.1
OGK                   200     .46     1.5     3.6     7.3
                      400     .87     3.9     5.4     11.9
                      800     1.6     7.0     12.3    17.6

Table 7. Times for Real Datasets

Dataset        n      p     N_SD     N_FM    OGK     SDE      FMCD
Bushfire       38     5     500      500     .04     .4       .8
Engineering    677    9     2,000    500     .32     14.9     20.3
Ionospheric    225    31    3,000    500     1.1     29.3     21.2
Spectral       531    93    3,000    500     19.4    458.3    615.6

7. DISCUSSION

There is probably no estimate that is fully satisfactory. FMCD is equivariant, but, although the empirical results with $N_s = 500$ are satisfactory, it is difficult to determine for a given $p$ which $N_s$ ensures a given breakdown point. Moreover, the simulations show that it may behave poorly under point mass contamination. SDE is equivariant, and for moderate $p$ it does a good job under point mass contamination, but with real data, it seems to fail to detect interesting structures, and for large $p$, it requires impractically large values of $N_s$ to ensure a high breakdown point. Finally, OGK is not equivariant, but it performs well in simulations with point mass contamination and performs similarly to FMCD with high-dimensional real data, all at a computational cost much lower than that of its competitors. The weighted versions are better and are “more equivariant,” as demonstrated in Section 5. Iterating seems advantageous; OGK$^{(2)}$(.9) is better than OGK$^{(1)}$(.9) for the real datasets in Sections 4.2, 4.3, and 4.5. It must be added that even for moderate datasets, a very fast procedure has the advantage of allowing the use of computer-intensive methods, such as the bootstrap and cross-validation.

ACKNOWLEDGMENT

Ruben Zamar’s research was partially funded by NSERC, Canada.

[Received May 2001. Revised February 2002.]

REFERENCES

Abdullah, M. B. (1990), “On a Robust Correlation Coefficient,” The Statistician, 39, 455–460.

Agulló, J. (1996), “Exact Iterative Computation of the Multivariate Minimum Volume Ellipsoid Estimator With a Branch and Bound Algorithm,” in Proceedings in Computational Statistics, ed. A. Prat, Heidelberg: Physica-Verlag, pp. 175–180.

Bay, S. D. (1999), “The UCI KDD Archive” [http://kdd.ics.uci.edu], University of California, Irvine, Dept. of Information and Computer Science.

Bickel, P. J. (1964), “On Some Alternative Estimates for Shift in the p-Variate One-Sample Problem,” Annals of Mathematical Statistics, 35, 1079–1090.

Campbell, N. A. (1989), “Bushfire Mapping Using NOAA AVHRR Data,” technical report, CSIRO.

Croux, C., and Rousseeuw, P. J. (1992), “Time-Efficient Algorithms for Two Highly Robust Estimators of Scale,” Computational Statistics, 2, 411–428.

Davies, P. L. (1987), “Asymptotic Behavior of S-Estimates of Multivariate Location Parameters and Dispersion Matrices,” The Annals of Statistics, 15, 1269–1292.

Devlin, S. J., Gnanadesikan, R., and Kettenring, J. R. (1981), “Robust Estimation of Dispersion Matrices and Principal Components,” Journal of the American Statistical Association, 76, 354–362.

Donoho, D. L. (1982), “Breakdown Properties of Multivariate Location Estimators,” Ph.D. qualifying paper, Harvard University.

Genton, M. G., and Ma, Y. (1999), “Robustness Properties of Dispersion Estimators,” Statistics and Probability Letters, 44, 343–350.

Gnanadesikan, R., and Kettenring, J. R. (1972), “Robust Estimates, Residuals, and Outlier Detection With Multiresponse Data,” Biometrics, 28, 81–124.

Hampel, F. R., Ronchetti, E., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.

Hawkins, D. M. (1994), “The Feasible Solution Algorithm for the Minimum Covariance Determinant Estimator in Multivariate Data,” Computational Statistics and Data Analysis, 17, 197–210.

Huber, P. J. (1981), Robust Statistics, New York: Wiley.

Lopuhaä, H. P. (1991), “Multivariate τ-Estimators for Location and Scatter,” Canadian Journal of Statistics, 19, 307–321.


Ma, Y., and Genton, M. G. (2001), “Highly Robust Estimation of Dispersion Matrices,” Journal of Multivariate Analysis, 78, 11–36.

Maronna, R. A. (1976), “Robust M-Estimates of Multivariate Location and Scatter,” The Annals of Statistics, 4, 51–56.

Maronna, R. A., Stahel, W. A., and Yohai, V. J. (1992), “Bias-Robust Estimators of Multivariate Scatter Based on Projections,” Journal of Multivariate Analysis, 42, 141–161.

Maronna, R. A., and Yohai, V. J. (1995), “The Behavior of the Stahel–Donoho Robust Multivariate Estimator,” Journal of the American Statistical Association, 90, 330–341.

Peña, D., and Prieto, F. J. (2001), “Multivariate Outlier Detection and Robust Covariance Matrix Estimation,” Technometrics, 43, 286–301.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992), Numerical Recipes in Fortran, New York: Cambridge University Press.

Rousseeuw, P. J. (1984), “Least Median of Squares Regression,” Journal of the American Statistical Association, 79, 871–881.

Rousseeuw, P. J. (1985), “Multivariate Estimation With High Breakdown Point,” in Mathematical Statistics and Applications, Vol. B, eds. G. S. Maddala and C. R. Rao. Amsterdam: Elsevier, pp. 101–121.

Rousseeuw, P. J., and Croux, C. (1993), “Alternatives to the Median Absolute Deviation,” Journal of the American Statistical Association, 88, 1273–1283.

Rousseeuw, P. J., and Molenberghs, G. (1993), “Transformation of Nonpositive Semidefinite Correlation Matrices,” Communications in Statistics, Part A—Theory and Methods, 22, 965–984.

Rousseeuw, P. J., and van Driessen, K. (1999), “A Fast Algorithm for the Minimum Covariance Determinant Estimator,” Technometrics, 41, 212–223.

Ruppert, D. (1992), “Computing S-Estimators for Regression and Multivariate Location/Dispersion,” Journal of Computational and Graphical Statistics, 1, 253–270.

Sen, P. K., and Puri, M. L. (1971), Nonparametric Methods in Multivariate Analysis, New York: Wiley.

Sigillito, V. G., Wing, S. P., Hutton, L. V., and Baker, K. B. (1989), “Classification of Radar Returns From the Ionosphere Using Neural Networks,” Johns Hopkins APL Technical Digest, 10, 262–266.

Stahel, W. A. (1981), “Breakdown of Covariance Estimators,” Research Report 31, Fachgruppe für Statistik, ETH Zürich.

Woodruff, D. L., and Rocke, D. M. (1993), “Heuristic Search Algorithms for the Minimum Volume Ellipsoid,” Journal of Computational and Graphical Statistics, 2, 69–95.

Woodruff, D. L., and Rocke, D. M. (1994), “Computable Robust Estimation of Multivariate Location and Shape in High Dimension Using Compound Estimators,” Journal of the American Statistical Association, 89, 888–896.

Yohai, V. J., and Maronna, R. A. (1990), “The Maximum Bias of Robust Covariances,” Communications in Statistics, Part A—Theory and Methods, 19, 3925–3933.

Yohai, V. J., and Zamar, R. (1988), “High Breakdown Point Estimates of Regression by Means of the Minimization of an Efficient Scale,” Journal of the American Statistical Association, 86, 403–413.
