RICARDO A. MARONNA AND RUBEN H. ZAMAR
7. DISCUSSION

There is probably no estimate that is fully satisfactory. FMCD is equivariant, but, although the empirical results with $N_s = 500$ are satisfactory, it is difficult to determine for a given $p$ which $N_s$ ensures a given breakdown point. Moreover, the simulations show that it may behave poorly under point mass contamination. SDE is equivariant, and for moderate $p$ it does a good job under point mass contamination, but with real data it seems to fail to detect interesting structures, and for large $p$ it requires impractically large values of $N_s$ to ensure a high breakdown point. Finally, OGK is not equivariant, but it performs well in simulations with point mass contamination and performs similarly to FMCD with high-dimensional real data, all at a computational cost much lower than that of its competitors. The weighted versions are better and are “more equivariant,” as demonstrated in Section 5. Iterating seems advantageous; $\mathrm{OGK}^{(2)}_{(.9)}$ is better than $\mathrm{OGK}^{(1)}_{(.9)}$ for the real datasets in Sections 4.2, 4.3, and 4.5. It must be added that even for moderate datasets, a very fast procedure has the advantage of allowing the use of computer-intensive methods, such as the bootstrap and cross-validation.

Table 6. Times for Simulated Data in Seconds

                                        p
Estimate              n       20       40       60       80
FMCD (N_s = 500)     200     13.9     42.6     89.8    202.7
                     400     33.6     86.3    171.5    417.6
                     800     74.9    178.9    333.5    726.0
SD (N_s = 500)       200      3.0      9.7     25.7     57.9
                     400      4.7     12.5     27.6     65.3
                     800      8.3     17.0     37.6     70.1
OGK                  200     0.46      1.5      3.6      7.3
                     400     0.87      3.9      5.4     11.9
                     800      1.6      7.0     12.3     17.6

used a selection algorithm (the procedure “select” in section 8.5 of Press et al. (1992)), which is linear in $n$.

We implemented steps 1–4 of the algorithm in section 5 of Rousseeuw and van Driessen (1999). The running times of SDE, FMCD, and $\mathrm{OGK}^{(1)}$ were measured for 20% contaminated normal samples with $p = 20$, 40, 60, and 80 and $n = 200$, 400, and 800. The number of subsamples was $N_s = 500$ in all cases. Whereas the running times of SDE and OGK are practically independent of the dataset, this is not so for FMCD, which seems to require more time (i.e., more iterations) for contaminated data than for pure normal data.
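The linear-time selection step mentioned above can be illustrated with a minimal Python sketch. It uses quickselect with a random pivot to find a median without a full sort. This is an assumed stand-in, not the “select” routine of Press et al. (1992): the names `quickselect` and `median_by_selection` are hypothetical, and the running time is linear in $n$ in expectation only, not in the worst case.

```python
import random


def quickselect(values, k, rng=None):
    """Return the k-th smallest element (0-based) of values.

    Illustrative quickselect with a random pivot: expected O(n) time.
    Assumed stand-in, not the routine from Press et al. (1992).
    """
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    if rng is None:
        rng = random.Random(0)  # fixed seed so runs are reproducible
    items = list(values)  # work on a copy; the input is left untouched
    while True:
        pivot = items[rng.randrange(len(items))]
        lower = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        if k < len(lower):
            items = lower  # the answer lies among the smaller values
        elif k < len(lower) + len(equal):
            return pivot  # the pivot itself is the k-th smallest
        else:
            k -= len(lower) + len(equal)  # discard everything <= pivot
            items = [v for v in items if v > pivot]


def median_by_selection(values):
    """Median from one or two order statistics, without a full sort."""
    n = len(values)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return float(quickselect(values, mid))
    return 0.5 * (quickselect(values, mid - 1) + quickselect(values, mid))
```

For example, `median_by_selection([5, 1, 4, 2, 3])` returns `3.0`. Each pass discards part of the data, so the expected total work is proportional to $n$, whereas sorting first would cost on the order of $n \log n$.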
We have not tried larger $n$’s for several reasons. First, we were concerned with the problem of large $p$ rather than large $n$. Second, when $n$ is larger than a certain $n_0$ (the default is 600), Rousseeuw and van Driessen’s FMCD algorithm applies an ingenious splitting procedure to reduce the number of evaluations. For OGK, a time-saving procedure may be as follows. When $n$ is larger than some $n_0$, take a random subsample of size $n_1$ and use it to perform steps 1, 2, and 3 of the definition in Section 2; then use the whole sample for (4) and (7); $n_1$ probably should depend on $p$. It is difficult to determine theoretically how much the statistical performance of FMCD and OGK deteriorates with this savings, so that further experiments would be necessary to determine an adequate choice of $n_1$ and $n_0$.

Table 6 gives the running times in seconds. It is seen that those for FMCD are between 22 and 46 times those for $\mathrm{OGK}^{(1)}$. Note that the values of $N_s$ actually required by SDE are much larger than the 500 used for testing. Indeed, the numbers of subsamples required to ensure an average of five “good” ones for $\varepsilon = .2$ and $p = 20$, 40, 60, and 80 are around 400, $4 \times 10^4$, $3 \times 10^6$, and $3 \times 10^8$. Table 7 shows the running times, in seconds, for the real datasets in the preceding section.

Table 7. Times for Real Datasets

Dataset          n      p     N_SD     N_FM      OGK      SDE     FMCD
Bushfire        38      5      500      500     0.04      0.4      0.8
Engineering    677      9    2,000      500     0.32     14.9     20.3
Ionospheric    225     31    3,000      500      1.1     29.3     21.2
Spectral       531     93    3,000      500     19.4    458.3    615.6

ACKNOWLEDGMENT

Ruben Zamar’s research was partially funded by NSERC, Canada.

[Received May 2001. Revised February 2002.]

REFERENCES

Abdullah, M. B. (1990), “On a Robust Correlation Coefficient,” The Statistician, 39, 455–460.
Agulló, J. (1996), “Exact Iterative Computation of the Multivariate Minimum Volume Ellipsoid Estimator With a Branch and Bound Algorithm,” in Proceedings in Computational Statistics, ed. A. Prat, Heidelberg: Physica-Verlag, pp. 175–180.
Bay, S. D. (1999), “The UCI KDD Archive” [http://kdd.ics.uci.edu], University of California, Irvine, Dept. of Information and Computer Science.
Bickel, P. J. (1964), “On Some Alternative Estimates for Shift in the p-Variate One-Sample Problem,” Annals of Mathematical Statistics, 35, 1079–1090.
Campbell, N. A. (1989), “Bushfire Mapping Using NOAA AVHRR Data,” technical report, CSIRO.
Croux, C., and Rousseeuw, P. J. (1992), “Time-Efficient Algorithms for Two Highly Robust Estimators of Scale,” Computational Statistics, 2, 411–428.
Davies, P. L. (1987), “Asymptotic Behavior of S-Estimates of Multivariate Location Parameters and Dispersion Matrices,” The Annals of Statistics, 15, 1269–1292.
Devlin, S. J., Gnanadesikan, R., and Kettenring, J. R. (1981), “Robust Estimation of Dispersion Matrices and Principal Components,” Journal of the American Statistical Association, 76, 354–362.
Donoho, D. L. (1982), “Breakdown Properties of Multivariate Location Estimators,” Ph.D. qualifying paper, Harvard University.
Genton, M. G., and Ma, Y. (1999), “Robustness Properties of Dispersion Estimators,” Statistics and Probability Letters, 44, 343–350.
Gnanadesikan, R., and Kettenring, J. R. (1972), “Robust Estimates, Residuals, and Outlier Detection With Multiresponse Data,” Biometrics, 28, 81–124.
Hampel, F. R., Ronchetti, E., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.
Hawkins, D. M. (1994), “The Feasible Solution Algorithm for the Minimum Covariance Determinant Estimator in Multivariate Data,” Computational Statistics and Data Analysis, 17, 197–210.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Lopuhaä, H. P. (1991), “Multivariate τ-Estimators for Location and Scatter,” Canadian Journal of Statistics, 19, 307–321.
TECHNOMETRICS, NOVEMBER 2002, VOL. 44, NO. 4