
Econometrica, Vol. 71, No. 3 (May, 2003), 933–946

FINITE MIXTURE DISTRIBUTIONS, SEQUENTIAL LIKELIHOOD AND THE EM ALGORITHM

By Peter Arcidiacono and John Bailey Jones¹

A popular way to account for unobserved heterogeneity is to assume that the data are

drawn from a ﬁnite mixture distribution. A barrier to using ﬁnite mixture models is that

parameters that could previously be estimated in stages must now be estimated jointly:

using mixture distributions destroys any additive separability of the log-likelihood function.

We show, however, that an extension of the EM algorithm reintroduces additive sepa-

rability, thus allowing one to estimate parameters sequentially during each maximization

step. In establishing this result, we develop a broad class of estimators for mixture models.

Returning to the likelihood problem, we show that, relative to full information maximum

likelihood, our sequential estimator can generate large computational savings with little

loss of efﬁciency.

Keywords: Unobserved heterogeneity, mixture distributions, EM algorithm, dynamic

discrete choice.

1. INTRODUCTION

One way to account for unobserved heterogeneity in data, and the related problem

of self-selection, is to assume that the data are drawn from a ﬁnite mixture distribution.

Under this approach, each observation is assumed to belong to one of several different

“types,” each of which has its own distribution. While the econometrician does not observe

each observation’s type, if her model is sufﬁciently structured she can infer it by applying

Bayes’ Theorem.

Models with finite mixtures have appeared in numerous applications.² In labor

economics, Keane and Wolpin (1997) and Eckstein and Wolpin (1999) use mixtures to control

for person-speciﬁc differences in models of dynamic discrete choice. Finite mixture models

form the basis of Hamilton’s (1989, 1990) inﬂuential regime-switching model of economic

time series. A particularly important application has been to use ﬁnite mixture models

as nonparametric approximations to more general mixture models. Important papers in

this vein include Laird (1978), Lindsay (1983), and Heckman and Singer (1984). More

recently, Cameron and Heckman (1998, 2001) use this sort of nonparametric maximum

likelihood estimation to study the effect of family background on educational achieve-

ment. Mroz (1999) uses mixtures to control for endogeneity in a binary explanatory vari-

able. He shows that “discrete factor approximations” to a continuous latent variable often

¹ We thank Donald Andrews, Arie Beresteanu, Mark Coppejans, Michael McCracken, Tom Mroz, Barbara Rossi, Wilbert van der Klaauw, and two anonymous referees for valuable comments.

² Although we focus on economic applications, finite mixture models have been used widely in other fields as well. Titterington, Smith, and Makov (1985) and McLachlan and Peel (2000) provide lists.


outperform alternative estimators, especially when the unobservable components of the

model have a non-normal distribution.

One drawback to using mixture models is that they can complicate the estimation pro-

cess. In this paper we focus on a particular problem, namely the issue of sequential like-

lihood. Some complicated likelihood models can be feasibly estimated only in stages; a

subset of the parameters is estimated using one portion of the likelihood function, with

the remainder of the parameters estimated with the remainder of the likelihood func-

tion, using the parameters estimated in the preceding step(s). While introducing a mixture

distribution seemingly prevents one from proceeding sequentially, we show that if one

extends the Expectation-Maximization (EM) algorithm, one can still estimate the likeli-

hood function in steps.

In contrast to the EM algorithm, which is ultimately a search algorithm, our procedure

does not yield full information maximum likelihood (FIML) estimates. Rather, our proce-

dure introduces a broad class of estimators for mixture and switching models. In particu-

lar, a simple argument shows that any moment condition that holds across the unobserved

“types” or “states” generates a moment condition that holds across the observed data.

In addition to providing general results, we construct a Monte Carlo exercise that

shows the large savings in computational time from employing the EM algorithm with

a sequential maximization step (ESM). Although the gains to using the method are

problem-speciﬁc, we show reductions in computing time on the order of 20 for a relatively

simple problem. More complicated problems should show even larger reductions. A fur-

ther beneﬁt of the ESM algorithm is that moving from a problem without unobserved

heterogeneity to one with unobserved heterogeneity requires little change in computer

code.

The next section reviews mixture distributions and the EM algorithm. Section 3 shows

how the EM algorithm introduces an additive separability not previously present in mix-

ture models. This allows for a sequential maximization step. Section 4 describes the asymp-

totics of our estimator, and shows how it can be generalized. Section 5 provides simula-

tions showing that the ESM estimator performs as well as FIML and takes signiﬁcantly

less time to converge. Section 6 concludes.

2. MIXTURE DISTRIBUTIONS AND THE EM ALGORITHM

The general relationship between mixtures and the EM algorithm has been covered in

a number of sources, such as Everitt and Hand (1981), Titterington, Smith, and Makov

(1985), and Hamilton (1990). We provide a brief review.

Consider a panel data set of I individuals, where for each individual i we observe T realizations of the J-element vector $x_{it}$. Observations of $x_{it}$ are independent across individuals, although not necessarily across time. As a matter of notation, let the collection of x-vectors for agent i be denoted by the J·T-element vector $x_i = [x_{i1}, x_{i2}, \ldots, x_{iT}]'$.

Each individual belongs to one of K distinct types. While the econometrician knows K, he does not observe individuals' types. Let $p_k$ denote the unconditional probability that an individual belongs to type k, with $p = (p_1, p_2, \ldots, p_K)$ denoting the vector of these probabilities. Letting $f_k(\cdot)$ denote the density function for type k, and letting $\theta = (\theta_1, \ldots, \theta_M)$ denote a vector of parameters, the unconditional likelihood of $x_i$ is

$$g(x_i; \theta, p) = \sum_{k=1}^{K} p_k f_k(x_i; \theta).$$


It follows from Bayes' theorem that $\Pr(k \mid x_i; \theta, p)$, the probability that agent i is of type k, conditional on having observed $x_i$, is given by

$$\Pr(k \mid x_i; \theta, p) = \frac{p_k f_k(x_i; \theta)}{g(x_i; \theta, p)}. \tag{1}$$

Let $S_K$ denote the (K − 1)-dimensional unit simplex. Using equation (1), it is straightforward to show that if one maximizes the sample log-likelihood, $L(\theta, p) \equiv \sum_i \ln(g(x_i; \theta, p))$, subject to the restriction $p \in S_K$, the maximum likelihood estimate $\hat p_k$ is given by

$$\hat p_k = \frac{1}{I} \sum_{i=1}^{I} \Pr(k \mid x_i; \hat\theta, \hat p). \tag{2}$$

ꢃ

The maximum likelihood estimate ꢇ must solve

I

K

ꢃ

ꢀ

ꢊ lnꢅf ꢅx ꢉꢇꢆꢆ

k

i

ꢃ

(

3)

Prꢅkꢁx ꢉꢇꢂpˆꢆ

= 0ꢂ

i

ꢊꢇ

i=1 k=1

so that

4)

I

K

ꢀ

ꢃ

(

ꢇ = argmax

Prꢅkꢁx ꢉꢇꢂpˆꢆlnꢅf ꢅx ꢉꢇꢆꢆꢀ

i

k

i

ꢇ

i=1 k=1

ꢃ

In other words, ꢇ maximizes the sample average of two different objects: (i) the log of

ꢁ

the unconditionally-type-averaged likelihood ꢅlnꢁ

type-averaged log-likelihood ꢅ

p f ꢅx ꢆꢄꢆ; and (ii) the conditionally-

k

k k i

ꢁ

Prꢅkꢁx ꢆlnꢁf ꢅx ꢆꢄꢆ. The key insight of our paper is that

k

i

k

i

while the ﬁrst object does not support sequential estimation, the second one does.

Equations (1) through (4) suggest the following iterative algorithm, which is a special case of the EM algorithm developed by Dempster, Laird, and Rubin (1977). Suppose that at the beginning of iteration l, the operative value of θ is $\theta^l$ and the operative value of p is $p^l$. In the "E" step, one uses equation (1) to find $\Pr(k \mid x_i; \theta^l, p^l)$. In the "M" step, one uses equations (2) and (4) to find $p^{l+1}$ and $\theta^{l+1}$, respectively. One iterates until convergence.
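The iteration described above can be sketched in a few lines of Python. The example is only an illustration under assumptions that are not in the text: two types, scalar data, unit-variance normal type densities whose means play the role of θ, and hypothetical function names (`e_step`, `m_step`, `em`). The E step applies equation (1); the M step applies equation (2) and, for this particular density, the closed-form solution to equation (4).

```python
import math
import random


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x (an assumed type density f_k)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def e_step(xs, mus, ps):
    """Equation (1): posterior type probabilities Pr(k | x_i; theta, p)."""
    post = []
    for x in xs:
        joint = [p * normal_pdf(x, mu) for p, mu in zip(ps, mus)]
        g = sum(joint)  # mixture likelihood g(x_i; theta, p)
        post.append([j / g for j in joint])
    return post


def m_step(xs, post):
    """Equations (2) and (4); for N(mu_k, 1) types, (4) is a weighted mean."""
    n, K = len(xs), len(post[0])
    ps = [sum(q[k] for q in post) / n for k in range(K)]
    mus = [sum(q[k] * x for q, x in zip(post, xs)) / sum(q[k] for q in post)
           for k in range(K)]
    return mus, ps


def log_likelihood(xs, mus, ps):
    """Sample log-likelihood L(theta, p)."""
    return sum(math.log(sum(p * normal_pdf(x, mu) for p, mu in zip(ps, mus)))
               for x in xs)


def em(xs, mus, ps, tol=1e-8, max_iter=500):
    """Iterate E and M steps until the parameters stop changing."""
    for _ in range(max_iter):
        post = e_step(xs, mus, ps)
        new_mus, new_ps = m_step(xs, post)
        change = max(abs(a - b) for a, b in zip(new_mus + new_ps, mus + ps))
        mus, ps = new_mus, new_ps
        if change < tol:
            break
    return mus, ps


# Simulated data: 60 percent of draws from N(0, 1), 40 percent from N(4, 1).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(600)]
        + [random.gauss(4.0, 1.0) for _ in range(400)])
start_mus, start_ps = [-1.0, 1.0], [0.5, 0.5]
mu_hat, p_hat = em(data, start_mus, start_ps)
```

Each pass through the loop weakly raises the sample log-likelihood, which is why the fixed point of the iteration can be treated as a candidate FIML estimate.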

3. THE EM ALGORITHM WITH A SEQUENTIAL M STEP

Now divide the parameter vector θ into $\theta_1$ and $\theta_2$. Clearly the solution to equation (4) can be found by maximizing across $\theta_1$ and $\theta_2$ simultaneously, or by iterating, using the most recent value of $\hat\theta_1$ to update $\theta_2$ and then using this updated value of $\hat\theta_2$ to recalculate $\theta_1$. For some applications, it is easier to proceed sequentially. Meng and Rubin (1993) call this approach the Expectation-Conditional Maximization (ECM) algorithm, and show that the ECM algorithm retains all of the convergence properties of the EM algorithm.³

A more interesting case occurs when the type-conditional likelihood function can be decomposed as

$$f_k(x_i; \theta_1, \theta_2) = f_{1k}(x_i; \theta_1) \, f_{2k}(x_i; \theta_1, \theta_2),$$

³ As Ruud (1991) points out, one can update $\Pr(k \mid x_i; \hat\theta, \hat p)$ each time either $\theta_1$ or $\theta_2$ is updated. Meng and Rubin (1993) label this the "multi-cycle ECM" algorithm. Also see the discussion in McLachlan and Krishnan (1997).


and $f_{1k}(x_i; \theta_1)$ can be written as a product of type-conditional likelihoods:

$$f_{1k}(x_i; \theta_1) = \prod_{j=1}^{J} f_{1k}(x_{i,j} \mid x_{i,\sim j}; \theta_1), \tag{5}$$

where $x_{i,j}$ and $x_{i,\sim j}$ are mutually exclusive subvectors of $x_i$.

It proves instructive to consider the log-likelihood that arises when K = 1, i.e., there is only one type:

$$L(\theta) = \sum_{i=1}^{I} \left[ \ln(f_1(x_i; \theta_1)) + \ln(f_2(x_i; \theta_1, \theta_2)) \right] = L_1(\theta_1) + L_2(\theta_1, \theta_2).$$

In this case, consistent estimates of $\theta_1$ can be found by maximizing $L_1$, while consistent estimates of $\theta_2$ can be found from maximizing $L_2$, taking as given the estimates of $\theta_1$.⁴ Note that this differs from the ECM approach in that we are not maximizing $f_k(\cdot)$ in steps, but are instead sequentially maximizing two partial likelihoods. While this approach is less efficient than maximizing the log of $f(\cdot)$, it is often much easier to implement, especially when $L_2$ is difficult to evaluate.

2

For example, Rust (1994) considers the maximum likelihood estimator for a Markov decision process,

$$\hat\theta = \arg\max_{\theta} \sum_{i=1}^{I} \ln\left( \prod_{t=1}^{T} P(d_t^i \mid s_t^i; \theta_1, \theta_2) \, \pi(s_t^i \mid s_{t-1}^i, d_{t-1}^i; \theta_1) \right), \tag{6}$$

where $d_t^i$ is agent i's decision vector at time t, and $s_t^i$ is the vector of state variables that characterizes agent i's economic environment at time t. While $\pi(s_t^i \mid \cdot)$ is straightforward to evaluate, $P(d_t^i \mid \cdot)$ requires one to solve a dynamic programming problem. Rust finds that estimating $\theta_1$ as the maximizer of $\sum \ln(\pi(s_t^i \mid s_{t-1}^i, d_{t-1}^i; \theta_1))$ can greatly reduce the number of times $\sum \ln(P(d_t^i \mid s_t^i; \theta_1, \theta_2))$ must be evaluated, which in turn significantly lowers computational cost. Indeed, Rust and Phelan (1997) conclude that "[e]stimation is only feasible using a simpler two-stage estimation procedure[.]"

In the finite mixture case, the log-likelihood is

$$L(\theta, p) = \sum_{i=1}^{I} \ln\left( \sum_{k=1}^{K} p_k f_{1k}(x_i; \theta_1) \, f_{2k}(x_i; \theta_1, \theta_2) \right),$$

which cannot be neatly decomposed into $L_1$ and $L_2$. This seemingly destroys the option of sequential estimation. But with the EM algorithm we work with equation (4), which can be written as

$$(\hat\theta_1, \hat\theta_2) = \arg\max_{\{\theta_1, \theta_2\}} \sum_{i=1}^{I} \sum_{k=1}^{K} \Pr(k \mid x_i; \hat\theta, \hat p) \ln(f_{1k}(x_i; \theta_1)) + \sum_{i=1}^{I} \sum_{k=1}^{K} \Pr(k \mid x_i; \hat\theta, \hat p) \ln(f_{2k}(x_i; \theta_1, \theta_2)).$$

⁴ The asymptotic properties of these sorts of two-step estimators are discussed in Cox (1975) and Amemiya (1978), as well as in the next section.


Once again we can proceed sequentially, using the partial likelihood estimators

$$\tilde\theta_1 = \arg\max_{\theta_1} \sum_{i=1}^{I} \sum_{k=1}^{K} \Pr(k \mid x_i; \tilde\theta, \tilde p) \ln(f_{1k}(x_i; \theta_1)), \tag{7}$$

$$\tilde\theta_2 = \arg\max_{\theta_2} \sum_{i=1}^{I} \sum_{k=1}^{K} \Pr(k \mid x_i; \tilde\theta, \tilde p) \ln(f_{2k}(x_i; \tilde\theta_1, \theta_2)). \tag{8}$$

Applying the EM algorithm in this way introduces an additive separability that allows θ to be estimated sequentially, with each stage using the estimates from the previous stage. Note that the derivative of $f_{2k}(\cdot)$ with respect to $\theta_1$ never has to be calculated. This means that the estimates generated by equations (7) and (8) are less efficient than the FIML estimates, but potentially much easier to compute.
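A small Python sketch may help fix ideas about the sequential M step. The model is purely hypothetical and chosen so that both stages have closed forms: each observation is a pair (z, y); for type k the first density $f_{1k}$ is N($\mu_k$, 1) for z, so that $\theta_1 = (\mu_1, \mu_2)$, and the second density $f_{2k}$ is N($\beta \mu_k$, 1) for y, so that $\theta_2 = \beta$ and $f_{2k}$ depends on both blocks. The function names are likewise assumptions.

```python
import math
import random


def normal_pdf(v, mean):
    """Density of N(mean, 1) at v."""
    return math.exp(-0.5 * (v - mean) ** 2) / math.sqrt(2.0 * math.pi)


def posteriors(zs, ys, mus, beta, ps):
    """E step: Bayes' rule (equation (1)) with the full density f_1k * f_2k."""
    post = []
    for z, y in zip(zs, ys):
        joint = [p * normal_pdf(z, mu) * normal_pdf(y, beta * mu)
                 for p, mu in zip(ps, mus)]
        total = sum(joint)
        post.append([j / total for j in joint])
    return post


def esm(zs, ys, mus, beta, ps, tol=1e-8, max_iter=1000):
    """EM with a sequential M step for the hypothetical two-block model."""
    K, n = len(mus), len(zs)
    for _ in range(max_iter):
        post = posteriors(zs, ys, mus, beta, ps)
        new_ps = [sum(q[k] for q in post) / n for k in range(K)]
        # Stage 1 (equation (7)): theta_1 uses only the weighted f_1k terms,
        # so the dependence of f_2k on mu_k is ignored here.
        new_mus = [sum(q[k] * z for q, z in zip(post, zs)) /
                   sum(q[k] for q in post) for k in range(K)]
        # Stage 2 (equation (8)): theta_2 given the stage-1 estimates.
        num = sum(q[k] * y * new_mus[k]
                  for q, y in zip(post, ys) for k in range(K))
        den = sum(q[k] * new_mus[k] ** 2 for q in post for k in range(K))
        new_beta = num / den
        change = max(abs(a - b) for a, b in
                     zip(new_mus + new_ps + [new_beta], mus + ps + [beta]))
        mus, ps, beta = new_mus, new_ps, new_beta
        if change < tol:
            break
    return mus, beta, ps


# Simulated data: two equally likely types with mu = (1, 3) and beta = 2.
random.seed(1)
types = [0] * 500 + [1] * 500
true_mus, true_beta = [1.0, 3.0], 2.0
zs = [random.gauss(true_mus[k], 1.0) for k in types]
ys = [random.gauss(true_beta * true_mus[k], 1.0) for k in types]
mu_hat, beta_hat, p_hat = esm(zs, ys, mus=[0.0, 4.0], beta=1.5, ps=[0.5, 0.5])
```

In this sketch y also carries information about $\mu_k$, which stage 1 discards; that is the source of the efficiency loss relative to FIML, while the E step still uses the full density to impute types.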

4. ASYMPTOTIC BEHAVIOR OF THE SEQUENTIAL ESTIMATOR

As the review in Section 2 reveals, the EM algorithm is a method for ﬁnding standard

FIML estimates. Our sequential estimator, on the other hand, is not equivalent to FIML.

The asymptotic properties of our estimator can be shown instead by constructing moment

conditions, to which standard GMM results can be applied. In the next section we derive

these moment conditions. In the succeeding section, we discuss conditions that ensure the

parameters of interest are identiﬁed. We ﬁnish our theoretical discussion by showing how

our approach generates a wide class of estimators.

4.1. Moment Conditions

Let starred values denote population parameters. Note first that at the population level

$$(\theta^*, p^*) = \arg\max_{\{\theta,\, p \in S_K\}} E_{x,k}\left( \ln[p_k f_{1k}(x; \theta_1) f_{2k}(x; \theta_1, \theta_2)] \right), \tag{9}$$

with the expectation taken over both k and x. It then follows from the law of total probability that

$$(\theta^*, p^*) = \arg\max_{\{\theta,\, p \in S_K\}} E_x\left( \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \times \ln[p_k f_{1k}(x; \theta_1) f_{2k}(x; \theta_1, \theta_2)] \right), \tag{10}$$

with the latter expectation taken over x alone. This result is the self-consistency property, which dates back to work by R. A. Fisher.⁵
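The step from equation (9) to equation (10) can be spelled out as follows. This is a sketch of the argument rather than a quotation from the text; it writes $f_k = f_{1k} f_{2k}$ and uses the fact that, under the true model, the distribution of k conditional on x is given by equation (1) evaluated at $(\theta^*, p^*)$:

$$E_{x,k}\left( \ln[p_k f_k(x; \theta)] \right) = E_x\left( E_{k \mid x}\left( \ln[p_k f_k(x; \theta)] \right) \right) = E_x\left( \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \ln[p_k f_k(x; \theta)] \right).$$

The weights are evaluated at the population values, not at the arguments of the maximization, which is what makes the objective additively separable in $\ln p_k$, $\ln f_{1k}$, and $\ln f_{2k}$.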

It immediately follows from equation (10) that $\theta_2^*$ solves

$$\max_{\theta_2} E_x\left( \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \ln[f_{2k}(x; \theta_1^*, \theta_2)] \right),$$

⁵ See the discussion in Efron (1982) and McLachlan and Krishnan (1997).


the population analog to equation (8). The first-order condition for this problem is

$$E_x\left( \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \, \frac{\partial \ln(f_{2k}(x; \theta^*))}{\partial \theta_2} \right) = 0.$$

The population analog to equation (2) can be constructed in a similar fashion.

Since $f_{1k}(x_i; \theta_1)$ is a type-conditional likelihood in its own right (recall equation (5)), $\theta_1^*$ must solve⁶

$$\max_{\theta_1} E_x\left( \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \ln[f_{1k}(x; \theta_1)] \right),$$

the population analog to equation (7). The associated first-order condition is

$$E_x\left( \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \, \frac{\partial \ln(f_{1k}(x; \theta_1^*))}{\partial \theta_1} \right) = 0.$$

The population moment conditions for θ and p are thus

$$E_x \begin{pmatrix} \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \, \dfrac{\partial \ln(f_{1k}(x; \theta_1^*))}{\partial \theta_1} \\ \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \, \dfrac{\partial \ln(f_{2k}(x; \theta^*))}{\partial \theta_2} \\ \Pr(1 \mid x; \theta^*, p^*) - p_1^* \\ \vdots \\ \Pr(K \mid x; \theta^*, p^*) - p_K^* \end{pmatrix} = 0, \tag{11}$$

with $\Pr(k \mid x; \theta^*, p^*)$ given by equation (1). Then it follows from standard arguments (see Hansen (1982) or Newey and McFadden (1994)) that, subject to the usual regularity conditions, $\tilde\theta_1$, $\tilde\theta_2$, and $\tilde p$ are consistent and asymptotically normal, with the variance-covariance matrix given by the standard method-of-moments formula. Note that even though $\theta_1$ and $\theta_2$ can be estimated sequentially, finding standard errors requires evaluating all the moment conditions together.⁷ Equation (11) also reveals that the sequential estimator will not be as efficient as FIML, for

$$\frac{\partial \ln(g(x; \theta^*, p^*))}{\partial \theta_1} = \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \left[ \frac{\partial \ln(f_{1k}(x; \theta_1^*))}{\partial \theta_1} + \frac{\partial \ln(f_{2k}(x; \theta^*))}{\partial \theta_1} \right],$$

which means that the first element of the moment vector in equation (11) is not part of the score vector for the FIML function, even though the remaining elements are.

⁶ Also see Cox's (1975) discussion of partial likelihood.

⁷ Rust (1994) discusses this issue in some detail for the one-type case.

4.2. Asymptotic Identification

Consistency and asymptotic normality require that an estimator satisfy regularity con-

ditions of the sort set forth by Newey and McFadden (1994). Of these the most important

is asymptotic identiﬁcation. One approach for achieving identiﬁcation is to assume that

the moment conditions given by equation (11) are satisfied only by the parameter vector $(\theta^*, p^*)$. Given that mixture likelihoods are often not globally concave, we also consider an alternative approach. In particular, we assume that the expectation of the log-likelihood function is uniquely maximized at $(\theta^*, p^*)$ and characterize the moment conditions listed in equation (11) as features of this optimum.⁸

Wu (1983) shows that the EM algorithm converges to ﬂat points on the likelihood

surface, so that the EM solution yielding the highest likelihood value can be taken as the

maximum likelihood estimate. One can see this heuristically by considering equations (2)

and (4). While our sequential estimator is not a reformulation of the FIML estimator,

it can nonetheless be used in a similar way. In particular, one can apply the likelihood criterion when the sample analog to equation (11) has multiple solutions.⁹ Although this does not yield FIML estimates (equation (11) is not the FIML score), using a likelihood tiebreaker ensures consistency. We provide a formal proof of consistency in the Appendix, using arguments that apply to almost any GMM estimator.¹⁰

It is worth reiterating that even if the population likelihood function has a unique

maximum, FIML estimation can require one to compare numerous local extrema on the

sample likelihood surface. If the ESM algorithm yields multiple ﬁxed points, it is likely

to be the case that a gradient-based FIML search will yield multiple solutions as well.

In either case, a likelihood tiebreaker will have to be applied. The difference is that in

the searches before the tiebreaker is applied, the sequential estimator can be much less

computationally demanding.

To this point, we have focused on how the ESM algorithm generates a sequential alter-

native to the FIML estimator. A different approach is to use the ESM algorithm to gen-

erate initial values for a FIML search. Using the ESM algorithm in this way allows one

to enjoy some of the cost savings of sequential estimation without losing asymptotic efﬁ-

ciency. A particularly interesting possibility is to utilize the ESM algorithm as a search

routine in nonparametric maximum likelihood, in a way similar to how Follmann and Lambert (1989) combine the EM and quasi-Newton algorithms. Such an approach extends the benefits of sequential estimation to cases where the number of types (K) is not known.¹¹

⁸ In assuming uniqueness, we are imposing several normalizations. Titterington, Smith, and Makov (1985) discuss exact conditions for identifying finite mixture models.

⁹ In choosing this way, one must take care to restrict oneself to stationary solutions. It is well known, for example, that one can drive the sample log-likelihood of a normal mixture to infinity by assuming that one of the observations belongs to its own zero-variance type.

¹⁰ An interesting result from this section is that to ensure consistency, one has to consider local as well as global minima of the GMM criterion function generated by equation (11).

¹¹ We are grateful to a referee for this suggestion. As described by Heckman and Singer (1984) and Follmann and Lambert (1989), when K is unknown one proceeds by finding FIML estimates with successively larger values of K until, roughly speaking, the derivative of the likelihood function with respect to K is nonpositive. A topic we do not explore here is whether ESM estimates can fully replace FIML estimates in this computationally intensive procedure, or can serve only as starting values.


As Rust (1994) points out, yet another way to recover asymptotic efﬁciency is to use

the sequential estimator as the basis for a one-step estimator: starting with the sequential

estimates, one can take one Gauss-Newton step with the full likelihood function.
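To make the one-step idea concrete, one common way of writing such a step is given below. The notation is an assumption introduced here for illustration: $\tilde s$ stacks the sequential estimates $(\tilde\theta, \tilde p)$ with p reparameterized to its K − 1 free elements, $\hat s^{(1)}$ labels the one-step estimate, and the Hessian is approximated by the outer product of the individual scores of the full mixture likelihood:

$$\hat s^{(1)} = \tilde s + \left[ \sum_{i=1}^{I} \frac{\partial \ln g(x_i; \tilde s)}{\partial s} \frac{\partial \ln g(x_i; \tilde s)}{\partial s'} \right]^{-1} \sum_{i=1}^{I} \frac{\partial \ln g(x_i; \tilde s)}{\partial s}.$$

Only one evaluation of the full score is needed, so the expensive joint likelihood is touched once rather than at every iteration of a search.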

4.3. Generalizations of the Sequential Estimator

Our approach extends in a very straightforward way to general moment conditions in

mixture models. In the interest of brevity, we continue to work with ﬁnite mixtures, but

extensions to general mixtures or regime-switching models are straightforward.

As before, it follows from the law of total probability that any function $h(x, k; \theta)$ that satisfies

$$E_{x,k}(h(x, k; \theta^*)) = 0$$

also satisfies

$$E_x\left( \sum_{k=1}^{K} \Pr(k \mid x; \theta^*, p^*) \, h(x, k; \theta^*) \right) = 0. \tag{12}$$

Equation (12) provides a basis for estimation. There is a long tradition of analyzing mixture distributions with classical method of moments estimators;¹² we have simply extended

the classical approach to general moment conditions. As in the motivating case of sequen-

tial likelihood, some of these alternative conditions might be less computationally demand-

ing than the likelihood equations. It is also straightforward to construct overidentiﬁcation

tests.

By way of example, consider the following linear regression model:

$$y_i = x_i' b_k^* + e_i,$$

where: $x_i$ is an M-element random vector; $b_k$ is a parameter vector; and $e_i$ is a standard logistic random variable that is independent of $x_i$ and k. As before i indexes observation and k denotes observation i's unobserved type. Let θ denote the collection of b's. Note that

$$E_{y,x}\left( \Pr(k \mid y, x; \theta^*, p^*) \, x [y - x' b_k^*] \right) = 0, \qquad k \in \{1, \ldots, K\}.$$

Following Kiefer (1980), under random sampling the sample analog to this equation can be found using weighted least squares, where the weighting matrix $\hat W_k$ is a diagonal matrix whose ith element is $\sqrt{\Pr(k \mid y_i, x_i; \hat\theta, \hat p)}$. As before, one can proceed iteratively, estimating $\hat b_k$ with the sample matrices $\hat W_k X$, $\hat W_k y$, and using these estimates to update $\hat W_k$.
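The iterative weighted least squares procedure just described can be sketched as follows. The example is a simplification under stated assumptions: a single regressor (M = 1) with no intercept, two types, simulated data, and hypothetical function names. With a scalar regressor, regressing $\hat W_k y$ on $\hat W_k X$ reduces to a ratio of weighted sums.

```python
import math
import random


def logistic_pdf(e):
    """Standard logistic density, written to avoid overflow for large |e|."""
    a = math.exp(-abs(e))
    return a / (1.0 + a) ** 2


def type_weights(xs, ys, bs, ps):
    """Pr(k | y_i, x_i; theta, p) by Bayes' rule."""
    out = []
    for x, y in zip(xs, ys):
        joint = [p * logistic_pdf(y - x * b) for p, b in zip(ps, bs)]
        total = sum(joint)
        out.append([j / total for j in joint])
    return out


def fit(xs, ys, bs, ps, tol=1e-10, max_iter=500):
    """Alternate between updating the weights and weighted least squares."""
    n, K = len(xs), len(bs)
    for _ in range(max_iter):
        w = type_weights(xs, ys, bs, ps)
        # Weighted least squares for each type (scalar-regressor case).
        new_bs = [sum(wi[k] * x * y for wi, x, y in zip(w, xs, ys)) /
                  sum(wi[k] * x * x for wi, x in zip(w, xs)) for k in range(K)]
        new_ps = [sum(wi[k] for wi in w) / n for k in range(K)]
        change = max(abs(a - b) for a, b in zip(new_bs + new_ps, bs + ps))
        bs, ps = new_bs, new_ps
        if change < tol:
            break
    return bs, ps


def logistic_draw():
    """Inverse-CDF draw from the standard logistic distribution."""
    u = min(max(random.random(), 1e-12), 1.0 - 1e-12)
    return math.log(u / (1.0 - u))


# Simulated data: 40 percent of observations have slope -2, the rest slope 3.
random.seed(3)
true_b = [-2.0, 3.0]
types = [0] * 400 + [1] * 600
xs = [random.uniform(2.0, 6.0) for _ in types]
ys = [x * true_b[k] + logistic_draw() for x, k in zip(xs, types)]
b_hat, p_hat = fit(xs, ys, bs=[-1.0, 1.0], ps=[0.5, 0.5])

# Sample analog of the moment condition, evaluated at the estimates.
w_hat = type_weights(xs, ys, b_hat, p_hat)
moments = [sum(wi[k] * x * (y - x * b_hat[k])
               for wi, x, y in zip(w_hat, xs, ys)) / len(xs) for k in range(2)]
```

At convergence the entries of `moments` are numerically zero, which is the sample counterpart of the orthogonality condition above; no likelihood in $b_k$ is ever maximized, only the type probabilities use the logistic density.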

k

5

ꢀ simulations

Two questions remain. First, are there common cases where the sequential M step

results in signiﬁcant savings in computational time? Second, since the two-step estimator

described above is not efﬁcient, how much information is lost by using it? To address these

issues, we perform a Monte Carlo simulation with a dynamic discrete choice problem.

Even for this relatively simple problem, the computational gains are quite large, with little

loss of information.

¹² See Everitt and Hand (1981), and Titterington, Smith, and Makov (1985).

5.1. The Model

The model we use in our Monte Carlo exercise is one of sequential decision-making

because, as discussed above, sequential estimation works particularly well with models of

dynamic choice. The model we simulate is similar in spirit to Cameron and Heckman (2001).¹³ In each of three periods, individuals decide whether to continue their education. In the fourth period individuals receive earnings. Earnings depend on education, observable characteristics, a random shock, and an individual's unobserved type. Different types have different labor market abilities, and have different preferences over education itself. Individuals face uncertainty over both the pecuniary and nonpecuniary returns to education. As time passes, individuals receive new information that allows them to reduce this uncertainty.¹⁴

In the absence of type-based differences, the likelihood function for education choices

and earnings generated by this model resembles the likelihood function in equation (6)

and can be estimated in a similar sequential fashion. This will yield consistent estimates

of, among other things, the returns to college, $\alpha_C$. But with unobservable type-based differences, estimates of $\alpha_C$ will be biased upwards (and inconsistent) unless the estimates

account for type-based selection. The goal of the Monte Carlo exercise is to see whether

the ESM algorithm can account for selection, by estimating the mixture model, more

quickly and as accurately as FIML.

5.2. Simulation Results

All of the simulations were conducted in MATLAB, using MATLAB’s “fminunc” opti-

mization package. The number of individuals is ﬁxed at 3000 and the number of types

is two. Crucial to the calculation time is the number of points that are used to approxi-

mate the distribution of new information. An increase in the number of points leads to

more complicated expectations and a larger computational burden. For the simulations,

we approximate the distributions of unknown state variables with 10-point discrete dis-

tributions. This discretization is applied to two unknown state variables at t = 1 and one

unknown state variable at t = 2.

The model is estimated 100 times using four different methods. First, we estimate

the model with the complete data, where each individual’s type is observed. Second, we

estimate the model with incomplete data, where type is unobserved, and pretend that

there is no selection problem. We then control for unobserved types by estimating the

mixture model, ﬁrst with FIML and then with the ESM algorithm. We do not report

estimates for the EM algorithm itself, as it was substantially slower than FIML.

As we are primarily interested in how well the various approaches to estimating the

mixture distribution mitigate the selection problem, we only report the coefficient on the return to college, $\alpha_C$.¹⁵ The key feature of the model is that the estimates of $\alpha_C$ are biased upwards from the population value of 0.2 (and inconsistent) unless the estimates account for selection based upon type. We also report the standard deviation of the estimated returns and the mean squared difference between the estimates and the true value of $\alpha_C$.

To get a sense of speed, we record the number of ﬂoating point operations (FLOPs) the

¹³ The model also resembles Arcidiacono's (2002) model of application, college, and major choice.

¹⁴ A detailed description of the model, the parameters of the data generating process, and the starting values for the optimization routines is in a simulation appendix, which can be downloaded from http://www.econ.duke.edu/~psarcidi/simulation.pdf.

¹⁵ All of the approaches produced similar estimates for the other coefficients.


TABLE I
Simulation Results

                                        Estimation Method
                                Complete   Incomplete   FIML     ESM
Mean ($\hat\alpha_C$)           0.2078     0.2932       0.2255   0.2226
Standard Deviation ($\hat\alpha_C$)  0.0330     0.0323       0.0496   0.0565
Mean Squared Error (×100)       0.1141     0.9731       0.3082   0.3667
(FIML FLOPs)/(ESM FLOPs)                                         22.48

Note: Each simulation was conducted 100 times with 3000 observations. The distributions of unknown state variables were approximated with 10-point discrete distributions. Mean squared error refers to the squared differences between estimates of $\alpha_C$ and its true value of 0.2.

various algorithms took to converge.¹⁶ We then report the ratio of FIML FLOPs to ESM FLOPs.

Table I presents the simulation results. As expected, not controlling for the selection

problem yields estimates of $\alpha_C$ that are too high relative to the complete data estimates.

Using either FIML or ESM to estimate the mixture model yields estimates much closer to

those found when individuals’ types were observed. Moving from FIML to ESM increases

the standard deviation for $\hat\alpha_C$ and the mean squared error, both by less than twenty

percent. The last line in Table I shows that this relatively small loss of precision leads to

large gains in speed: the ESM algorithm improves the rate of convergence by a factor of

roughly twenty. To see the rate at which adding states affects computational time, we also

perform the simulation with ﬁve and seven states for each of the discretized state variables.

Figure 1 graphs the number of FLOPS for FIML and ESM. While the computational

gains are large for ﬁve states (over ten times as fast), it is clear that the gains increase

with the number of states.

5.3. Discussion

It is worth stressing that the ESM algorithm requires little researcher time: program-

ming the algorithm can be very easy. In the algorithm’s simplest form, all that one needs

is to save the full density functions so that one can estimate the type probabilities by

Bayes’ rule. One otherwise uses the same estimators as in the nonmixture case, except

that the data are weighted by the imputed type probabilities. Hence, adding decisions or

state variables has very little effect on the time spent programming the ESM algorithm.

In general, the ESM algorithm is easier to program than FIML. Because the simulations

behind Table I employ the simplest version of the ESM algorithm, the substantial savings

in computational time come with savings in programming time as well.

There are, however, at least three reasons to believe that the estimates in Table I give

lower bounds on the computational savings from the ESM algorithm. First, in our current

optimization routine, the estimate of the Hessian at each stage is re-initialized at the

beginning of each ESM maximization step. Hence, all the updating of the Hessians that

occurs while maximizing the type-conditional log-likelihood functions is lost. Changing the

¹⁶ Jamshidian and Jennrich (1997) use this measure of speed in their study of enhancements to the EM algorithm. An advantage of using FLOPs is that we are able to run simulations on multiple computers of varying clock speeds and still have a consistent measure of speed.


Figure 1.—Number of FLOPs as a function of the number of states.

optimization code to carry estimated Hessians across ESM iterations could substantially

reduce convergence times.

Second, the convergence criteria we used at the maximization step did not depend upon

how close the ESM algorithm was to converging. Precise maximization is not necessary

when the ESM algorithm is far from the optimum. Setting the convergence criteria at the

maximization step to be a function of the changes in the conditional probabilities and the

likelihoods should speed up convergence.

Third, some of the work in the statistics literature on accelerating the EM algorithm can be applied here. An iteration in the EM (or ESM) algorithm uses $\theta^l$ and $p^l$ to find $\Pr(k \mid x_i; \theta^l, p^l)$, $p^{l+1}$, and $\theta^{l+1}$. This can be described as

$$(\theta^{l+1}, p^{l+1}) = G(\theta^l, p^l), \tag{13}$$

where Gꢅ·ꢆ is the vector-valued function given by an EM iteration. It is then easy to see

that the EM estimates are ﬁxed points in the nonlinear system given by equation (13).

Given that the EM algorithm proceeds iteratively, there are potential speed gains if one

treats the EM estimate as a zero of a system of nonlinear equations, and uses more

sophisticated solution routines to ﬁnd these zeros. Jamshidian and Jennrich (1997) show

that using quasi-Newton methods to solve equation (13) can accelerate convergence of

the EM algorithm, sometimes dramatically.
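The fixed-point view in equation (13) can be illustrated with a deliberately simple case. The sketch below is not the quasi-Newton method of Jamshidian and Jennrich (1997); it swaps in Steffensen's method (Aitken's delta-squared extrapolation), which applies to a scalar map. The setting is an assumption made for illustration: two normal types with known means, so that the only unknown is the share p of the first type and G(p) is a single EM update of that share.

```python
import math
import random


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def make_em_map(xs, mu_a, mu_b):
    """Return G(p): one EM update of the type-a share, with known means."""
    fa = [normal_pdf(x, mu_a) for x in xs]
    fb = [normal_pdf(x, mu_b) for x in xs]

    def G(p):
        return sum(p * a / (p * a + (1.0 - p) * b)
                   for a, b in zip(fa, fb)) / len(xs)
    return G


def plain_iteration(G, p, tol=1e-12, max_iter=10000):
    """Ordinary EM: apply G repeatedly; returns (estimate, evaluations)."""
    for n_eval in range(1, max_iter + 1):
        p_new = G(p)
        if abs(p_new - p) < tol:
            return p_new, n_eval
        p = p_new
    return p, max_iter


def steffensen(G, p, tol=1e-12, max_iter=100):
    """Solve G(p) = p with Aitken extrapolation; returns (estimate, evaluations)."""
    evals = 0
    for _ in range(max_iter):
        p1 = G(p)
        p2 = G(p1)
        evals += 2
        denom = p2 - 2.0 * p1 + p
        if abs(p1 - p) < tol or abs(denom) < 1e-15:
            return p2, evals
        p_acc = p - (p1 - p) ** 2 / denom
        # Keep the extrapolated share strictly inside the unit interval.
        p = min(max(p_acc, 1e-6), 1.0 - 1e-6)
    return p, evals


# Simulated data: 30 percent of draws from N(0, 1), 70 percent from N(3, 1).
random.seed(2)
xs = ([random.gauss(0.0, 1.0) for _ in range(300)]
      + [random.gauss(3.0, 1.0) for _ in range(700)])
G = make_em_map(xs, 0.0, 3.0)
p_plain, n_plain = plain_iteration(G, 0.5)
p_fast, n_fast = steffensen(G, 0.5)
```

Both routines target the same fixed point of G; comparing `n_plain` with `n_fast` gives a rough sense of how many evaluations of the EM map each approach needs in this toy problem.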


6. CONCLUSION

This paper provides a simple way to add unobserved heterogeneity to models that, in

the absence of such heterogeneity, could be estimated sequentially. In particular, if one

assumes that the data are drawn from a ﬁnite mixture distribution, the EM algorithm con-

tains a step where one maximizes an additively separable type-conditional log-likelihood

function. Hence, one can control for unobserved heterogeneity even in problems where

the parameters are most simply estimated in stages. Although our ESM algorithm does

not yield FIML estimates, it is asymptotically well-behaved—in fact the ESM estimator

introduces a broad class of GMM-type estimators. Simulation results show that the ESM

algorithm performs very well, with substantial computational savings and little loss of

information.

Dept. of Economics, Duke University, 305 Social Sciences Building, Durham, NC 27708-0097, USA; psarcidi@econ.duke.edu; http://www.econ.duke.edu/~psarcidi

and

Dept. of Economics, University at Albany, State University of New York, BA-110, Albany, NY 12222, USA; jbjones@albany.edu; http://www.albany.edu/~jbjones

Manuscript received September, 2000; ﬁnal revision received July, 2002.

APPENDIX: Consistency with Weak Identiﬁcation

We begin with some notation. Assume that we have an i.i.d. sample of x's of size I. Let $s = \{\theta, p\} \in S \subset \mathbb{R}^{M+K}$ denote a parameter vector, with $s^*$ denoting the population value of $s$ and $\hat{s}$ denoting a sample estimate. Let $Q(s)$ denote the negative of a GMM criterion function, such as the one behind the sequential estimator, and let $Q_I(s)$ denote the sample analog of $Q(\cdot)$. Similarly, let $L(s) \equiv E(\ln(g(x; s)))$ and $L_I(s)$ denote the population and sample expectations of the log-likelihood function. Let $NM(Q, S)$ denote the consistency conditions used in Theorem 2.1 of Newey and McFadden (1994; henceforth NM), when applied to the function $Q(\cdot)$ and the parameter space $S$. The conditions are: (i) $Q(s)$ is uniquely maximized at $s^*$; (ii) $S$ is compact; (iii) $Q(s)$ is continuous; and (iv) $Q_I(s)$ converges uniformly in probability to $Q(s)$.

We will consider $Q(s_l)$ to be a local maximum if there exists a closed ball $B_\varepsilon(s_l)$, $\varepsilon > 0$, in $S$ such that $Q(s_l)$ is a maximum over $B_\varepsilon(s_l)$. Let $S_\ell$ denote the set of local maximizers of $Q(s)$ over the entire space $S$, and let $\hat{S}_\ell$ denote the analogous set generated by the sample analog $Q_I(s)$. Let $S_m$ denote a closed subset of $S$. Let $s_m^*$ denote the population maximizer of $Q(s)$ over $S_m$, and let $\hat{s}_m$ denote the sample maximizer of $Q_I(s)$ over $S_m$. In the proof below, $s_m^*$ will be the local maximizer of $Q(s)$ that equals the likelihood parameter vector $s^*$.

Recall that the motivating problem is a lack of identification: even if it were restricted to global maximizers, the set $S_\ell$ would have multiple elements. The approach suggested in the text was to pick $\hat{s}$ as the element of $\hat{S}_\ell$ that maximizes $L_I(s)$; we term this the "tiebreaker" estimator. We now show consistency.

Theorem 1: Suppose that:

(i) conditions $NM(Q, S_m)$ hold (local GMM regularity);

(ii) $s_m^*$ lies in the interior of $S_m \subseteq S$ (interiority);

(iii) conditions $NM(L, S)$ hold (global MLE regularity);

(iv) $\ln(g(x; s^*))$ and $S_m$ satisfy the conditions of Newey and McFadden's (1994) Lemma 4.3 (local MLE regularity);

(v) $s_m^* = s^*$, the maximizer of $L(s)$ (cross-identification);

(vi) $\hat{s} = \arg\max_{s \in \hat{S}_\ell} L_I(s)$ (tiebreaking estimate).

Then $\hat{s} \xrightarrow{p} s^*$.


Proof: The proof proceeds in two steps: (a) $L_I(\hat{s}_m) \xrightarrow{p} L(s_m^*)$; and (b) convergence of $L_I(\hat{s}_m)$ implies convergence of $\hat{s}$. In the interest of brevity, we assume that all measurability conditions are satisfied. (See also the discussion of NM's Theorem 2.1.)

To get step (a), note that by condition (i) and NM's Theorem 2.1, $\hat{s}_m \xrightarrow{p} s_m^*$. Then it follows from condition (iv) and NM's Lemma 4.3 that $L_I(\hat{s}_m) \xrightarrow{p} L(s_m^*)$.

It remains to show convergence of $\hat{s}$. It follows from consistency of $\hat{s}_m$ and condition (ii) that with probability approaching 1 (w.p.a.1) $\hat{s}_m$ is in the interior of $S_m$ and thus a local maximizer, so that w.p.a.1 $\hat{s}_m \in \hat{S}_\ell$. We now proceed as in NM's Theorem 2.1. For all $\varepsilon > 0$, we have w.p.a.1: (1) $L(\hat{s}) > L_I(\hat{s}) - \varepsilon/3$ (from condition (iii)); (2) $L_I(\hat{s}) > L_I(\hat{s}_m) - \varepsilon/3$ (by condition (vi) and $\hat{s}_m \in \hat{S}_\ell$); (3) $L_I(\hat{s}_m) > L(s_m^*) - \varepsilon/3$ (from step (a)). Together these conditions imply that w.p.a.1 $L(\hat{s}) > L(s_m^*) - \varepsilon$. But by condition (v) $s_m^* = s^*$. It then follows from condition (iii) and arguments in NM's Theorem 2.1 that $\hat{s} \xrightarrow{p} s^*$.

Q.E.D.
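For readers tracing the final step, the three inequalities can be chained explicitly. The display below only restates the argument of the proof in one line, writing $\hat{s}_m$ for the sample maximizer of the GMM criterion over $S_m$, $s_m^*$ for its population counterpart, and $s^*$ for the likelihood parameter vector; each inequality holds with probability approaching 1.

```latex
\begin{align*}
L(\hat{s}) &> L_I(\hat{s}) - \varepsilon/3        && \text{by (1)} \\
           &> L_I(\hat{s}_m) - 2\varepsilon/3     && \text{by (2)} \\
           &> L(s_m^*) - \varepsilon              && \text{by (3)} \\
           &= L(s^*) - \varepsilon                && \text{by condition (v).}
\end{align*}
```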

Unless one of the local GMM maximizers is also the maximum likelihood estimator, it is essential to include local as well as global maximizers in the set $\hat{S}_\ell$. This can be illustrated with a simple example. Suppose that $S$ can be partitioned into two disjoint compact subsets, $S_1$ and $S_2$, and that in addition to satisfying the conditions for Theorem 1, the maximizers of each subset, $s_1^*$ and $s_2^*$, are global maximizers of $Q(s)$ over $S$. Suppose further that $s_1^*$ is also the MLE maximizer $s^*$. Finally, suppose that over $S_1$, $Q_I(s) = Q(s) - 1/I$, while over $S_2$, $Q_I(s) = Q(s) + 1/I$. It immediately follows that $\hat{s}_1 = s_1^*$ and $\hat{s}_2 = s_2^*$, and the proof of Theorem 1 goes through. But $Q_I(\hat{s}_1) < Q_I(\hat{s}_2)$, so that a search over global maxima would exclude $\hat{s} = s^*$.
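A small numerical version of this example may help. The sketch below is purely illustrative: the one-dimensional functions `Q`, `Q_I`, and `L_I`, the grids `S1` and `S2`, and the sample size are all hypothetical choices that mimic the structure of the example, not objects from the paper. It compares the tiebreaker applied to all local maximizers of the sample criterion with the tiebreaker applied to global maximizers only.

```python
# Hypothetical one-dimensional illustration; these functions are stand-ins.
I = 100  # sample size (assumed)

S1 = [k / 1000 for k in range(-2000, -499)]  # grid on [-2, -0.5]
S2 = [k / 1000 for k in range(500, 2001)]    # grid on [0.5, 2]


def Q(s):
    """Population criterion with two global maximizers, s = -1 and s = 1."""
    return -((s * s - 1.0) ** 2)


def Q_I(s):
    """Sample criterion: shifted down by 1/I on S1 and up by 1/I on S2."""
    return Q(s) - 1.0 / I if s < 0 else Q(s) + 1.0 / I


def L_I(s):
    """Stand-in for the sample log-likelihood, maximized at s* = -1."""
    return -((s + 1.0) ** 2)


# Local maximizers of Q_I: the best point within each subset.
s_hat_1 = max(S1, key=Q_I)
s_hat_2 = max(S2, key=Q_I)
local_maximizers = [s_hat_1, s_hat_2]

# Global maximizers of Q_I: only those attaining the overall maximum.
q_max = max(Q_I(s) for s in local_maximizers)
global_maximizers = [s for s in local_maximizers if Q_I(s) == q_max]

# Tiebreaker: choose the candidate with the highest sample log-likelihood.
tiebreak_local = max(local_maximizers, key=L_I)
tiebreak_global = max(global_maximizers, key=L_I)
print(tiebreak_local, tiebreak_global)  # prints "-1.0 1.0"
```

Searching over local maximizers recovers the likelihood maximizer at -1, while restricting the search to global maximizers of the sample criterion leaves only the point at 1.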

The conditions for the proof apply naturally to the sequential estimator developed in the main text. $Q(s)$ is the negative of the inner product of the expectation vector in equation (11), and $Q_I$ is its sample analog. Condition (v) (cross-identification) follows from the construction of equation (11). Since any solution to equation (11) will be a zero of $Q(s)$, the sequential estimator is a local maximizer of $Q(s)$. One potential difficulty is that, as noted by Wu (1983), some mixture problems lack a compact parameter space.

REFERENCES

Amemiya, T. (1978): “On a Two-step Estimation of a Multivariate Logit Model,” Journal of Econometrics, 8, 13–21.

Arcidiacono, P. (2002): “Affirmative Action in Higher Education: How Do Admissions and Financial Aid Rules Affect Future Earnings?” Manuscript, Duke University.

Cameron, S., and J. Heckman (1998): “Life Cycle Schooling and Dynamic Selection Bias: Models

and Evidence for Five Cohorts of American Males,” Journal of Political Economy, 106, 262–333.

Cameron, S., and J. Heckman (2001): “The Dynamics of Educational Attainment for Black, Hispanic, and White Males,” Journal of Political Economy, 109, 455–499.

Cox, D. R. (1975): “Partial Likelihood,” Biometrika, 62, 269–275.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977): “Maximum Likelihood from Incomplete

Data via the EM Algorithm,” Journal of the Royal Statistical Society, B, 39, 1–38.

Eckstein, Z., and K. Wolpin (1999): “Why Youths Drop Out of High School: The Impact of

Preferences, Opportunities and Abilities,” Econometrica, 67, 1295–1339.

Efron, B. (1982): “Maximum Likelihood and Decision Theory,” Annals of Statistics, 10, 323–339.

Everitt, B. S., and D. J. Hand (1981): Finite Mixture Distributions. London: Chapman and Hall.

Follmann, D. A., and D. Lambert (1989): “Generalized Logistic Regression by Nonparametric

Mixing,” Journal of the American Statistical Association, 84, 295–300.

Hamilton, J. D. (1989): “A New Approach to the Economic Analysis of Nonstationary Time Series

and the Business Cycle,” Econometrica, 57, 357–385.

Hamilton, J. D. (1990): “Analysis of Time Series Subject to Changes in Regime,” Journal of Econometrics, 45, 39–70.

Hansen, L. (1982): “Large Sample Properties of Generalized Method of Moments Estimators,”

Econometrica, 50, 1029–1054.


Heckman, J., and B. Singer (1984): “A Method for Minimizing the Impact of Distributional

Assumptions in Econometric Models for Duration Data,” Econometrica, 52, 271–320.

Jamshidian, M., and R. Jennrich (1997): “Acceleration of the EM Algorithm by Using Quasi-

Newton Methods,” Journal of the Royal Statistical Society, B, 59, 569–587.

Keane, M., and K. Wolpin (1997): “The Career Decisions of Young Men,” Journal of Political

Economy, 105, 473–522.

Kiefer, N. (1980): “A Note on Switching Regressions and Logistic Discrimination,” Econometrica,

48, 1065–1069.

Laird, N. (1978): “Nonparametric Maximum Likelihood Estimation of a Mixing Distribution,” Journal of the American Statistical Association, 73, 805–811.

Lindsay, B. (1983): “The Geometry of Mixture Likelihoods: A General Theory,” Annals of Statistics, 11, 86–94.

McLachlan, G. J., and T. Krishnan (1997): The EM Algorithm and Extensions. New York: John

Wiley and Sons.

McLachlan, G. J., and D. Peel (2000): Finite Mixture Models. New York: John Wiley and Sons.

Meng, X., and D. B. Rubin (1993): “Maximum Likelihood Estimation via the ECM Algorithm:

A General Framework,” Biometrika, 80, 267–278.

Mroz, T. A. (1999): “Discrete Factor Approximations in Simultaneous Equation Models: Estimating

the Impact of a Dummy Endogenous Variable on a Continuous Outcome,” Journal of Econometrics,

92, 233–274.

Newey, W. K., and D. McFadden (1994): “Large Sample Estimation and Hypothesis Testing,” in

Handbook of Econometrics, Volume 4, ed. by R. F. Engle and D. L. McFadden. Amsterdam: North

Holland, 2113–2245.

Rust, J. (1994): “Structural Estimation of Markov Decision Processes,” in Handbook of Econometrics, Volume 4, ed. by R. F. Engle and D. L. McFadden. Amsterdam: North Holland, 3081–3143.

Rust, J., and C. Phelan (1997): “How Social Security and Medicare Affect Retirement Behavior

in a World of Incomplete Markets,” Econometrica, 65, 781–831.

Ruud, P. A. (1991): “Extensions of Estimation Methods Using the EM Algorithm,” Journal of

Econometrics, 49, 305–341.

Titterington, D. M., A. F. M. Smith, and U. E. Makov (1985): Statistical Analysis of Finite

Mixture Distributions. New York: John Wiley and Sons.

Wu, C. F. (1983): “On the Convergence Properties of the EM Algorithm,” Annals of Statistics, 11,

95–103.


DOI: 10.1111/1468-0262.00431
