
Received: 4 February 2019

DOI: 10.1002/qre.2557

Revised: 10 May 2019

Accepted: 30 July 2019

RESEARCH ARTICLE

Efficient global monitoring statistics for high-dimensional data

Jun Li

Department of Statistics, University of

California, Riverside, California

Abstract

Global monitoring statistics play an important role in developing efficient monitoring schemes for high-dimensional data. A number of global monitoring statistics have been proposed in the literature. However, most of them only work for certain types of abnormal scenarios under specific model assumptions. How to develop global monitoring statistics that are powerful for any abnormal scenarios under flexible model assumptions is a long-standing problem in the statistical process monitoring field. To provide a potential solution to this problem, we propose a novel class of global monitoring statistics. Our proposed global monitoring statistics are easy to calculate and can work under flexible model assumptions since they can be built on any local monitoring statistic that is suitable for monitoring a single data stream. Our simulation studies show that the proposed global monitoring statistics perform well across a broad range of settings and compare favorably with existing methods.

Correspondence

Jun Li, Department of Statistics,

University of California, Riverside,

Riverside, CA 92521, U.S.A.

Email: jun.li@ucr.edu

KEYWORDS

CUSUM, process change detection, quantile vs quantile, statistical process control

1 INTRODUCTION

Advanced manufacturing and data acquisition technologies have made the gathering of high-dimensional data possible

in many fields. The demand for efficient online monitoring tools for such data has never been greater. Depending on the

purpose, two types of monitoring schemes are needed.

The first type is for applications in which changes in any of the data streams indicate the same abnormality in the whole

system and require a single, uniform corrective action. As a result, those applications do not require the identification

of abnormal data streams. For example, in the environmental monitoring application, hundreds or thousands of sensors

are usually deployed to monitor certain environmental factors. The abnormality of data from any of those sensors will

indicate a general abnormality in those environmental factors. In many cases, it is not of interest to identify which sensors

have caused the alarm. Since it is not required to identify abnormal data streams for this type of monitoring scheme, a

popular approach for developing monitoring schemes of this type is to use a single global monitoring statistic to track all

data streams jointly. As a result, a global monitoring statistic that is powerful for detecting any abnormal scenario is the

key for developing any efficient monitoring scheme of this type.

The second type of monitoring scheme is for applications in which changes in different data streams indicate different

problems in the system, requiring unique corrective actions. As a result, this type of monitoring scheme needs to identify

which data stream is experiencing abnormal activities. For example, in the network traffic surveillance application, the

network traffic data from different data streams are associated with different IP addresses. When some abnormality in

the data occurs, identifying which data stream or IP address has caused the problem is very important, as doing so helps

Qual Reliab Engng Int. 2019;1–15.

wileyonlinelibrary.com/journal/qre


pinpoint the cause and guides future corrective actions. As pointed out in Li,¹ the following two-stage strategy can be

very effective for developing this type of monitoring scheme. In the first stage, a global monitoring statistic is used to decide

whether there is any abnormal data stream. If this is the case, the second stage is activated, and a local monitoring statistic

is used to decide which data streams are abnormal. As shown in Li,¹ this two-stage strategy has better performance than

the one-stage strategy. On the basis of this two-stage strategy, it is evident that what global monitoring statistic to use in

the first stage becomes critical, since it will ultimately affect the effectiveness of the resulting monitoring scheme in terms

of how quickly it will raise an alarm when some of the data streams start experiencing abnormal activities.

From the above discussions, it is clear that an efficient global monitoring statistic is important for developing efficient monitoring schemes of both types. How to construct efficient global monitoring statistics for high-dimensional data streams has been an active research topic in the statistical process monitoring field. A number of global monitoring statistics have been proposed. However, most of them only work for certain types of abnormal scenarios under specific model assumptions. For example, under the assumption that each data stream follows a normal distribution and using the cumulative sum (CUSUM) statistic as the local monitoring statistic for each data stream, Tartakovsky et al² and Mei³ proposed using the maximum and sum of those CUSUM statistics, respectively, as the global monitoring statistic. It has been shown that the sum of those CUSUM statistics is more effective than the maximum when a moderate or large number of data streams are abnormal, and vice versa when only a few data streams are abnormal. In practice, it is usually unknown in advance how many data streams will be abnormal. Therefore, neither the maximum nor the sum of those CUSUM statistics guarantees robust performance as the global monitoring statistic. Xie and Siegmund⁴ recognized this limitation and proposed a global monitoring statistic derived from the likelihood function of a normal-mixture model. Besides being computationally intensive, their approach also requires pre-specifying the percentage of abnormal data streams. If the percentage is misspecified, their approach will sacrifice some power. Zou et al⁵ proposed an alternative way to combine the CUSUM statistics from all data streams to produce a single global monitoring statistic. This approach does not require prior knowledge about how many data streams are abnormal and performs well compared with the aforementioned methods across different abnormal scenarios. However, it is not clear how to extend their approach to develop computationally efficient global monitoring statistics for situations where the local monitoring statistics are not CUSUM statistics. Recently, Liu et al⁶ proposed several global monitoring statistics via the so-called SUM-shrinkage technique. The SUM-shrinkage technique they use is essentially a thresholding method. Instead of taking the sum of all the local monitoring statistics as proposed in Mei,³ their new global monitoring statistics take the sum of only those that exceed some pre-specified threshold. As expected, the detection power of their monitoring schemes depends on the pre-specified threshold. In their paper, they studied several choices of threshold. However, each one works well only for certain types of abnormal scenarios. Similar to other thresholding-based methods, it is impossible to find a single threshold in their proposed global monitoring statistics that will work well for different types of abnormal scenarios.

In addition to the limitations described above, most of the existing approaches were developed under the normality

assumption. When this normality assumption does not hold, none of the above approaches will perform as expected.

All of this indicates the need to develop flexible global monitoring statistics that can work with any type of data and

be effective for different types of abnormal scenarios. To meet this need, we propose a novel class of global monitoring

statistics by making use of the order statistics of the local monitoring statistics. The unique use of the order statistics

makes our proposed monitoring statistics efficient for different abnormal scenarios, as shown in our simulation studies.

Our proposed global monitoring statistics are easy to calculate and can work under flexible model assumptions since they

can be built on any local monitoring statistic that is suitable for monitoring a single data stream. There is vast literature

on how to monitor a single data stream. Therefore, the proposed class of global monitoring statistics can easily benefit

from this rich literature.

The rest of the paper is organized as follows. In Section 2, we propose a general class of global monitoring statistics and

show that the global monitoring statistic studied in Zou et al⁵ can be considered as a special case of our proposed global

monitoring statistic. In Section 3, we give three examples of our proposed global monitoring statistic from the general class

and evaluate their performance through simulation studies. Finally, we provide some concluding remarks in Section 4.

2 METHODOLOGY

2.1 Notation

The setup for our high-dimensional data monitoring problem is the following. There are m data streams in the system. We denote the observation from the ith data stream at time t by X_i,t, i = 1, … , m, t = 1, 2, … . Since a time series model can be used to remove the temporal correlation within each data stream and a spatial model can be used to remove the spatial correlation between data streams before applying monitoring schemes, following most papers on the topic, we assume that the X_i,t are independent both within and between data streams. When the system is in control (IC), the underlying distribution of {X_i,1, X_i,2, … } (i = 1, … , m) is called the IC distribution, denoted by F_0,i. Following this setup, at a given time t, we observe X_i,1, X_i,2, … , X_i,t, i = 1, … , m. The task of our online monitoring scheme at time t is to determine if the distribution of X_i,1, X_i,2, … , X_i,t is the same as F_0,i for all i = 1, 2, … , m.

The above task can be carried out by tracking a global monitoring statistic G_t, which contains information collected from all data streams up to time t. If G_t is within the preset control limit, we will declare that all data streams are IC and continue monitoring. If G_t exceeds the control limit, we will raise an alarm suggesting that some of the data streams are out of control (OC). In the following, we propose a novel class of global monitoring statistics that can work with any type of data and be effective for different types of OC scenarios.

2.2 Proposed global monitoring statistics

Because the change point can happen at different times for different data streams, a popular approach in the literature for developing the global monitoring statistic G_t is to first choose an appropriate local monitoring statistic for tracking each data stream and then combine those local monitoring statistics in a way that produces a single global monitoring statistic. We will follow this approach. More specifically, let W_i,t be the local monitoring statistic for the ith data stream at time t that summarizes the evidence regarding a possible local change based on the observations X_i,1, … , X_i,t. Without loss of generality, we assume that a larger W_i,t indicates a higher probability of the ith data stream being OC. Although our proposed global monitoring statistic G_t can work with any choice of W_i,t, in order for G_t to be efficient for detecting changes in any data stream, the W_i,t should be chosen to be efficient for detecting local changes. Choosing a good local monitoring statistic W_i,t is equivalent to choosing an appropriate monitoring statistic for a univariate data stream. There is rich literature on this topic (see, for example, Qiu⁷), so an appropriate W_i,t can readily be found for any particular application. Therefore, in the following, we assume that the W_i,t have been constructed, and our focus is how to combine these local monitoring statistics W_i,t into a powerful global monitoring statistic.

Note that at any time t, we have calculated W_1,t, … , W_m,t. Without loss of generality, we assume that the W_i,t are independent and identically distributed when the system is IC. As mentioned in Section 1, Liu et al⁶ recently proposed a SUM-shrinkage approach to construct the global monitoring statistic based on W_1,t, … , W_m,t. In their approach, W_1,t, … , W_m,t are compared with some pre-specified threshold, and only those that exceed the threshold are used to construct the global test statistic. However, similar to all the other thresholding methods, it is impossible to choose a threshold in advance that works well for all OC scenarios.

Instead of comparing W_1,t, … , W_m,t with some pre-specified threshold, we propose to compare their order statistics with their respective expected values when the system is IC. More specifically, let W_(1),t ≤ W_(2),t ≤ … ≤ W_(m),t be the order statistics of W_1,t, … , W_m,t. Note that W_(i),t can also be considered as the observed (i − 3∕4)∕(m − 1∕2) quantile of the underlying distribution of the W_i,t. Here, (i − 3∕4)∕(m − 1∕2) is the common continuity correction of i∕m. Let q_(i),t denote the expected (i − 3∕4)∕(m − 1∕2) quantile of the IC distribution of the W_i,t. Then, q_(i),t can be considered as the expected value of W_(i),t. A natural statistic that summarizes the differences between the W_(i),t and their respective expected values q_(i),t is simply ∑_{i=1}^{m} (W_(i),t − q_(i),t)². Since a larger W_i,t indicates a higher probability of the ith data stream being OC, W_(i),t may indicate abnormality in the system only when it is larger than its expected value q_(i),t. Therefore, we only include the difference when W_(i),t is larger than q_(i),t in our global monitoring statistic, and the new class of global monitoring statistics we propose is

$$G_t = \sum_{i=1}^{m} \left(W_{(i),t} - q_{(i),t}\right)^2 I_{\{W_{(i),t} > q_{(i),t}\}}, \tag{1}$$

where I_{A} is the indicator function, which takes the value 1 if A is true and 0 otherwise. Our proposed monitoring scheme then plots G_t over time t and raises an alarm if G_t > h, where h is the control limit predetermined by the desired IC average run length (denoted by ARL₀).
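As an illustration of (1), the following minimal Python sketch computes G_t from a vector of local statistics and a vector of expected IC quantiles, and applies the alarm rule G_t > h. The function names `global_statistic` and `raises_alarm` are hypothetical, and NumPy is an assumed dependency; neither comes from the paper.

```python
import numpy as np


def global_statistic(w, q):
    """Sketch of G_t in (1).

    w: local monitoring statistics W_{1,t}, ..., W_{m,t} (any order).
    q: expected IC quantiles q_{(1),t} <= ... <= q_{(m),t}.
    """
    w_sorted = np.sort(np.asarray(w, dtype=float))  # order statistics W_{(i),t}
    q = np.asarray(q, dtype=float)
    if w_sorted.shape != q.shape:
        raise ValueError("w and q must have the same length m")
    # The indicator keeps only the terms with W_{(i),t} > q_{(i),t}.
    excess = np.clip(w_sorted - q, 0.0, None)
    return float(np.sum(excess ** 2))


def raises_alarm(w, q, h):
    """Alarm rule: signal when G_t exceeds the control limit h."""
    return global_statistic(w, q) > h
```

Sorting dominates the cost, so one update takes O(m log m) operations for m data streams.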

At first glance, our proposed global monitoring statistic G_t is a sum-type statistic, so it is expected to be effective when a moderate or large number of data streams are abnormal. As shown in our simulation studies in Section 3, however, G_t is efficient not only for a large number of abnormal data streams but also for a few abnormal data streams. The reason why G_t can be efficient when a few data streams are abnormal is the following. In general, the extreme order statistics W_(i),t with small i or large i have larger variabilities than the order statistics in the middle. Therefore, although the expression in (1) is in the form of an unweighted sum of squares, if we take into account the variabilities of different order statistics W_(i),t, G_t is actually a weighted sum of squares, with the more extreme order statistics receiving larger weight. This weighting scheme makes G_t also sensitive to a few abnormal data streams, since those abnormal data streams will most likely drive up the extreme order statistics first. Therefore, despite its simple form, our proposed global monitoring statistic G_t has a built-in adaptive mechanism that is capable of adapting to different types of abnormal scenarios.

As seen above, the proposed new class of global monitoring statistics is very general and can work with any local monitoring statistic W_i,t that is suitable for the particular application in mind. Therefore, it can find wide-ranging applications in the real world and offer promising solutions to various statistical process monitoring problems. As shown in our simulation studies reported in Section 3, this class of global monitoring statistics performs very well across different OC scenarios under different model assumptions.

To calculate the above G_t, the expected quantiles q_(i),t (i = 1, … , m) of the IC distribution of the W_i,t are needed. When the IC distribution of the W_i,t is from some well-known distribution family, the q_(i),t can be easily obtained from that distribution family. When the IC distribution of the W_i,t is not from any well-known distribution family, which is most often the case, we can first simulate a random sample from the IC distribution of the W_i,t and then use the sample quantiles of this random sample to approximate the corresponding q_(i),t. We will give several examples on how to obtain those approximations in the next section. It should be noted that obtaining approximations of the q_(i),t will be carried out offline before the online monitoring starts and the values will be stored beforehand. As a result, the total online computational effort in calculating G_t is the same as calculating ∑_{i=1}^{m} (W_(i),t − a_i)² I_{W_(i),t > a_i} with all the a_i given. Therefore, it is computationally simple to implement the proposed method online for monitoring high-dimensional data streams.
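The offline step described above can be sketched as follows, assuming a random sample from the IC distribution of the local statistic is already available. The helper names `quantile_levels` and `approximate_ic_quantiles` are hypothetical, and the use of NumPy's default linear-interpolation sample quantile is an assumption; the paper does not specify a quantile estimator.

```python
import numpy as np


def quantile_levels(m):
    """Probability levels (i - 3/4) / (m - 1/2) for i = 1, ..., m."""
    i = np.arange(1, m + 1)
    return (i - 0.75) / (m - 0.5)


def approximate_ic_quantiles(ic_sample, m):
    """Approximate q_(1), ..., q_(m) by sample quantiles of an IC sample.

    ic_sample: simulated draws from the IC distribution of the local statistic.
    """
    ic_sample = np.asarray(ic_sample, dtype=float)
    return np.quantile(ic_sample, quantile_levels(m))
```

The returned vector would be computed once and stored, so that the online step only needs a sort and a clipped sum of squares.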

2.3 A special case: The global monitoring statistic proposed by Zou et al⁵

Assume that all the IC distributions F_0,i are the normal distribution with mean 0 and variance 1 (denoted by N(0, 1)) and the OC distribution of the ith data stream (i = 1, … , m) is also a normal distribution, with mean μ_i and variance 1 (denoted by N(μ_i, 1)). Under those assumptions, an optimal local monitoring statistic for each data stream is the CUSUM statistic. To detect a positive mean shift, the CUSUM statistic for the ith data stream is defined as

$$\begin{cases} S^{+}_{i,0} = 0, \\ S^{+}_{i,t} = \max\left(0,\; S^{+}_{i,t-1} + \mu_i\left(X_{i,t} - \tfrac{1}{2}\mu_i\right)\right), & \text{for } t \ge 1. \end{cases} \tag{2}$$
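The recursion in (2) can be sketched in a few lines of Python. The names `cusum_update` and `cusum_path` are hypothetical, and the vectorized form (one entry per data stream) is an implementation choice rather than something taken from the paper.

```python
import numpy as np


def cusum_update(s_prev, x, mu):
    """One step of (2): max(0, S_{t-1} + mu * (x_t - mu / 2)).

    s_prev, x, and mu may be scalars or arrays with one entry per stream.
    """
    return np.maximum(0.0, s_prev + mu * (x - 0.5 * mu))


def cusum_path(x, mu, s0=0.0):
    """CUSUM values S_1, ..., S_T for a single stream, starting from s0."""
    s = s0
    path = []
    for x_t in np.asarray(x, dtype=float):
        s = float(cusum_update(s, x_t, mu))
        path.append(s)
    return np.array(path)
```

For example, with mu = 1 and observations 1, 0, 2, the statistic rises to 0.5, resets to 0, and then rises to 1.5.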

Let H_i,t(·) denote the cumulative distribution function of S⁺_i,t when the ith data stream is IC. Then, define U_i,t = H_i,t(S⁺_i,t), i = 1, … , m, and their order statistics are U_(1),t ≤ … ≤ U_(m),t. Utilizing one of the goodness-of-fit test statistics developed by Zhang,⁸ Zou et al⁵ proposed the following global monitoring statistic:

$$G^{Z}_t = \sum_{i=1}^{m} \left\{ \log\left[ \frac{U_{(i),t}^{-1} - 1}{(m - 1/2)/(i - 3/4) - 1} \right] \right\}^2 I_{\{U_{(i),t} > (i - 3/4)/(m - 1/2)\}}. \tag{3}$$

In the following, we show that G^Z_t can be considered as a special case of our proposed global monitoring statistic G_t in (1). To see this, note that Q(p) = − log(p⁻¹ − 1) is the quantile function of the standard logistic distribution. If we assume that the OC means μ_i, i = 1, … , m, are all equal, then − log(U_(i),t⁻¹ − 1) can be considered as the observed (i − 3∕4)∕(m − 1∕2) quantile of the underlying distribution of S⁺_i,t on the scale of the standard logistic distribution. Similarly, − log([(i − 3∕4)∕(m − 1∕2)]⁻¹ − 1) is the expected (i − 3∕4)∕(m − 1∕2) quantile of the underlying distribution of S⁺_i,t on the scale of the standard logistic distribution. As a result, if we choose W_i,t to be − log(U_i,t⁻¹ − 1), with W_(i),t = − log(U_(i),t⁻¹ − 1) and q_(i),t = − log([(i − 3∕4)∕(m − 1∕2)]⁻¹ − 1), our proposed global monitoring statistic G_t in (1) reduces to G^Z_t in (3).
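The reduction of (1) to (3) can be checked numerically. The sketch below is only an illustration with made-up U values; the function names (`levels`, `logistic_quantile`, `g_general`, `g_zou`) are hypothetical. It evaluates (1) with the logistic-scale choices of W_i,t and q_(i),t and compares the result with a direct evaluation of (3).

```python
import numpy as np


def levels(m):
    """Probability levels (i - 3/4) / (m - 1/2), i = 1, ..., m."""
    i = np.arange(1, m + 1)
    return (i - 0.75) / (m - 0.5)


def logistic_quantile(p):
    """Q(p) = -log(1/p - 1), the standard logistic quantile function."""
    return -np.log(1.0 / p - 1.0)


def g_general(w, q):
    """Statistic (1) for local statistics w and expected quantiles q."""
    excess = np.clip(np.sort(w) - q, 0.0, None)
    return float(np.sum(excess ** 2))


def g_zou(u):
    """Statistic (3) evaluated directly from U_{1,t}, ..., U_{m,t}."""
    u_sorted = np.sort(u)
    p = levels(len(u_sorted))
    log_ratio = np.log((1.0 / u_sorted - 1.0) / (1.0 / p - 1.0))
    return float(np.sum(log_ratio ** 2 * (u_sorted > p)))


# Made-up U values, used only to exercise the identity.
u = np.array([0.90, 0.20, 0.50, 0.70, 0.99])
lhs = g_general(logistic_quantile(u), logistic_quantile(levels(len(u))))
rhs = g_zou(u)
```

Because Q is increasing, sorting the transformed values is the same as transforming the sorted U values, and W_(i),t > q_(i),t holds exactly when U_(i),t > (i − 3∕4)∕(m − 1∕2), so `lhs` and `rhs` agree up to floating-point error.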

The above shows that G^Z_t is a special case of our proposed global monitoring statistic G_t. Therefore, G^Z_t can also be considered as the sum of the squared differences between the observed quantiles and expected quantiles. Theoretically, G^Z_t can be modified using local monitoring statistics other than the above CUSUM statistics S⁺_i,t. However, the quantiles used in G^Z_t are on the scale of the standard logistic distribution. To obtain those quantiles, the local monitoring statistics have to be transformed to U_i,t based on their underlying IC distribution. For many commonly used local monitoring statistics, there is no analytical form available for this transformation, and it has to be approximated through the Markov chain method or Monte Carlo simulation, which can be time-consuming, especially for high-dimensional data monitoring. This greatly restricts the applicability of G^Z_t to other settings. Therefore, in Zou et al,⁵ all the analysis was limited to using S⁺_i,t in (2) as the local monitoring statistic, since a closed-form formula to approximate U_i,t is available in this setting thanks to Grigg and Spiegelhalter.⁹ In contrast, the quantiles used in our proposed G_t can be directly determined by any local monitoring statistics W_i,t, which makes our G_t more versatile and more computationally efficient than G^Z_t for monitoring high-dimensional data.

3 EXAMPLES

In Section 2, we proposed a general class of global monitoring statistics, which can be built on any local monitoring statistic.

In this section, we provide three examples of our proposed global monitoring statistic from this general class and compare

their performance with that of other existing global monitoring statistics.

3.1 Known prechange and postchange distributions

In our first example, we assume that the distributions before and after the change are N(0, 1) and N(μ, 1), respectively, for all the data streams, where μ is the postchange mean and is completely specified. Under this setting, we can use the CUSUM statistic S⁺_i,t defined in (2) with μ_i = μ as the local monitoring statistic.

On the basis of this local monitoring statistic, the global monitoring statistic G^Z_t defined in (3) can be used to monitor the m data streams jointly. As mentioned earlier, to implement G^Z_t, it is important that U_i,t can be calculated quickly. Grigg and Spiegelhalter⁹ developed an empirical approximation to the IC steady-state distribution of the CUSUM statistic S⁺_i,t. Their result can be used to obtain a closed-form formula to calculate U_i,t. Since this formula only works when S⁺_i,t reaches its steady state, to make use of this formula, we modify the definition of the CUSUM statistic slightly. Instead of starting the CUSUM statistic at 0, ie, S⁺_i,0 = 0 as in (2), we start the CUSUM statistic at some value randomly drawn from the IC steady-state distribution of S⁺_i,t. More specifically, we first generate 10⁵ independent sequences {X_k,1, … , X_k,2000} (k = 1, … , 10⁵), each of which is independently drawn from N(0, 1), and calculate S⁺_k,2000 as in (2). Then, {S⁺_k,2000}_{k=1}^{10⁵} can serve as a random sample from the IC steady-state distribution of S⁺_i,t. Our modified CUSUM statistic is then defined as follows. For i = 1, … , m,

$$\begin{cases} S^{+*}_{i,0} = V_i, \\ S^{+*}_{i,t} = \max\left(0,\; S^{+*}_{i,t-1} + \mu\left(X_{i,t} - \tfrac{1}{2}\mu\right)\right), & \text{for } t \ge 1, \end{cases} \tag{4}$$

where V_i is randomly drawn with replacement from {S⁺_k,2000}_{k=1}^{10⁵}. The global monitoring statistic G^Z_t in (3) is then calculated using U_i,t = H^∗(S^+∗_i,t), where H^∗(·) is the IC distribution of S^+∗_i,t. Since S^+∗_i,t starts from the steady state, H^∗(·) at any time t is the IC steady-state distribution. As a result, we can utilize the closed-form formula provided in Grigg and Spiegelhalter⁹ to calculate the above U_i,t quickly.
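The steady-state start in (4) can be sketched as below. The function names `steady_state_pool` and `start_modified_cusum` are hypothetical, and the pool is advanced one time step at a time so that the full set of sequences never has to be held in memory; this is an implementation choice, not a detail taken from the paper.

```python
import numpy as np


def steady_state_pool(mu, n_seq, n_steps, rng):
    """Run n_seq independent IC CUSUMs (2) for n_steps steps from 0.

    The final values serve as an approximate sample from the IC
    steady-state distribution of the CUSUM statistic.
    """
    s = np.zeros(n_seq)
    for _ in range(n_steps):
        x = rng.standard_normal(n_seq)  # IC observations from N(0, 1)
        s = np.maximum(0.0, s + mu * (x - 0.5 * mu))
    return s


def start_modified_cusum(pool, m, rng):
    """Draw the starting values V_1, ..., V_m with replacement from the pool."""
    return rng.choice(pool, size=m, replace=True)
```

With the settings described in the text, one would call `steady_state_pool(mu, 10**5, 2000, rng)` once offline; much smaller values are enough to try the sketch.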

Similarly, we also use the above-modified CUSUM statistic S^+∗_i,t as W_i,t in our proposed global monitoring statistic G_t in (1). Because of this modification, the underlying distribution of W_i,t for any time t is the same as that of {S⁺_k,2000}_{k=1}^{10⁵} obtained above. Then, its expected quantiles q_(i),t also remain the same for any time t and can be well approximated by the corresponding sample quantiles of {S⁺_k,2000}_{k=1}^{10⁵}, which we denote by q̂^s_(i). Therefore, our proposed global monitoring statistic G_t in this particular setting is

$$G_t = \sum_{i=1}^{m} \left(S^{+*}_{(i),t} - \hat{q}^{\,s}_{(i)}\right)^2 I_{\{S^{+*}_{(i),t} > \hat{q}^{\,s}_{(i)}\}},$$

where S^+∗_(1),t ≤ … ≤ S^+∗_(m),t are the order statistics of S^+∗_1,t, … , S^+∗_m,t.

In Liu et al,⁶several global monitoring statistics based on hard thresholding, soft thresholding and order thresholding

were proposed. From their simulations, the soft-thresholding method seems to work the best. Therefore, we only include

their soft-thresholding-based global monitoring statistic in the following simulation study for performance comparison.

To be consistent with the above G^Z_t and G_t, we also calculate their global monitoring statistic based on the above-modified CUSUM statistic S^+∗_i,t, which is defined as

$$G^{L}_t = \sum_{i=1}^{m} \max\left\{S^{+*}_{i,t} - b,\; 0\right\},$$

where b is the thresholding constant. Following Liu et al,⁶ three choices of b are considered: (a) b₁ = 1∕2; (b) b₂ = log(10) = 2.3026; and (c) b₃ = log(100) = 4.6052.
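For comparison, the soft-thresholding statistic G^L_t is straightforward to compute. In the sketch below, the function name `soft_threshold_statistic` and the constant `B_CHOICES` are hypothetical; the three thresholds are the values b₁, b₂, and b₃ stated in the text.

```python
import numpy as np

# The three thresholding constants considered: 1/2, log(10), log(100).
B_CHOICES = (0.5, float(np.log(10.0)), float(np.log(100.0)))


def soft_threshold_statistic(s, b):
    """G^L_t: sum over streams of max(S_{i,t} - b, 0)."""
    s = np.asarray(s, dtype=float)
    return float(np.sum(np.maximum(s - b, 0.0)))
```

For local statistics (0.2, 1.0, 5.0) and b = 1∕2, the first stream contributes nothing and the others contribute 0.5 and 4.5, which illustrates how a small b lets moderately large streams enter the sum.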


• Simulation study

In the following, we report a simulation study to compare the performance of G_t, G^Z_t, and G^L_t. The general simulation settings are the following. Among the m data streams, m₀ data streams are from the IC distribution N(0, 1), and the remaining m₁ = m − m₀ data streams are from the OC distribution N(0.5, 1). We consider two choices of m, m = 100 and 1000, and Table 2 lists the corresponding choices of m₁ for these two choices of m.

In our simulation study, we construct the monitoring scheme by tracking G_t, G^Z_t, and G^L_t, respectively. If G_t, G^Z_t, or G^L_t exceeds its respective control limit h, the corresponding monitoring scheme will stop the monitoring and raise an alarm. The control limit h for G_t, G^Z_t, and G^L_t can be obtained through Monte Carlo simulation to satisfy the ARL₀ requirement. The desired ARL₀ for all the monitoring schemes is set at 1000. The control limits h for G_t, G^Z_t, and G^L_t for different values of m are listed in Table 1.

Using those control limits, the monitoring schemes based on G_t, G^Z_t, and G^L_t are then used to monitor the above m data streams with m₁ of them being OC. Since those OC data streams have changed from their IC distributions from the very beginning, the detection power of the monitoring schemes based on G_t, G^Z_t, and G^L_t can be compared using the average time for the monitoring scheme to raise an alarm, ie, the average run length (denoted by ARL₁). Table 2 reports the ARL₁ of the monitoring schemes based on G_t, G^Z_t, and G^L_t for different settings from 2500 simulations. The standard deviations of the run lengths from the 2500 simulations are also included in parentheses, and the standard errors of the ARL₁ are simply those standard deviations divided by 50 (the square root of 2500).

TABLE 1 The control limits h of the monitoring schemes based on G_t, G^Z_t, and G^L_t when ARL₀ = 1000

m       G_t      G^Z_t    G^L_t (b₁)   G^L_t (b₂)   G^L_t (b₃)
100     20.798   28.570   69.496       19.303       5.513
1000    25.041   32.593   526.599      108.212      19.413

TABLE 2 The ARL₁ of the monitoring schemes based on G_t, G^Z_t, and G^L_t from 2500 simulations

m      m₁     G_t             G^Z_t           G^L_t (b₁)        G^L_t (b₂)       G^L_t (b₃)
100    1      64.44 (32.88)   68.04 (33.15)   110.26 (55.02)    81.22 (40.72)    62.71 (31.84)
100    3      36.20 (14.95)   38.17 (15.01)   48.44 (21.88)     38.23 (15.92)    35.74 (13.62)
100    5      27.08 (10.53)   28.75 (10.96)   32.55 (14.35)     27.26 (10.53)    28.65 (9.89)
100    8      20.33 (7.26)    20.77 (7.76)    21.19 (8.94)      19.67 (7.06)     23.04 (7.53)
100    10     17.42 (6.11)    17.63 (6.64)    17.33 (7.21)      17.01 (5.93)     21.08 (6.52)
100    20     10.64 (3.67)    10.04 (3.84)    9.43 (3.61)       11.25 (3.52)     16.12 (4.56)
100    50     4.88 (1.57)     3.91 (1.43)     4.15 (1.40)       6.52 (1.97)      11.23 (3.33)
100    80     3.25 (0.97)     2.37 (0.82)     2.81 (0.93)       4.99 (1.54)      9.48 (2.81)
100    100    2.72 (0.77)     1.89 (0.59)     2.37 (0.74)       4.32 (1.34)      8.56 (2.63)
1000   1      81.56 (37.40)   86.87 (38.64)   214.47 (113.31)   148.94 (74.62)   98.34 (44.36)
1000   3      51.83 (18.91)   53.97 (19.81)   106.37 (49.85)    73.70 (31.93)    53.60 (20.03)
1000   5      41.89 (13.58)   43.36 (14.39)   73.63 (32.83)     52.33 (20.96)    41.54 (13.64)
1000   8      33.90 (10.30)   35.47 (11.02)   52.97 (21.97)     38.89 (14.17)    33.70 (9.99)
1000   10     30.19 (8.96)    31.58 (9.63)    43.87 (17.80)     33.02 (11.62)    30.07 (8.80)
1000   20     21.04 (6.24)    21.60 (6.60)    24.74 (9.58)      20.88 (6.69)     22.54 (5.85)
1000   50     11.95 (3.67)    11.50 (3.88)    11.01 (4.03)      11.88 (3.30)     15.82 (3.70)
1000   80     8.47 (2.54)     7.65 (2.69)     7.18 (2.55)       8.94 (2.43)      13.09 (3.09)
1000   100    7.05 (2.13)     6.16 (2.19)     5.83 (2.01)       7.85 (2.07)      12.14 (2.71)
1000   150    5.04 (1.43)     4.08 (1.45)     4.09 (1.36)       6.16 (1.65)      10.34 (2.46)
1000   200    3.96 (1.13)     3.05 (1.06)     3.20 (1.04)       5.19 (1.45)      9.16 (2.23)
1000   300    2.78 (0.80)     2.07 (0.64)     2.35 (0.71)       4.14 (1.13)      7.80 (2.00)
1000   400    2.21 (0.60)     1.61 (0.53)     1.92 (0.57)       3.46 (0.96)      6.86 (1.82)
1000   500    1.88 (0.46)     1.30 (0.46)     1.65 (0.51)       3.04 (0.86)      6.18 (1.68)
1000   600    1.67 (0.48)     1.08 (0.28)     1.45 (0.50)       2.71 (0.80)      5.65 (1.57)
1000   700    1.46 (0.50)     1.02 (0.13)     1.28 (0.45)       2.47 (0.72)      5.22 (1.50)
1000   800    1.26 (0.44)     1.00 (0.03)     1.16 (0.37)       2.27 (0.68)      4.92 (1.40)
1000   900    1.09 (0.29)     1.00 (0.00)     1.07 (0.25)       2.11 (0.62)      4.60 (1.32)
1000   1000   1.03 (0.17)     [missing]       1.03 (0.17)       1.97 (0.59)      4.31 (1.32)

Note. The standard deviations of the run lengths from the 2500 simulations are reported in parentheses.


For the monitoring schemes based on G^L_t, the bold number in each row of Table 2 represents the smallest ARL₁ among the three choices of b for that particular OC scenario. As the table shows, the detection power of G^L_t depends on the choice of b. If b is too small, G^L_t is not powerful when only a few data streams are OC, since many IC data streams may exceed b and the signal in G^L_t will be diluted by including those IC data streams. Similarly, if b is too large, G^L_t is not powerful when many data streams are OC, since many of them may not exceed b and hence do not contribute to G^L_t. This is consistent with what is known for any thresholding-based method. However, in practice, it is rarely known in advance how many data streams will be OC, which makes it extremely difficult to come up with an appropriate b for G^L_t in real applications. In contrast, neither G_t nor G^Z_t depends on any tuning parameter, and both perform well across different OC scenarios, with detection delays always close to those of G^L_t with the best choice of b.

When further comparing G_t with G^Z_t, we notice that our G_t is better when a small number of data streams are OC, while G^Z_t is better when a large number of data streams are OC. This can be explained as follows. As described in Section 2, both of the global monitoring statistics can be viewed as the sum of the squared differences between the observed quantiles and expected quantiles. G^Z_t is based on the quantiles from the standard logistic distribution, while our G_t is based on the quantiles from the distribution of the CUSUM statistic S^+∗_i,t in (4). As shown in Grigg and Spiegelhalter,⁹ the tail of the IC distribution of S^+∗_i,t resembles that of an exponential distribution, which implies that the IC distribution of S^+∗_i,t has a heavier tail than the standard logistic distribution. As a result, the extreme quantiles contribute more in G_t than in G^Z_t. This explains why G_t performs better than G^Z_t when a small number of data streams are OC and vice versa when a large number of data streams are OC.

3.2 Known prechange distribution but unknown postchange distribution

In the previous example, in order to use the CUSUM statistic as the local monitoring statistic, the distribution after the change needs to be completely specified. In some real-world applications, prior knowledge of the postchange distribution may not be available. In our second example, we consider the setting where the postchange distribution is unknown. More specifically, we assume that the OC distribution of the ith data stream is N(μ_i, 1) with μ_i unknown and the IC distributions of all data streams are still N(0, 1). To obtain the specific form of our proposed global monitoring statistic G_t, the key is to find the appropriate local monitoring statistic W_i,t in this setting. There exist a few options for such a statistic in the statistical process monitoring literature. For example, Sparks¹⁰ proposed an adaptive CUSUM statistic, and Han and Tsung¹¹ developed a reference-free cumulative score statistic. In both methods, instead of using the specified μ_i in the CUSUM statistic defined in (2), an estimate of μ_i is plugged in. In Sparks,¹⁰ an exponentially weighted moving average (EWMA) of all the past observations is used to estimate μ_i, while, in Han and Tsung,¹¹ the absolute value of the current observation, |X_i,t|, is used as the estimate of μ_i. Following the same idea, Lorden and Pollak¹² proposed another estimate of μ_i to replace μ_i in the CUSUM statistic in (2) and proved the asymptotic optimality of the resulting monitoring statistic. Since the CUSUM statistic in (2) is only for detecting positive mean shifts, the monitoring statistic developed in Lorden and Pollak¹² is also only for positive mean shifts. Recently, Liu et al⁶ extended Lorden and Pollak's monitoring statistic to detect both positive and negative mean shifts. In the following, we use this two-sided monitoring statistic in Liu et al⁶ as our local monitoring statistic W_i,t.

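More specifically, define, for t ≥ 1,

$$C^{(1)}_{i,t} = \max\left(0,\; C^{(1)}_{i,t-1} + \hat\mu^{(1)}_{i,t}\left(X_{i,t} - \tfrac{1}{2}\hat\mu^{(1)}_{i,t}\right)\right),$$

$$C^{(2)}_{i,t} = \max\left(0,\; C^{(2)}_{i,t-1} + \hat\mu^{(2)}_{i,t}\left(X_{i,t} - \tfrac{1}{2}\hat\mu^{(2)}_{i,t}\right)\right),$$

where $\hat\mu^{(1)}_{i,t}$ and $\hat\mu^{(2)}_{i,t}$ are the estimates of $\mu_i$ for the positive mean shift and negative mean shift, respectively, and they are given by

$$\hat\mu^{(1)}_{i,t} = \max\left(\rho,\; \frac{s + S^{(1)}_{i,t}}{t + T^{(1)}_{i,t}}\right) > 0, \qquad \hat\mu^{(2)}_{i,t} = \min\left(-\rho,\; \frac{-s + S^{(2)}_{i,t}}{t + T^{(2)}_{i,t}}\right) < 0.$$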

In the above estimates, $\rho$ is the prespecified smallest mean shift that is meaningful, and $s$ and $t$ are also prespecified nonnegative constants (here $t$ is the constant added to $T^{(j)}_{i,t}$ in the denominators of the estimates, not the time index) that can be considered as a prior, so that the above estimates can be treated as Bayes-type estimates. In our simulation studies, we choose $\rho = 0.25$, $s = 1$, and $t = 4$ as in Liu et al.⁶ For $j = 1, 2$, the sequences $(S^{(j)}_{i,t}, T^{(j)}_{i,t})$ are calculated recursively:

$$S^{(j)}_{i,t} = \begin{cases} S^{(j)}_{i,t-1} + X_{i,t-1}, & \text{if } C^{(j)}_{i,t-1} > 0, \\ 0, & \text{if } C^{(j)}_{i,t-1} = 0, \end{cases}$$

$$T^{(j)}_{i,t} = \begin{cases} T^{(j)}_{i,t-1} + 1, & \text{if } C^{(j)}_{i,t-1} > 0, \\ 0, & \text{if } C^{(j)}_{i,t-1} = 0. \end{cases}$$

Finally, our local monitoring statistic $C_{i,t}$ is simply

$$C_{i,t} = \max\left(C^{(1)}_{i,t},\, C^{(2)}_{i,t}\right).$$
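The recursions for the two-sided local statistic can be sketched in Python as follows. This is only an illustrative reading of the definitions above with zero initial values; the function name and the argument name `t0` (used for the prior constant t to avoid a clash with the time index) are hypothetical and not from the paper.

```python
import numpy as np

def two_sided_adaptive_cusum(x, rho=0.25, s=1.0, t0=4.0):
    """Return C_{i,t} = max(C1, C2) for one stream; x[k] plays the role of X_{i,k+1}.

    Assumes all initial values (S, T, C, X_{i,0}) are 0.
    """
    S = [0.0, 0.0]      # running sums S^(1), S^(2)
    T = [0.0, 0.0]      # running counts T^(1), T^(2)
    C = [0.0, 0.0]      # one-sided statistics C^(1), C^(2)
    x_prev = 0.0        # X_{i,t-1}
    out = []
    for xt in x:
        # Update (S, T) from the previous observation, resetting after C hit 0.
        for j in range(2):
            if C[j] > 0:
                S[j] += x_prev
                T[j] += 1.0
            else:
                S[j] = 0.0
                T[j] = 0.0
        # Bayes-type estimates of the positive and negative shift sizes.
        mu1 = max(rho, (s + S[0]) / (t0 + T[0]))
        mu2 = min(-rho, (-s + S[1]) / (t0 + T[1]))
        # CUSUM updates with the estimated shift plugged in.
        C[0] = max(0.0, C[0] + mu1 * (xt - 0.5 * mu1))
        C[1] = max(0.0, C[1] + mu2 * (xt - 0.5 * mu2))
        out.append(max(C))
        x_prev = xt
    return np.array(out)
```

For instance, `two_sided_adaptive_cusum(np.random.default_rng(0).standard_normal(200))` returns the path of the local statistic for one simulated IC stream.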

In Liu et al,⁶ the above monitoring statistic $C_{i,t}$ starts from the following initial values:

$$S^{(1)}_{i,0} = S^{(2)}_{i,0} = T^{(1)}_{i,0} = T^{(2)}_{i,0} = C^{(1)}_{i,0} = C^{(2)}_{i,0} = X_{i,0} = 0.$$

Using those initial values, the IC distribution of the $C_{i,t}$ will change over time before it reaches its steady state. Recall that, to implement our proposed global monitoring statistic $G_t$, the expected quantiles $q_{(i),t}$ of the IC distribution of the $C_{i,t}$ are needed. If the IC distribution of the $C_{i,t}$ changes over time, then we need to calculate and store $\{q_{(i),t}\}_{i=1}^{m}$ for each t. To simplify our procedure, similarly to how we modified the original CUSUM statistic in the previous section, we propose to set the initial values $(S^{(1)}_{i,0}, S^{(2)}_{i,0}, T^{(1)}_{i,0}, T^{(2)}_{i,0}, C^{(1)}_{i,0}, C^{(2)}_{i,0}, X_{i,0})$ at some value randomly drawn from the IC steady-state distribution of $(S^{(1)}_{i,t}, S^{(2)}_{i,t}, T^{(1)}_{i,t}, T^{(2)}_{i,t}, C^{(1)}_{i,t}, C^{(2)}_{i,t}, X_{i,t})$. To obtain such initial values, we generate $10^5$ independent sequences $\{X_{k,1}, \ldots, X_{k,2000}\}$ ($k = 1, \ldots, 10^5$), each of which is independently drawn from $N(0, 1)$, and calculate $(S^{(1)}_{k,2000}, S^{(2)}_{k,2000}, T^{(1)}_{k,2000}, T^{(2)}_{k,2000}, C^{(1)}_{k,2000}, C^{(2)}_{k,2000}, X_{k,2000})$, using the initial value 0. Then, $\{(S^{(1)}_{k,2000}, S^{(2)}_{k,2000}, T^{(1)}_{k,2000}, T^{(2)}_{k,2000}, C^{(1)}_{k,2000}, C^{(2)}_{k,2000}, X_{k,2000})\}_{k=1}^{10^5}$ can be used to approximate the IC steady-state distribution of $(S^{(1)}_{i,t}, S^{(2)}_{i,t}, T^{(1)}_{i,t}, T^{(2)}_{i,t}, C^{(1)}_{i,t}, C^{(2)}_{i,t}, X_{i,t})$. The initial values to calculate our modified $C^*_{i,t}$ are then defined as

$$(S^{(1)}_{i,0}, S^{(2)}_{i,0}, T^{(1)}_{i,0}, T^{(2)}_{i,0}, C^{(1)}_{i,0}, C^{(2)}_{i,0}, X_{i,0}) = V_i,$$

where $V_i$ is randomly drawn from $\{(S^{(1)}_{k,2000}, S^{(2)}_{k,2000}, T^{(1)}_{k,2000}, T^{(2)}_{k,2000}, C^{(1)}_{k,2000}, C^{(2)}_{k,2000}, X_{k,2000})\}_{k=1}^{10^5}$ with replacement. Since $(S^{(1)}_{i,t}, S^{(2)}_{i,t}, T^{(1)}_{i,t}, T^{(2)}_{i,t}, C^{(1)}_{i,t}, C^{(2)}_{i,t}, X_{i,t})$ starts from the steady state, $C^*_{i,t}$ at any time t follows the IC steady-state distribution when the system is IC, and its expected quantiles $q_{(i),t}$ also remain the same for any time t. Then, $\{\max(C^{(1)}_{k,2000}, C^{(2)}_{k,2000})\}_{k=1}^{10^5}$ can be used to approximate the IC steady-state distribution of $C^*_{i,t}$, and the expected quantiles $q_{(i),t}$ of $C^*_{i,t}$ can be approximated by the corresponding sample quantiles of $\{\max(C^{(1)}_{k,2000}, C^{(2)}_{k,2000})\}_{k=1}^{10^5}$, which we denote by $\hat q^{c}_{(i)}$. Therefore, our proposed global monitoring statistic $G_t$ in this particular setting is

$$G_t = \sum_{i=1}^{m} \left(C^*_{(i),t} - \hat q^{c}_{(i)}\right)^2 I_{\{C^*_{(i),t} > \hat q^{c}_{(i)}\}},$$

where $C^*_{(1),t} \le \cdots \le C^*_{(m),t}$ are the order statistics of $C^*_{1,t}, \ldots, C^*_{m,t}$.
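As a rough illustration of how the global statistic is evaluated once a steady-state reference sample is available, the sketch below sorts the m local statistics and sums the squared exceedances over the estimated quantiles. The function names are hypothetical, and the plotting positions (i − 0.5)/m used to pick the sample quantiles are an assumption made here for illustration; the exact definition of the expected quantiles is the one given in Section 2.

```python
import numpy as np

def expected_quantiles(reference_sample, m):
    """Approximate the m expected quantiles from an IC steady-state sample.

    Assumption: the i-th quantile is taken at probability (i - 0.5) / m.
    """
    probs = (np.arange(1, m + 1) - 0.5) / m
    return np.quantile(np.asarray(reference_sample, dtype=float), probs)

def global_statistic(w, q_hat):
    """Sum of squared exceedances of the ordered local statistics over q_hat."""
    w_sorted = np.sort(np.asarray(w, dtype=float))
    # (W_(i) - q_(i))^2 * I{W_(i) > q_(i)} equals the squared positive part.
    excess = np.clip(w_sorted - np.asarray(q_hat, dtype=float), 0.0, None)
    return float(np.sum(excess ** 2))
```

The quantiles are computed once offline, so the online cost per time point is dominated by the sort.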

• Simulation study

Using the above modified local monitoring statistics $C^*_{i,t}$, theoretically, it is possible to define the global monitoring statistic $G^Z_t$ proposed by Zou et al⁵ in this setting accordingly. To implement this $G^Z_t$, it is important to have a closed-form formula for the cumulative distribution function of $C^*_{i,t}$. However, it is not easy to develop such a formula for the above $C^*_{i,t}$. Because of this computational difficulty of $G^Z_t$, in our simulation study, we only compare our global monitoring statistic $G_t$ defined above with the one proposed in Liu et al.⁶ Their global monitoring statistic based on soft thresholding in this particular setting is defined as

$$G^L_t = \sum_{i=1}^{m} \max\{C^*_{i,t} - b,\, 0\}.$$
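For comparison, the soft-thresholding statistic needs no ordering of the local statistics, only a threshold b. A minimal sketch (the function name is hypothetical) is:

```python
import numpy as np

def soft_threshold_statistic(w, b):
    """Sum over streams of max(W_i - b, 0) for a prespecified threshold b."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(np.maximum(w - b, 0.0)))
```

With local statistics (0.2, 1.0, 5.0), a small threshold such as b = 0.5 lets the moderate value 1.0 contribute, whereas b = log(100) keeps only the contribution of the largest value, which illustrates why the choice of b matters.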

Again, following Liu et al,⁶ three choices of b are considered: (a) $b_1 = 1/2$; (b) $b_2 = \log(10) = 2.3026$; and (c) $b_3 = \log(100) = 4.6052$.

TABLE 3 The control limits h of the monitoring schemes based on $G_t$ and $G^L_t$ when ARL₀ = 1000

| m | $G_t$ | $G^L_t$ ($b_1$) | $G^L_t$ ($b_2$) | $G^L_t$ ($b_3$) |
|---|---|---|---|---|
| 100 | 19.717 | 78.452 | 19.977 | 5.644 |
| 1000 | 24.096 | 617.353 | 114.857 | 19.787 |

The specific simulation settings are similar to those in Section 3.1. Among the m data streams, $m_0$ data streams are from the IC distribution $N(0, 1)$, and the remaining $m_1 = m - m_0$ data streams are from the OC distribution $N(\mu_i, 1)$, $i = 1, \ldots, m_1$, where $\mu_i$ is randomly drawn from $\{-0.5, 0.5\}$. We consider two choices of m: m = 100 and 1000, and Table 4 lists the corresponding choices of $m_1$ for these two choices of m.

Similar to the first simulation study reported in Section 3.1, the performance of $G_t$ and $G^L_t$ is compared based on the ARL₁ of their corresponding monitoring schemes. The desired ARL₀ for the $G_t$- and $G^L_t$-based monitoring schemes is set at 1000. The control limits h for those monitoring schemes, which are obtained through Monte Carlo simulation, are listed in Table 3.

On the basis of those control limits, the ARL₁ values of the $G_t$- and $G^L_t$-based monitoring schemes are obtained from 2500 simulations and are reported in Table 4. The standard deviations of the run lengths from the 2500 simulations are also included in parentheses. Again, for the $G^L_t$-based monitoring schemes, the bold number in each row represents the smallest ARL₁ among the three choices of b for that particular OC scenario. As we can see from the table, the ARL₁ of $G^L_t$ depends on the choice of b. $G^L_t$ with a small b does not perform well when a small number of data streams are OC, while $G^L_t$ with a large b does not perform well when a large number of data streams are OC. The explanation is similar to that given in Section 3.1. Since it is rarely known in advance how many data streams will be OC in practice, it is extremely difficult to come up with an appropriate b for $G^L_t$ in real applications. In contrast, despite the fact that there is no tuning parameter involved, our $G_t$ performs well across different OC scenarios, and its detection delays are always close to those of $G^L_t$ with the best choice of b. This makes our $G_t$ particularly appealing in many real-world applications.
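The text states only that the control limits are obtained through Monte Carlo simulation. One common way to do this, sketched below under assumptions that are not from the paper, is to simulate many IC paths of the global statistic, compute the run length of each path for a candidate limit h, and bisect on h until the average run length matches the target ARL₀. The function names and the bisection search are illustrative choices; runs that never signal are censored at the simulated horizon, so the horizon should be much longer than the target ARL₀.

```python
import numpy as np

def run_lengths(paths, h):
    """First time (1-based) each simulated path exceeds h; censored at the horizon.

    paths: array of shape (R, T) holding the global statistic along R IC runs.
    """
    exceed = paths > h
    first = exceed.argmax(axis=1) + 1
    first[~exceed.any(axis=1)] = paths.shape[1]   # no alarm within the horizon
    return first

def calibrate_limit(paths, target_arl0, tol=1e-3):
    """Bisection for the smallest h whose simulated average run length reaches the target."""
    lo, hi = float(paths.min()), float(paths.max())
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if run_lengths(paths, mid).mean() < target_arl0:
            lo = mid        # limit too low: alarms come too early
        else:
            hi = mid        # limit high enough: try to lower it
    return hi
```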

3.3 Unknown prechange and postchange distributions

In the previous two examples, the prechange distributions for all data streams are assumed to be completely known and are specified by some particular parametric distribution. In some applications, it might not be easy to identify the appropriate distributions for all data streams. Therefore, in this example, we consider the setting where both the prechange and postchange distributions are unknown. Again, to obtain the specific form of our proposed global monitoring statistic $G_t$ in this setting, we need to find the appropriate local monitoring statistic $W_{i,t}$. To deal with the unknown prechange distribution, a nonparametric monitoring statistic should be used. To deal with the unknown postchange distribution, we need a nonparametric monitoring statistic that can detect any arbitrary distributional changes. In the literature, to deal with the unknown prechange distribution, many nonparametric monitoring statistics assume that a large amount of IC reference data generated by the prechange distribution is available so that certain characteristics of the prechange distribution can be well estimated. However, in order for the effect of using estimates instead of the true values on the ARL₀ to be negligible, a substantial amount of IC reference data is usually required. In many real-world applications, it can be very challenging to have such data. Therefore, to find a good candidate for our $W_{i,t}$, we only focus on the nonparametric monitoring statistics that have the self-starting feature.

There are a few nonparametric self-starting monitoring statistics in the literature that can detect any arbitrary distributional changes. For example, Zou and Tsung¹³ proposed an EWMA statistic based on a powerful goodness-of-fit test. However, according to the simulation studies conducted in Ross and Adams,¹⁴ this EWMA statistic is only sensitive in detecting scale increases and is not as powerful as its competitors in detecting other types of distributional changes including location shifts. Ross and Adams¹⁴ further proposed two monitoring statistics based on the change-point detection (CPD) framework. Their proposed CPD statistics are shown to have better overall performance than Zou and Tsung's EWMA statistics for detecting different distributional changes. However, like most CPD statistics, the computation of their proposed statistics is very intensive, which makes them very challenging to implement for monitoring high-dimensional data. Recently, Li¹⁵ proposed a nonparametric self-starting CUSUM statistic that can detect any arbitrary distributional changes. On the basis of the simulation studies in Li,¹⁵ the proposed monitoring statistic not only is computationally more efficient than Ross and Adams's CPD statistics but also has better overall detection power than those CPD statistics. Therefore, in the following, we use the CUSUM statistic proposed in Li¹⁵ as our local monitoring statistic $W_{i,t}$.

TABLE 4 The ARL₁ comparison of the monitoring schemes based on $G_t$ and $G^L_t$

| m | $m_1$ | $G_t$ | $G^L_t$ ($b_1$) | $G^L_t$ ($b_2$) | $G^L_t$ ($b_3$) |
|---|---|---|---|---|---|
| 100 | 1 | 71.05 (37.20) | 120.08 (62.31) | 88.54 (45.55) | 70.94 (36.97) |
| | 3 | 41.05 (17.58) | 55.64 (25.85) | 44.01 (19.25) | 40.89 (17.21) |
| | 5 | 31.37 (12.67) | 37.93 (16.37) | 31.89 (13.03) | 32.13 (12.57) |
| | 8 | 24.51 (9.27) | 26.89 (10.98) | 24.25 (9.15) | 26.24 (9.69) |
| | 10 | 21.54 (8.08) | 22.86 (9.24) | 21.10 (7.84) | 23.86 (8.73) |
| | 20 | 14.43 (4.77) | 13.49 (4.85) | 14.25 (4.62) | 18.02 (5.83) |
| | 50 | 8.17 (2.26) | 6.98 (2.13) | 8.59 (2.44) | 12.41 (3.60) |
| | 80 | 6.17 (1.58) | 5.20 (1.40) | 6.85 (1.81) | 10.39 (2.87) |
| | 100 | 5.40 (1.26) | 4.51 (1.14) | 6.16 (1.53) | 9.52 (2.53) |
| 1000 | 1 | 91.72 (43.42) | 229.39 (126.38) | 161.70 (83.56) | 106.88 (51.26) |
| | 3 | 56.75 (22.74) | 114.71 (56.63) | 80.57 (35.88) | 59.57 (23.93) |
| | 5 | 46.40 (17.00) | 82.56 (37.59) | 59.86 (24.22) | 47.45 (16.85) |
| | 8 | 37.48 (13.18) | 59.71 (25.92) | 44.29 (16.67) | 37.83 (12.81) |
| | 10 | 34.05 (11.68) | 50.09 (20.84) | 38.23 (13.94) | 33.98 (11.23) |
| | 20 | 24.57 (7.74) | 30.05 (11.50) | 25.26 (8.34) | 25.19 (7.62) |
| | 50 | 15.36 (4.15) | 15.04 (4.95) | 14.81 (4.08) | 17.24 (4.50) |
| | 80 | 11.77 (3.06) | 10.55 (3.30) | 11.36 (2.96) | 14.35 (3.55) |
| | 100 | 10.42 (2.67) | 9.15 (2.78) | 10.15 (2.65) | 13.17 (3.30) |
| | 150 | 8.15 (1.96) | 6.86 (1.94) | 8.19 (1.98) | 11.26 (2.61) |
| | 200 | 6.94 (1.60) | 5.74 (1.53) | 7.12 (1.68) | 10.14 (2.31) |
| | 300 | 5.49 (1.18) | 4.48 (1.08) | 5.90 (1.33) | 8.74 (2.00) |
| | 400 | 4.70 (0.97) | 3.83 (0.89) | 5.20 (1.12) | 7.91 (1.75) |
| | 500 | 4.19 (0.84) | 3.43 (0.75) | 4.72 (1.00) | 7.28 (1.59) |
| | 600 | 3.78 (0.76) | 3.12 (0.69) | 4.33 (0.91) | 6.77 (1.48) |
| | 700 | 3.50 (0.68) | 2.88 (0.60) | 4.09 (0.84) | 6.45 (1.38) |
| | 800 | 3.28 (0.66) | 2.71 (0.59) | 3.85 (0.84) | 6.13 (1.34) |
| | 900 | 3.10 (0.58) | 2.61 (0.55) | 3.70 (0.75) | 5.90 (1.25) |
| | 1000 | 2.95 (0.52) | 2.47 (0.54) | 3.54 (0.74) | 5.67 (1.25) |

Note. The standard deviations of the run lengths from the 2500 simulations are reported in parentheses.

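Assume that, for each data stream, there are n IC reference data, denoted by $X_{i,-n+1}, \ldots, X_{i,0}$, $i = 1, \ldots, m$. At time t ≥ 1, for the ith data stream, we partition the real line into the following d left-to-right regions:

$$A^{(1)}_{i,t,1} = (-\infty, \hat q^{(1)}_{i,t,1}], \quad A^{(1)}_{i,t,2} = (\hat q^{(1)}_{i,t,1}, \hat q^{(1)}_{i,t,2}], \quad \ldots, \quad A^{(1)}_{i,t,d} = (\hat q^{(1)}_{i,t,d-1}, \infty),$$

and the following d center-outward regions:

$$A^{(2)}_{i,t,1} = (\hat q^{(2)}_{i,t,d-1}, \hat q^{(2)}_{i,t,d+1}],$$

$$A^{(2)}_{i,t,2} = (\hat q^{(2)}_{i,t,d-2}, \hat q^{(2)}_{i,t,d-1}] \cup (\hat q^{(2)}_{i,t,d+1}, \hat q^{(2)}_{i,t,d+2}],$$

$$\ldots$$

$$A^{(2)}_{i,t,d} = (-\infty, \hat q^{(2)}_{i,t,1}] \cup (\hat q^{(2)}_{i,t,2d-1}, \infty),$$

where $\hat q^{(1)}_{i,t,j}$ ($j = 1, \ldots, d-1$) and $\hat q^{(2)}_{i,t,k}$ ($k = 1, \ldots, 2d-1$) are the $(j/d)$th and $(k/(2d))$th sample quantiles, respectively, from $X_{i,-n+1}, \ldots, X_{i,0}, X_{i,1}, \ldots, X_{i,t-1}$. For $j = 1, \ldots, d$ and $k = 1, 2$, define

$$Y^{(k)}_{i,t,j} = I\left(X_{i,t} \in A^{(k)}_{i,t,j}\right),$$

and

$$Z^{(1)}_{i,t,j} = \sum_{l=1}^{j} Y^{(1)}_{i,t,l}, \qquad Z^{(2)}_{i,t,j} = \sum_{l=1}^{j} Y^{(2)}_{i,t,l}.$$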
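To make the two partitions concrete, the following sketch computes the indicator vectors Y and their cumulative sums Z for a single new observation, given the history of one stream. The function name is hypothetical, and the use of NumPy's default (linearly interpolated) sample quantiles is an assumption; the paper does not specify the quantile convention.

```python
import numpy as np

def region_indicators(history, x, d):
    """Return (Y1, Y2, Z1, Z2) for observation x given past data of one stream.

    Y1 uses the d left-to-right regions, Y2 the d center-outward regions.
    """
    history = np.asarray(history, dtype=float)
    q1 = np.quantile(history, np.arange(1, d) / d)               # d - 1 cut points
    q2 = np.quantile(history, np.arange(1, 2 * d) / (2 * d))     # 2d - 1 cut points
    # Intervals are of the form (a, b], so count the cut points strictly below x.
    j1 = int(np.searchsorted(q1, x, side="left"))                # 0-based region
    r = int(np.searchsorted(q2, x, side="left"))                 # fine cell 0..2d-1
    # Fine cells d-1 and d form the central region; cells 0 and 2d-1 the outermost.
    j2 = d - r - 1 if r < d else r - d
    y1 = np.zeros(d)
    y1[j1] = 1.0
    y2 = np.zeros(d)
    y2[j2] = 1.0
    return y1, y2, np.cumsum(y1), np.cumsum(y2)
```

An observation near the median falls in a middle left-to-right region but in the first center-outward region, while an extreme observation on either side falls in the last center-outward region.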

For $k_1, k_2 = 1, 2$, we calculate

$$\hat S^{(k_1,k_2)}_{i,t} = \max\left(0,\; \hat S^{(k_1,k_2)}_{i,t-1} + \frac{1}{d}\sum_{j=1}^{d-1} \frac{d^2}{j(d-j)} \left\{ Z^{(k_1)}_{i,t,j} \log\left(\frac{\sum_{l=1}^{j} \hat p^{(k_1,k_2)}_{i,t,l}}{j/d}\right) + \left(1 - Z^{(k_1)}_{i,t,j}\right) \log\left(\frac{1 - \sum_{l=1}^{j} \hat p^{(k_1,k_2)}_{i,t,l}}{1 - j/d}\right) \right\}\right),$$

where $\hat p^{(k_1,k_2)}_{i,t,l}$ is defined by

$$\hat p^{(k_1,k_2)}_{i,t,l} = \frac{\alpha^{(k_2)}_{l} + N^{(k_1,k_2)}_{i,t,l}}{\sum_{j=1}^{d} \alpha^{(k_2)}_{j} + N^{(k_1,k_2)}_{i,t}},$$

and both $N^{(k_1,k_2)}_{i,t}$ and $N^{(k_1,k_2)}_{i,t,l}$ are calculated recursively by

$$N^{(k_1,k_2)}_{i,t} = \begin{cases} N^{(k_1,k_2)}_{i,t-1} + 1, & \text{if } \hat S^{(k_1,k_2)}_{i,t-1} > 0, \\ 0, & \text{if } \hat S^{(k_1,k_2)}_{i,t-1} = 0, \end{cases}$$

$$N^{(k_1,k_2)}_{i,t,l} = \begin{cases} N^{(k_1,k_2)}_{i,t-1,l} + Y^{(k_1)}_{i,t-1,l}, & \text{if } \hat S^{(k_1,k_2)}_{i,t-1} > 0, \\ 0, & \text{if } \hat S^{(k_1,k_2)}_{i,t-1} = 0. \end{cases}$$
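One update step of a single component statistic can be sketched as follows. This follows our reading of the recursion above (a weighted log-likelihood-ratio increment with weights d²/(j(d − j)) and a prior-smoothed estimate of the region probabilities); the function name and argument names are hypothetical, and the prior constants must be supplied by the user since their values are those suggested in Li¹⁵ and are not reproduced here.

```python
import numpy as np

def update_statistic(S_prev, N_prev, N_cells_prev, Y_prev, Z_now, alpha):
    """One step of the self-starting CUSUM for a fixed pair (k1, k2).

    S_prev, N_prev, N_cells_prev: statistic and counters at time t-1.
    Y_prev: region indicators of X_{t-1}; Z_now: cumulative indicators of X_t.
    alpha: positive prior constants (length d), supplied by the user.
    """
    alpha = np.asarray(alpha, dtype=float)
    d = alpha.size
    if S_prev > 0:
        N = N_prev + 1
        N_cells = np.asarray(N_cells_prev, dtype=float) + np.asarray(Y_prev, dtype=float)
    else:                       # statistic was reset, so restart the counters
        N = 0
        N_cells = np.zeros(d)
    p_hat = (alpha + N_cells) / (alpha.sum() + N)
    P = np.cumsum(p_hat)[: d - 1]             # estimated cumulative probabilities
    j = np.arange(1, d)
    F0 = j / d                                # IC cumulative probabilities
    weights = d ** 2 / (j * (d - j))
    z = np.asarray(Z_now, dtype=float)[: d - 1]
    terms = z * np.log(P / F0) + (1.0 - z) * np.log((1.0 - P) / (1.0 - F0))
    increment = float(np.sum(weights * terms)) / d
    return max(0.0, S_prev + increment), N, N_cells
```

With a uniform prior and freshly reset counters, the estimated probabilities coincide with the IC ones and the increment is zero; a prior that puts more mass on the upper regions yields a positive increment when the new observation falls in an upper region.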

The constants $\{\alpha^{(k_2)}_{1}, \ldots, \alpha^{(k_2)}_{d}\}$ ($k_2 = 1, 2$) serve as the parameters of a prior distribution and are chosen as suggested in Li.¹⁵ In particular, when using $\alpha^{(1)}_{j}$ in $\hat S^{(1,1)}_{i,t}$, the prior indicates a positive location shift; therefore, $\hat S^{(1,1)}_{i,t}$ is more powerful for detecting positive location shifts. When using $\alpha^{(2)}_{j}$ in $\hat S^{(1,2)}_{i,t}$, the prior indicates a negative location shift, so $\hat S^{(1,2)}_{i,t}$ is more powerful for detecting negative location shifts. Similarly, when using $\alpha^{(1)}_{j}$ in $\hat S^{(2,1)}_{i,t}$, the prior indicates a scale increase, so $\hat S^{(2,1)}_{i,t}$ is more powerful for detecting scale increases. When using $\alpha^{(2)}_{j}$ in $\hat S^{(2,2)}_{i,t}$, the prior indicates a scale decrease, so $\hat S^{(2,2)}_{i,t}$ is more powerful for detecting scale decreases. If we do not have any prior information about what type of changes the process might encounter, our local monitoring statistic is simply

$$\hat S_{i,t} = \max\left(\hat S^{(1,1)}_{i,t},\, \hat S^{(1,2)}_{i,t},\, \hat S^{(2,1)}_{i,t},\, \hat S^{(2,2)}_{i,t}\right), \qquad (5)$$

which is efficient for detecting any type of distributional change. Li¹⁵ shows that the above monitoring statistic is asymptotically distribution free. Following the suggestion in Li,¹⁵ we choose d = 20 and n = 40.

In Li,¹⁵ the initial values $(\hat S^{(k_1,k_2)}_{i,0}, N^{(k_1,k_2)}_{i,0}, N^{(k_1,k_2)}_{i,0,l}, Y^{(k_1)}_{i,0,l})$, $k_1, k_2 = 1, 2$, $l = 1, \ldots, d$, and $i = 1, \ldots, m$, for the above monitoring statistic $\hat S_{i,t}$ are all set at 0. To simplify the calculation of our proposed global monitoring statistic $G_t$, similarly to how we modified the local monitoring statistics in the previous two examples, we propose to set the initial values $(\hat S^{(k_1,k_2)}_{i,0}, N^{(k_1,k_2)}_{i,0}, N^{(k_1,k_2)}_{i,0,l}, Y^{(k_1)}_{i,0,l})$ at some values randomly drawn from their IC steady-state distributions. More specifically, using the distribution-free property of $\hat S_{i,t}$, we generate $10^5$ independent sequences $\{X_{k,-39}, \ldots, X_{k,0}, X_{k,1}, \ldots, X_{k,2000}\}$ ($k = 1, \ldots, 10^5$), each of which is independently drawn from $N(0, 1)$, and calculate

$$\left\{\left(\hat S^{(k_1,k_2)}_{k,2000},\, N^{(k_1,k_2)}_{k,2000},\, N^{(k_1,k_2)}_{k,2000,l},\, Y^{(k_1)}_{k,2000,l}\right)\right\}_{k=1}^{10^5},$$

using the initial value 0. Then, $\{(\hat S^{(k_1,k_2)}_{k,2000}, N^{(k_1,k_2)}_{k,2000}, N^{(k_1,k_2)}_{k,2000,l}, Y^{(k_1)}_{k,2000,l})\}_{k=1}^{10^5}$ can be used to approximate the IC steady-state distribution of $(\hat S^{(k_1,k_2)}_{i,t}, N^{(k_1,k_2)}_{i,t}, N^{(k_1,k_2)}_{i,t,l}, Y^{(k_1)}_{i,t,l})$. The initial values to calculate our modified $\hat S^*_{i,t}$ are then defined as

$$\left(\hat S^{(k_1,k_2)}_{i,0},\, N^{(k_1,k_2)}_{i,0},\, N^{(k_1,k_2)}_{i,0,l},\, Y^{(k_1)}_{i,0,l}\right) = V_i,$$

where $V_i$ is randomly drawn from $\{(\hat S^{(k_1,k_2)}_{k,2000}, N^{(k_1,k_2)}_{k,2000}, N^{(k_1,k_2)}_{k,2000,l}, Y^{(k_1)}_{k,2000,l})\}_{k=1}^{10^5}$ with replacement. The expected quantiles $q_{(i),t}$ of $\hat S^*_{i,t}$ can then be well approximated by the corresponding sample quantiles of $\{\max(\hat S^{(1,1)}_{k,2000}, \hat S^{(1,2)}_{k,2000}, \hat S^{(2,1)}_{k,2000}, \hat S^{(2,2)}_{k,2000})\}_{k=1}^{10^5}$, which we denote by $\hat q^{s}_{(i)}$. Therefore, our proposed global monitoring statistic $G_t$ is

$$G_t = \sum_{i=1}^{m} \left(\hat S^*_{(i),t} - \hat q^{s}_{(i)}\right)^2 I_{\{\hat S^*_{(i),t} > \hat q^{s}_{(i)}\}},$$

where $\hat S^*_{(1),t} \le \cdots \le \hat S^*_{(m),t}$ are the order statistics of $\hat S^*_{1,t}, \ldots, \hat S^*_{m,t}$.

• Simulation study

Using the above modified local monitoring statistics $\hat S^*_{i,t}$, again, it is difficult to implement the global monitoring statistic $G^Z_t$ proposed by Zou et al,⁵ since no closed-form formula for the cumulative distribution function of $\hat S^*_{i,t}$ is available. We can use the thresholding method proposed in Liu et al⁶ to come up with some alternative global monitoring statistics. However, it is not clear how to choose a sensible threshold. Therefore, in our simulation study, we only compare our global monitoring statistic $G_t$ defined above with two other natural competitors, $G^{\max}_t = \max_{i=1,\ldots,m} \hat S^*_{i,t}$ and $G^{\mathrm{sum}}_t = \sum_{i=1}^{m} \hat S^*_{i,t}$.

Again, in our simulation study, we consider monitoring m data streams. Since $\hat S^*_{i,t}$ is distribution free, among the m data streams, we randomly select half of the data streams to have $N(0, 1)$ as their IC distributions, one fifth of the data streams to have the t distribution with 2.5 degrees of freedom as their IC distributions, and the remaining data streams to have the lognormal distribution with parameters $\mu = 1$ and $\sigma = 0.5$ as their IC distributions. For the data generated from the t or lognormal distribution, we also standardize the data so that their IC distributions have mean 0 and standard deviation 1. Among the m data streams, the first $m_0$ data streams follow their IC distributions all the time, and the remaining $m_1 = m - m_0$ data streams will experience certain distributional changes from their IC distributions at the change-point t = 100. Since $\hat S^*_{i,t}$ is capable of detecting any type of distributional change, starting from the change-point t = 100, for the $m_1$ data streams that will experience distributional changes, we add 0.5 to the observations from the first $\lceil m_1/2 \rceil$ data streams to introduce the location change and multiply the observations from the remaining $m_1 - \lceil m_1/2 \rceil$ data streams by 1.5 to introduce the scale change. Here, $\lceil b \rceil$ is the smallest integer not less than b. Similar to the previous two examples, we consider two choices of m: m = 100 and 1000, and Table 6 lists the corresponding choices of $m_1$ for these two choices of m. The desired ARL₀ for the $G_t$-, $G^{\max}_t$-, and $G^{\mathrm{sum}}_t$-based monitoring schemes is set at 1000. The control limits h for those monitoring schemes, which are obtained through Monte Carlo simulation, are listed in Table 5.

On the basis of those control limits, the ARL₁ (after the change point) of the $G_t$-, $G^{\max}_t$-, and $G^{\mathrm{sum}}_t$-based monitoring schemes from 2500 simulations is reported in Table 6. The standard deviations of the run lengths from the 2500 simulations are also included in parentheses. Again, the bold number in each row represents the smaller ARL₁ between $G^{\max}_t$ and $G^{\mathrm{sum}}_t$ for that particular OC scenario. As we can see from the table, $G^{\max}_t$ works best when only a few data streams are OC but does not perform well when a large number of data streams are OC. On the other hand, $G^{\mathrm{sum}}_t$ has the best performance when a large number of data streams are OC but has the worst performance when only a few data streams are OC. In contrast, our $G_t$ performs well across different OC scenarios, and if its detection delay is not the best among all three monitoring statistics, it is always very close to the best. This provides another example of the robust performance of our proposed global monitoring statistic $G_t$ for detecting different OC scenarios.
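The data-generating mechanism of this simulation study can be sketched as below. The function name and arguments are hypothetical; the sketch assumes that column k of the output holds time t = k + 1, that the OC streams are the last $m_1$ rows, and that the t and lognormal streams are standardized with their theoretical means and standard deviations.

```python
import numpy as np

def simulate_streams(m, m1, n_time, change_point=100, seed=0):
    """Generate an (m, n_time) array of observations with a change at `change_point`."""
    rng = np.random.default_rng(seed)
    n_norm, n_t = m // 2, m // 5
    family = np.array(["normal"] * n_norm + ["t"] * n_t
                      + ["lognormal"] * (m - n_norm - n_t))
    rng.shuffle(family)                       # random assignment of IC families
    x = np.empty((m, n_time))
    for i, fam in enumerate(family):
        if fam == "normal":
            x[i] = rng.standard_normal(n_time)
        elif fam == "t":
            # Var of t with 2.5 degrees of freedom is 2.5 / (2.5 - 2) = 5.
            x[i] = rng.standard_t(2.5, n_time) / np.sqrt(5.0)
        else:
            mu, sigma = 1.0, 0.5
            mean = np.exp(mu + sigma ** 2 / 2)
            sd = np.sqrt((np.exp(sigma ** 2) - 1) * np.exp(2 * mu + sigma ** 2))
            x[i] = (rng.lognormal(mu, sigma, n_time) - mean) / sd
    oc = np.arange(m - m1, m)                 # streams that go out of control
    n_loc = -(-m1 // 2)                       # ceil(m1 / 2) streams get a location shift
    start = change_point - 1                  # column index of time t = change_point
    x[oc[:n_loc], start:] += 0.5
    x[oc[n_loc:], start:] *= 1.5              # remaining OC streams get a scale change
    return x, family
```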

TABLE 5 The control limits h of the monitoring schemes based on $G_t$, $G^{\max}_t$, and $G^{\mathrm{sum}}_t$ when ARL₀ = 1000

| m | $G_t$ | $G^{\max}_t$ | $G^{\mathrm{sum}}_t$ |
|---|---|---|---|
| 100 | 144.016 | 26.908 | 524.492 |
| 1000 | 171.697 | 33.492 | 4720.355 |

TABLE 6 The ARL₁ comparison of the monitoring schemes based on $G_t$, $G^{\max}_t$, and $G^{\mathrm{sum}}_t$

| m | $m_1$ | $G_t$ | $G^{\max}_t$ | $G^{\mathrm{sum}}_t$ |
|---|---|---|---|---|
| 100 | 1 | 85.89 (52.00) | 80.51 (51.40) | 181.85 (136.38) |
| | 3 | 53.43 (25.14) | 57.68 (27.62) | 86.00 (49.69) |
| | 5 | 43.92 (16.91) | 51.91 (19.98) | 59.84 (29.17) |
| | 8 | 27.70 (9.55) | 35.90 (13.36) | 32.82 (14.25) |
| | 10 | 25.17 (8.44) | 34.07 (12.50) | 28.33 (11.58) |
| | 20 | 17.24 (4.99) | 28.52 (9.47) | 16.35 (5.64) |
| | 50 | 10.42 (2.65) | 23.45 (6.99) | 8.74 (2.55) |
| | 80 | 7.73 (1.87) | 20.61 (6.14) | 6.30 (1.77) |
| | 100 | 6.63 (1.54) | 19.31 (5.63) | 5.35 (1.45) |
| 1000 | 1 | 123.93 (118.56) | 111.90 (98.57) | 452.76 (430.52) |
| | 3 | 75.18 (32.13) | 76.72 (32.95) | 215.51 (161.58) |
| | 5 | 64.98 (25.32) | 70.05 (28.96) | 157.96 (104.68) |
| | 8 | 55.19 (20.56) | 64.05 (25.55) | 112.17 (66.04) |
| | 10 | 50.35 (17.97) | 61.10 (23.28) | 95.71 (53.16) |
| | 20 | 31.34 (9.16) | 42.46 (14.31) | 44.15 (19.04) |
| | 50 | 20.66 (5.10) | 34.72 (10.35) | 21.51 (7.33) |
| | 80 | 15.76 (3.69) | 30.78 (8.23) | 14.60 (4.50) |
| | 100 | 13.33 (3.01) | 28.14 (7.46) | 11.82 (3.47) |
| | 150 | 10.47 (2.30) | 25.59 (6.49) | 8.84 (2.48) |
| | 200 | 8.78 (1.90) | 24.14 (6.04) | 7.14 (1.95) |
| | 300 | 6.83 (1.41) | 22.23 (5.40) | 5.37 (1.41) |
| | 400 | 5.67 (1.20) | 21.00 (4.90) | 4.39 (1.17) |
| | 500 | 4.90 (0.99) | 19.87 (4.92) | 3.79 (0.98) |
| | 600 | 4.45 (0.92) | 19.30 (4.78) | 3.40 (0.91) |
| | 700 | 4.03 (0.82) | 18.81 (4.60) | 3.06 (0.80) |
| | 800 | 3.66 (0.75) | 18.23 (4.44) | 2.76 (0.75) |
| | 900 | 3.37 (0.72) | 17.70 (4.33) | 2.53 (0.71) |
| | 1000 | 3.12 (0.66) | 17.32 (4.26) | 2.35 (0.67) |

Note. The standard deviations of the run lengths from the 2500 simulations are reported in parentheses.

TABLE 7 The computational times (in minutes) of the monitoring schemes based on $G_t$, $G^Z_t$, and $G^L_t$ in calculating their ARL₁s for different choices of $m_1$ in the simulation study from Section 3.1

| m | $G_t$ | $G^Z_t$ | $G^L_t$ ($b_1$) | $G^L_t$ ($b_2$) | $G^L_t$ ($b_3$) |
|---|---|---|---|---|---|
| 100 | 0.24 | 2.34 | 0.21 | 0.20 | 0.19 |
| 1000 | 2.51 | 33.27 | 2.71 | 2.11 | 1.89 |

4 CONCLUDING REMARKS

In this paper, we introduce a general class of global monitoring statistics for high-dimensional data streams. Our proposed global monitoring statistics are easy to calculate, which makes them suitable for monitoring high-dimensional data streams. To show the computational efficiency of our proposed global monitoring statistics, we report in Table 7 the computational times of the five monitoring schemes in calculating their ARL₁s for different choices of $m_1$ in the simulation study from Section 3.1 on a Dell computer with an Intel Core i7-6700HQ processor. As we can see from Table 7, the monitoring schemes based on the thresholding-based $G^L_t$ are the most efficient in terms of computation, since $G^L_t$ only requires the comparison of the local monitoring statistics $W_{i,t}$ with some prespecified threshold. For our proposed monitoring scheme, after we obtain the estimates of the $q_{(i),t}$ offline beforehand, the total online computational effort in calculating our proposed $G_t$ is the same as calculating $\sum_{i=1}^{m} (W_{(i),t} - a_i)^2 I_{\{W_{(i),t} > a_i\}}$ with all the $a_i$ given. Therefore, our $G_t$ only needs to order the $W_{i,t}$, which requires $O(m \log m)$ computations. Although this is more time-consuming than simply comparing $W_{i,t}$ to a threshold, from Table 7, we can see that the computational times of our $G_t$-based monitoring scheme are comparable with those from the $G^L_t$-based monitoring schemes. In contrast, the computational times of the monitoring scheme based on $G^Z_t$ are almost 10 times the computational times of our proposed monitoring scheme. This is due to the extra computation needed to convert $W_{i,t}$ to $U_{i,t}$ in calculating $G^Z_t$.

In our simulation studies reported in Section 3, we use the sample quantiles from a random sample of size 100 000 from the IC distribution of the $W_{i,t}$ to approximate the $q_{(i),t}$. In the following, we report new results for the simulation study from Section 3.1 when we use random samples of smaller sizes to estimate the $q_{(i),t}$. The size of the random IC $W_{i,t}$ sample used to estimate the $q_{(i),t}$ is represented by B in Table 8. As mentioned earlier, our proposed global monitoring statistic is equivalent to $\sum_{i=1}^{m} (W_{(i),t} - a_i)^2 I_{\{W_{(i),t} > a_i\}}$ with $a_i$ being given by the estimate of $q_{(i),t}$, which we obtain offline beforehand. The control limit h of our proposed monitoring scheme is then determined based on those given $a_i$ values. With different IC $W_{i,t}$ sample sizes, the values of the $a_i$ specified in the above monitoring statistic will be different, and so will the control limit h of our proposed monitoring scheme. The h value corresponding to a specific B value in Table 8 is the control limit we obtain for ARL₀ = 1000 when the $a_i$ are given by the estimates of the $q_{(i),t}$ from a random IC $W_{i,t}$ sample of size B. In Table 8, the ARL when $m_1 = 0$ indicates the simulated ARL₀. As shown in the table, the simulated ARL₀s of our proposed monitoring schemes with different IC $W_{i,t}$ sample sizes are all close to the nominal level ARL₀ = 1000. As we can also see from the table, for $m_1 \ne 0$, the reported ARL₁s for our proposed monitoring schemes with different IC $W_{i,t}$ sample sizes are also very similar. From those results, we can see that the size of the random IC $W_{i,t}$ sample used to estimate the $q_{(i),t}$ has no significant impact on the performance of our proposed monitoring scheme.

As mentioned in Section 1, there are two types of monitoring schemes. For the second type of monitoring scheme, after our proposed monitoring scheme based on $G_t$ triggers an alarm, we also need to identify which data stream is experiencing abnormal activities. For this purpose, we can use the local monitoring statistics $W_{i,t}$ to determine which data streams are OC as follows: If $W_{i,t}$ is larger than some control limit, say $c_h$, we conclude that the ith data stream is OC; otherwise, we conclude that the ith data stream is IC. We refer to Li¹ for more details on how to choose the control limit $c_h$ in this situation.
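The post-alarm identification step amounts to a simple comparison of each local statistic with the limit $c_h$. A minimal sketch, with a hypothetical function name and with $c_h$ supplied by the user (its choice is discussed in Li¹), is:

```python
import numpy as np

def flag_oc_streams(w, c_h):
    """Return the 0-based indices of the streams whose local statistic exceeds c_h."""
    return np.flatnonzero(np.asarray(w, dtype=float) > c_h)
```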

Although we only consider three types of local monitoring statistics as examples in the paper, our proposed global monitoring statistic can work with any local monitoring statistic that is efficient for monitoring a single data stream. This

TABLE 8 The ARL results of our

proposed monitoring scheme based on

G_twhen the q_(i),tare estimated by the

sample quantiles from random samples

of different sizes

B = 100000

h = 20.798

1008.94 (1014.88) 994.45 (1021.60)

64.44 (32.88)

36.20 (14.95)

27.08 (10.53)

20.33 (7.26)

17.42 (6.11)

10.64 (3.67)

4.88 (1.57)

3.25 (0.97)

2.72 (0.77)

B = 100000

h = 25.041

B = 10000

h = 19.812

B = 5000

h = 20.341

1007.48 (1010.02) 1003.80 (993.95)

B = 1000

h = 27.496

m

m₁

0

1

3

5

8

10

20

50

80

100

65.28 (32.54)

36.43 (14.64)

27.16 (10.24)

20.29 (7.16)

17.14 (6.30)

10.46 (3.58)

4.82 (1.60)

3.23 (0.96)

2.65 (0.76)

B = 50000

66.57 (33.63)

36.65 (14.49)

26.72 (10.14)

20.04 (7.45)

17.38 (6.29)

10.24 (3.64)

4.72 (1.49)

3.16 (0.96)

2.62 (0.77)

B = 10000

62.58 (31.44)

36.20 (14.64)

27.26 (10.28)

21.01 (7.22)

18.54 (6.36)

11.54 (3.72)

5.54 (1.73)

3.70 (1.09)

3.07 (0.86)

B = 5000

h = 30.055

996.05 (1005.91)

79.94 (36.80)

51.60 (18.31)

41.59 (13.95)

34.72 (10.47)

30.88 (9.51)

22.04 (6.37)

12.25 (3.61)

8.61 (2.58)

7.17 (2.13)

5.14 (1.54)

3.97 (1.12)

2.79 (0.76)

2.21 (0.60)

1.86 (0.46)

1.67 (0.48)

1.47 (0.50)

1.26 (0.44)

1.11 (0.31)

1.03 (0.16)

100

m

m₁

0

1

3

5

8

10

20

50

80

100

h = 24.608

h = 22.512

1006.42 (1044.24) 1004.02 (1008.25) 994.39 (1024.27)

81.56 (37.40)

51.83 (18.91)

41.89 (13.58)

33.90 (10.30)

30.19 (8.96)

21.04 (6.24)

11.95 (3.67)

8.47 (2.54)

7.05 (2.13)

5.04 (1.43)

3.96 (1.13)

2.78 (0.80)

2.21 (0.60)

1.88 (0.46)

1.67 (0.48)

1.46 (0.50)

1.26 (0.44)

1.09 (0.29)

86.50 (39.92)

53.07 (19.28)

42.15 (13.88)

33.40 (10.55)

30.18 (9.44)

21.08 (6.32)

11.57 (3.47)

8.19 (2.59)

6.93 (2.06)

4.97 (1.47)

3.84 (1.13)

2.75 (0.78)

2.15 (0.58)

1.82 (0.48)

1.61 (0.49)

1.41 (0.49)

1.20 (0.40)

1.09 (0.29)

1.02 (0.14)

81.69 (37.52)

51.59 (18.65)

42.27 (13.91)

33.75 (10.26)

30.25 (9.45)

20.97 (6.10)

11.88 (3.53)

8.37 (2.47)

7.00 (2.10)

5.02 (1.48)

3.92 (1.10)

2.76 (0.77)

2.18 (0.59)

1.83 (0.46)

1.64 (0.49)

1.43 (0.50)

1.22 (0.42)

1.07 (0.26)

1.02 (0.14)

1000 150

200

300

400

500

600

700

800

900

1000 1.03 (0.17)


flexibility makes our proposed global monitoring statistics suitable for many different real-world applications. The simulation studies in the three examples we consider further show that our proposed global monitoring statistic performs well under a variety of OC scenarios and has the best overall detection power compared with other existing global monitoring statistics.

ACKNOWLEDGEMENTS

The author thanks the editor and two anonymous referees for their constructive comments and suggestions, which greatly

improved the quality of the paper.

ORCID

Jun Li

https://orcid.org/0000-0002-0323-8255

REFERENCES

1. Li J. A two-stage online monitoring procedure for high-dimensional data streams. J Qual Technol. 2018. https://doi.org/10.1080/00224065.2018.1507562

2. Tartakovsky AG, Rozovskii BL, Blažek RB, Kim H. Detection of intrusions in information systems by sequential change-point methods (with discussion). Stat Method. 2006;3:252-340.

3. Mei Y. Efficient scalable schemes for monitoring a large number of data streams. Biometrika. 2010;97:419-433.

4. Xie Y, Siegmund D. Sequential multi-sensor change-point detection. Ann Stat. 2013;41:670-692.

5. Zou C, Jiang W, Wang Z, Zi X. An efficient on-line monitoring method for high-dimensional data streams. Technometrics. 2015;57:374-387.

6. Liu K, Zhang R, Mei Y. Scalable SUM-shrinkage schemes for distributed monitoring large-scale data streams. Stat Sin. 2019;29:1-22.

7. Qiu P. Introduction to Statistical Process Control. Boca Raton, FL: Chapman & Hall/CRC; 2014.

8. Zhang J. Powerful goodness-of-fit tests based on the likelihood ratio. J R Stat Soc Ser B. 2002;64:281-294.

9. Grigg OA, Spiegelhalter DJ. An empirical approximation to the null unbounded steady-state distribution of the cumulative sum statistic. Technometrics. 2008;50:501-511.

10. Sparks RS. CUSUM charts for signalling varying location shifts. J Qual Technol. 2000;32:157-171.

11. Han D, Tsung F. A reference-free cuscore chart for dynamic mean change detection and a unified framework for charting performance comparison. J Am Stat Assoc. 2006;101:368-386.

12. Lorden G, Pollak M. Sequential change-point detection procedures that are nearly optimal and computationally simple. Seq Anal. 2008;27:476-512.

13. Zou C, Tsung F. Likelihood ratio-based distribution-free EWMA control charts. J Qual Technol. 2010;42:174-196.

14. Ross GJ, Adams NM. Two nonparametric control charts for detecting arbitrary distribution changes. J Qual Technol. 2012;44:102-116.

15. Li J. Nonparametric adaptive CUSUM chart for detecting arbitrary distributional changes. Submitted. 2017. (arXiv:1712.05072).

AUTHOR BIOGRAPHY

Jun Li received her PhD in Statistics from the Department of Statistics and Biostatistics at Rutgers University

in 2006. Since then, she has been with the Department of Statistics at the University of California, Riverside, as

an assistant professor (2006-2012) and an associate professor (2012 to present). She is a lifetime member of the

American Statistical Association and the Institute of Mathematical Statistics. Her current research interests include

statistical process control and nonparametric multivariate analysis.

How to cite this article: Li J. Efficient global monitoring statistics for high-dimensional data. Qual Reliab Engng

Int. 2019; 1–15. https://doi.org/10.1002/qre.2557
