Interdisciplinary Education and Psychology
measurement error matching multilevel structural equation modeling
Qiu Wang, Richard T. Houang and Kimberly S. Maier
https://doi.org/10.31532/InterdiscipEducPsychol.1.1.009 17 Oct 2017
Based upon a two-level structural equation model, this simulation study compares latent variable matching and matching through manifest variables. Selection bias is simulated on the latent variable and/or manifest variables along with different magnitudes of reliability. Besides factor score matching and Mahalanobis distance matching, we examined two types of propensity score matching: a “naïve” propensity score derived from manifest covariates, and a “true” propensity score derived from the latent factor. Results suggest that 1) Mahalanobis distance matching works less effectively than propensity/factor score matching; 2) propensity score and factor score matching perform best if both treatment and control groups have high reliability; 3) matching through manifest variables is optimal and preferable if the latent composite variable is under-representative; 4) when the latent variable represents the manifest variables well, latent variable matching is preferable and more efficient than matching on the respective manifest variables; and 5) matching options such as caliper matching and replacement matching interact with the magnitude of reliability, and matching with replacement on a smaller caliper performs best for more reliable measures.
Measurement error, matching, multilevel, structural equation modeling
There is an increasing need for studies on how educational interventions affect student performance (Raudenbush & Sadoff, 2008; Spybrook, 2007). A study’s ability to assess the efficacy and efficiency of educational interventions depends on its hypothesis development, experimental design, controlled experimental trials, identification of the population of interest, and implementation (Sloane, 2008). The most challenging task is to obtain valid measurements of interventions in order to assess the effect of intervention (Raudenbush & Sadoff, 2008; Sloane, 2008). Measures of the classroom interventions that students receive can be subject to measurement errors (Raudenbush & Sadoff, 2008) in data collection through large-scale surveys and observational studies (Cochran, 1963, 1965, 1969, 1972; Rosenbaum, 2002).
Structural equation modeling (SEM; Bollen, 1989) incorporates latent variables to account for measurement errors in manifest variables. Kaplan (1999) applied propensity score stratification to the Multiple Indicators Multiple Causes (MIMIC) model to deal with measurement error on the dependent variable for group difference estimation. However, propensity score matching was not conducted. Furthermore, the propensity score was estimated through observed covariates that may have measurement error. This Monte Carlo study uses an SEM framework (Jöreskog & Sörbom, 1996) to examine the effectiveness of matching through the latent variable and through manifest variables with measurement errors.
Measurement Errors
Measurement errors (Cochran, 1968b) of observed (manifest) variables have been well studied in linear regression (Fuller, 1987), logistic regression (Carroll, Ruppert, Stefanski & Crainiceanu, 2006; Spiegelman, Schneeweiss & McDermott, 1997), and survey sampling (Biemer et al., 2004; Fuller, 1995; Hansen, Hurwitz & Bershad, 1961; Mahalanobis, 1946); however, few studies have been conducted in matching since Cochran and Rubin (1973) reviewed the effect of measurement errors on bias reduction (Rubin, 1973a).
Measurement issues can have a serious impact on the findings of a study because they reduce the efficiency of adjustment (Cochran, 1968a; Cochran, 1965). While the literature is replete with guidelines on how to use propensity score analysis (Pan & Bai, 2015) to estimate treatment effects, there is little research on how to adjust for measurement errors when examining bias reduction on covariates after matching (Jakubowski, 2015). Most researchers simply estimate propensity scores by treating the covariates as perfect measures (Cochran, 1957; Cochran & Rubin, 1973). Recent propensity score analyses mainly examine how covariates with measurement errors affect treatment effect estimation through bias correction or imputation (Battistin & Chesher, 2014; Webb-Vargas, Rudolph, Lenis, Murakami & Stuart, 2015), or inverse probability weighting (McCaffrey, Lockwood, & Setodji, 2011; Steiner, Cook & Shadish, 2011). Propensity score matching was rarely conducted in these studies for bias reduction analysis.
Measurement Errors and Bias
Bias occurs when the estimate (testing score or observed treatment effect) differs from the value being estimated (true score or true effect) through sampling (Särndal, Swensson & Wretman, 2003). Bias due to measurement errors (Fuller, 1987) can occur in outcome Y and/or covariates X (Carroll et al., 2006; Cochran & Rubin, 1973). Thus, the outcome in an intervention effect model is a sum of three parts (Wooldridge, 2002): 1) the effect of intervention variety, 2) the effect of initial bias due to covariates X, and 3) the random measurement error. If participants are not randomly assigned to intervention groups (Cochran, 1969, 1972), then a study often has problems of selection bias^{1} (Heckman, 1979), meaning that the treatment and control groups are initially unbalanced in terms of covariates X. Selection bias attenuates the treatment effect estimate and can lead to misleading conclusions (Campbell & Stanley, 1966).
Bias Reduction and Propensity Score Matching
Because the “golden rule” of randomization is generally broken in observational studies (Cochran, 1963, 1965, 1969, 1972; Rosenbaum, 2002), bias reduction techniques have been developed for causal inference (e.g., Rubin, 1974, 1978). These techniques include Cochran's three approaches of pairing, balancing, and stratification (Cochran, 1953), post-hoc matching (Abadie & Imbens, 2006, 2007; Rubin, 1973a,b, 1976a,b, 1979, 1980), analysis of covariance (e.g., Cochran, 1957, 1969), inverse propensity score weighting (Angrist & Pischke, 2009; Horvitz & Thompson, 1952; McCaffrey & Hamilton, 2007), statistical modeling with adjustment (e.g., WLS estimation in the HLM framework; see Hong and Raudenbush, 2006), and doubly robust estimation using regression adjustment and inverse propensity score weighting (Kang & Schafer, 2007). For the most recent developments, see Pan and Bai (2015).
Post-hoc matching depends on the summary measure, a functional composite of covariates (Rubin, 1985). The most commonly used composites in matching include the Mahalanobis distance (e.g., Rubin, 1980) and the propensity score (Rosenbaum & Rubin, 1983). Propensity score matching is a post-hoc bias reduction method that has been commonly used on observational data to approximate individual-randomized trials when studying a treatment effect of interest (Cochran, 1953, 1968a; Cochran & Rubin, 1973; Rosenbaum & Rubin, 1983; Rubin, 1973a,b). It is the most commonly used bias reduction technique of post-hoc sampling (Cochran, 1953; Rubin, 1973a,b, 1976a,b, 1979, 1980) in causal inference and program evaluation. A propensity score (P) is the conditional probability that an individual belongs to the treatment group (Rosenbaum & Rubin, 1983). It is generally estimated using a logit model.
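A standard form of the logit model for the propensity score is sketched below. The notation is an assumption made for illustration: D_i is the treatment indicator (1 = treatment, 0 = control), X_i is the vector of observed covariates for individual i, and β_0 and β are the intercept and slope coefficients.

```latex
P_i = \Pr(D_i = 1 \mid X_i), \qquad
\operatorname{logit}(P_i) = \ln\!\left(\frac{P_i}{1 - P_i}\right)
  = \beta_0 + \boldsymbol{\beta}^{\top} X_i
```

Solving for P_i gives P_i = 1 / (1 + exp(−(β_0 + β'X_i))), so the estimated scores always lie strictly between 0 and 1.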
Attenuated Bias Reduction Rate Due to Measurement Errors
Measurement errors attenuate the regression coefficient β of covariates X on outcome Y (Fuller, 1987; Jöreskog & Sörbom, 1996). Let
Cochran (Rubin, 2006, p. 20) found that under a simple linear regression, the measurement error on x attenuates the bias reduction rate by a ratio of 1/(1+h), where h “is the ratio of the variance of the errors of measurement to the variance of the correct measurements” (Cochran, 1968b, p. 295). In other words, 1/(1+h) is equal to the reliability r, the ratio of the variance of the correct measurements to the total observed variance.
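The attenuation factor can be checked numerically. The following is a minimal sketch, assuming a standard-normal true covariate, independent normal measurement error, and a single-predictor linear regression; the function names and the chosen values (h = 3, n = 20,000) are illustrative and not taken from the study.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y on a single predictor x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(h, beta=1.0, n=20000, seed=0):
    """Return (slope on the true x, slope on the error-prone x) for error ratio h."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # true-score variance = 1
    y = [beta * x + rng.gauss(0.0, 1.0) for x in true_x]
    # Observed covariate = true covariate + error with variance h (hypothetical setup).
    observed_x = [x + rng.gauss(0.0, h ** 0.5) for x in true_x]
    return ols_slope(true_x, y), ols_slope(observed_x, y)


h = 3.0                          # error variance is three times the true variance
reliability = 1.0 / (1.0 + h)    # r = 1 / (1 + h)
slope_true, slope_observed = simulate_attenuation(h)
print(reliability, slope_observed / slope_true)
```

With h = 3 the reliability is 0.25, and the ratio of the two slopes should land close to that value, up to sampling noise.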
Measurement-Error-Adjusted Propensity Scores
When the true covariates (
The first method assumes that the true covariates have not been observed and the naïve parameter estimates are obtained using the observed covariates. An approximately consistent estimator of the parameters is provided through a functional adjustment on the naïve estimator (for details, see Rosner, Spiegelman & Willett, 1990; Rosner, Willett & Spiegelman, 1989). The second method of adjusting for measurement errors in logistic regression is through structural modeling, in which the distribution of the true covariates is parametrically modeled (Sörbom, 1978; Jakubowski, 2015). For example, maximum-likelihood or Bayesian-approach-based SEM (Carroll et al., 2006; Lee, 2007, Chapter 9) can be used to deal with measurement errors.
This two-step adjusted method requires that X and X* have equal dimensions; however, the measurement-error-adjusted propensity scores cannot be obtained directly using this approach because the integral in the subsequent propensity score function does not have a closed-form solution (Carroll et al., 2006, p. 91). An approximate approach has been developed in Weller, Milton, Eisen and Spiegelman (2007) using a multivariate normal conditional distribution of (X*│X, H), where H represents the covariates without measurement errors.
Such adjustments, or similar ones, have been used in propensity score analyses (e.g., Battistin & Chesher, 2014; McCaffrey et al., 2011). For example, the best linear unbiased predictor (BLUP) corrects measurement error to estimate the propensity score using another error-free covariate (McCaffrey et al., 2011, p. 8). However, when a second sample with both X and X* observed is not available, one cannot estimate the measurement-error-adjusted propensity scores. In this situation, an alternative method such as SEM can be used to estimate the measurement-error-adjusted propensity scores for matching.
Structural equation modeling (Bollen, 1989; Jöreskog & Sörbom, 1996) uses the latent variable to account for measurement errors on manifest variables. Using a two-level SEM (Muthén, 1994), this study manipulates the reliability of the manifest variables to compare latent variable matching with matching through manifest variables.
Structural Equation Modeling as an Alternative
An SEM-based propensity score framework incorporates the latent variable to adjust for measurement errors on manifest variables. Propensity scores can be estimated through the following hybrid SEM model:
The first equation, a measurement model, captures the linear relationship between the latent X* and observed X in both the treatment (D = 1) and control (D = 0) groups. The second equation, a structural model, is equivalent to the latent variable propensity score model in Equation (2) of Jakubowski (2015, p. 1291). It captures the nonlinear relationship between the latent X* and a latent propensity score.
Adopting a latent variable approach circumvents the post-hoc coefficient adjustment (e.g., Weller et al., 2007) discussed above. The SEM-based propensity scores can be used in matching (Jakubowski, 2015). Note that the latent propensity score
Factor scores of the latent X*, such as academic proficiency measures and ability constructs, have been used to match individuals in order to achieve comparable groups (e.g., classical true score in Van der Linden & Hambleton, 1997). The latent construct is measured by multiple manifest items. In the commonly used item response theory, individual ability is calibrated through a set of test items with presumptive difficulty and discrimination (Lord & Novick, 1968). The calibrated ability estimate represents an examinee's academic propensity that the set of test items is designed to measure. However, matching only on latent variables may fail to remove bias due to other omitted covariates H that are free of measurement error. A composite measure such as the propensity score, which summarizes both the latent X* and covariates H, becomes necessary in matching. Wang, Maier and Houang (2017) simulated multilevel data and examined how omitted covariates attenuate the bias reduction rate in SEM-based propensity score matching. However, the issue of measurement error was not taken into account in Wang et al. (2017, pp. 72–73). Using the same multilevel SEM-based simulation (Wang, 2015; Wang et al., 2017) and the measurement-error-adjusted propensity score model of Equation (1), this study compares latent variable matching and matching through manifest variables to examine the effect of measurement error.
Longitudinal Design vs. Synthetic Cohort Design
Simulated quasi-experimental synthetic cohort design (SCD; Wiley & Wolfe, 1992) data with measurement errors are generated based on the Second International Mathematics Study (SIMS; International Association for the Evaluation of Educational Achievement, 1977). SIMS used a longitudinal design to study the effects of curriculum and classroom instruction targeted at the 8th grade (Cohort 2). Two waves of mathematics achievement data were collected, at the beginning (Time 0) and end of the school year (Time 1), respectively. Cohort 2 at Time 0 was in the control condition. After the “treatment” of one year of schooling, Cohort 2 at Time 1 data were collected to assess the schooling effect (δ_{C2T1C1T0}), defined as the average of “changes in mathematics achievement over the timespan of one school year at the particular grade level” (Wiley & Wolfe, 1992, p. 299).
In practice, Cohort 2 at Time 0 data may not be collected in a study. An alternative cohort, Cohort 1 at Time 1 (e.g., grade 7 at Year_{i}), can then serve as the “control” group to estimate the schooling effect, now denoted as (δ^_{C2T1C1T1}). This design is called the synthetic cohort design (SCD; Wiley & Wolfe, 1992), which has been widely used in aging and epidemiological studies (e.g., Heimberg, Stein, Hiripi, & Kessler, 2000; Kessler, Stein, & Berglund, 1998). The SCD was also used for cross-national comparisons of schooling (Wiley & Wolfe, 1992) in the Third International Mathematics and Science Study 1995 (TIMSS 1995). In this design, the schooling effect is determined by comparing data of two adjacent grades: 7^{th} (Cohort 1) and 8^{th} grade (Cohort 2, the focal cohort). The two cohorts are measured at the same time point (Time 1). The SCD by nature is a quasi-longitudinal design, where Cohort 1 at Time 1 data are treated as the “replacement” of Cohort 2 at Time 0 data to estimate the schooling effect (δ^_{C2T1C1T1}). The schooling effect estimation bias in SCD is
In order to create a quasi-experimental SCD, it is necessary to generate Cohort 1 at Time 1 (7th grade in year i+1) data that are not comparable with Cohort 2 at Time 0 (8th grade in year i) due to measurement errors, so that matching can be used to reduce the simulated selection bias and to decrease the estimation bias of the schooling effect.
The simulation design model is based on data collected in the United States (SIMS-USA; Wolfe, 1987), one of the seven countries that collected longitudinal data in SIMS. The final data set includes 126 regular classes and 2,296 students. The average class size is about 27. Tables 1 and 2 list the descriptive statistics of the outcome variables, covariates, and manifest variables. The selection of variables for the model is based on previous studies (Schmidt & Burstein, 1992).
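A sketch of this bias under a simple mean-difference reading of the two schooling effects. The symbols \bar{Y}_{C2T1}, \bar{Y}_{C2T0}, and \bar{Y}_{C1T1} are assumed notation for the mean achievement of the corresponding cohort at the corresponding time point; they are introduced here for illustration only.

```latex
\begin{aligned}
\delta_{C2T1C1T0} &= \bar{Y}_{C2T1} - \bar{Y}_{C2T0}, \\
\hat{\delta}_{C2T1C1T1} &= \bar{Y}_{C2T1} - \bar{Y}_{C1T1}, \\
\text{Bias} &= \hat{\delta}_{C2T1C1T1} - \delta_{C2T1C1T0}
             = \bar{Y}_{C2T0} - \bar{Y}_{C1T1}.
\end{aligned}
```

Under these assumptions the bias vanishes only when Cohort 1 at Time 1 is comparable to Cohort 2 at Time 0, so any noncomparability between these two groups carries over directly into the estimated schooling effect.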
Table 1. Level-1 Descriptive Statistics of the Final Two-level Structural Equation Model

| Variables | Label | Description | Mean |
|---|---|---|---|
| Educational | YPWANT | I want to learn more math (inverse code, 1–5^{a}) | 4.73 |
| | YPWWELL | Parents want me to do well (1–5^{a}) | 4.24 |
| | YPENC | Parents encourage me to do well in math (inverse code, 1–5^{a}) | 4.37 |
| Self | YIWANT | I want to do well in math (1–5^{a}) | 4.32 |
| | YMORMTH | Looking forward to taking more math (1–5^{a}) | 3.24 |
| | YNOMORE | Will take no more math if possible (inverse code, 1–5^{a}) | 3.73 |
| Family | YPINT | Parents are interested in helping math (inverse code, 1–5^{a}) | 3.72 |
| | YFLIKES | Father enjoys doing math (inverse code, 1–5^{a}) | 3.53 |
| | YMLIKES | Mother enjoys doing math (inverse code, 1–5^{a}) | 3.25 |
| | YFABLE | Father is able to do math homework (inverse code, 1–5^{a}) | 3.92 |
| | YMABLE | Mother is able to do math homework (inverse code, 1–5^{a}) | 3.71 |
| Math | YMIMPT | Mother thinks math is important (1–5^{a}) | 4.60 |
| | YFIMPT | Father thinks math is important (1–5^{a}) | 4.55 |
| Socioeconomic | YFEDUC | Father’s education level (1–4^{b}) | 3.38 |
| | YMEDUC | Mother’s education level (1–4^{b}) | 3.35 |
| | YFOCCN | Father’s occupation national code (1–8^{c}) | 4.26 |
| | YMOCCN | Mother’s occupation national code (1–8^{c}) | 4.11 |
| Age | XAGE | Grand-mean-centered age | 0.00 |
| Parental Help | YFAMILY | How frequently family help (1–3^{d}) | 1.75 |
| Education | EDUECPT | YMOREED: Years of education parents expected (1–4^{e}) | 2.97 |
| Homework | YMHWKT | Typical hours of math homework per week | 2.98 |

Note: ^{a} 1 = not at all like, 2 = somehow unlike, 3 = unsure, 4 = somehow like, 5 = exactly like. ^{b} 1 = little schooling, 2 = primary school, 3 = secondary school, 4 = college or university or tertiary education. ^{c} 1 = unskilled worker, 2 = semi unskilled worker, 3 = skilled worker lower, 4 = skilled worker higher, 5 = clerk sales and related lower, 6 = clerk sales and related higher, 7 = professional and managerial lower, 8 = professional and managerial higher. ^{d} 1 = never/hardly, 2 = occasionally, 3 = regularly. ^{e} 1 = up to 2 years, 2 = 2 to 5 years, 3 = 5 to 8 years, 4 = more than 8 years. N = 2,296.
Table 2. Level-2 Descriptive Statistics of the Final Two-level Structural Equation Model

| Variables | Label | Description | Mean |
|---|---|---|---|
| Teacher/Class-level covariates | | | |
| Class Size | CLASSIZE | Created from the number of students in class | 26.60 |
| Opportunity | OLDARITH | Prior OTL in Arithmetic | 7.10 |
| | OLDGEOM | Prior OTL in Geometry | 3.19 |
| | NEWALG | This year’s OTL in Algebra | 59.61 |
| | NEWGEOM | This year’s OTL in Geometry | 41.37 |
| Instruction | TPPWEEK | Number of hours of math instruction per week | 5.09 |
| School-level covariates | | | |
| Qualified Math | MTHONLY | Proportion of qualified math teachers: Sum of SSPECM and SSPECF divided by STCHS | 0.14 |

Note. N = 126.
Simulated Twolevel Structural Equation Model
The proposed two-level SEM (Muthén, 1994) is shown in Figure 1. In the level-1 equation, the posttest score is predicted by the pretest score, which is predicted by four student characteristics and five latent variables. The latent constructs and their manifest variables are listed in Table 1. In the level-2 model, the intercept of the pretest (β_{0}) is predicted by four class/school-level variables. The intercept of the posttest (α_{0}) is predicted by β_{0} and three class-level variables. The level-1 and level-2 residuals are mutually independent.
Figure 1. Two-level structural equation model
Mplus (Muthén & Muthén, 1998–2015) is used to estimate factor loadings, regression coefficients, and residual variances (see Appendix). These model-based estimates are treated as known parameter values to generate longitudinal data of Cohort 2 at Time 0 and Time 1. The population-level SES variables are manipulated using Equation (6) in the simulation descriptions below. Our data-driven approach borrows the “sampling study” metric from MacCallum, Roznowski and Necowitz (1992), who treated the observed data as the “population”, from which random samples were drawn for simulation. Our approach differs from theirs in that we created a hypothetical population from which to draw data for simulation. The SIMS-USA data were collected to represent 3,681,939 8th graders nested in 136,368 classes across the seven strata in the United States (Wolfe, 1987). The simulated pseudo-population includes 12,600 classes and 345,000 students with an average class size of 27.13.
Table 3. Four Simulations of Matching on Latent and Manifest Variables

| Simulation | C2T0: µ_{XSES} | C2T0: Latent η_{SES} | C2T0: Reliability | C2T0: ICC Pre | C2T0: ICC Post | C1T1: µ_{XSES} | C1T1: Latent η_{SES} | C1T1: Reliability | C1T1: ICC Pre | C1T1: ICC Post |
|---|---|---|---|---|---|---|---|---|---|---|
| One | µ | 0 | Low | .32 | .34 | µ + c_{1} | 0 | Low | .31 | .33 |
| Two | µ | 0 | Low | .32 | .34 | µ | 0 | High | .31 | .33 |
| Three | µ | 0 | Low | .32 | .34 | µ + c_{2} | .68 | High | .31 | .33 |
| Four | µ | 0 | High | .32 | .34 | µ + c_{2} | .68 | High | .31 | .33 |

Note: C2T0 = Cohort 2 Time 0; C1T1 = Cohort 1 Time 1. ICC: Intraclass Correlation, which varied from .023 to .322 in the most commonly used large-scale surveys on hierarchically structured data (Hedges & Hedberg, 2007).
Due to its practical importance in education studies, only the latent variable SES and its four manifest variables are manipulated to simulate selection bias. In this study, selection bias is indicated by the noncomparability between Cohort 2 at Time 0 (the treatment group) and Cohort 1 at Time 1 (the control group).
Generated Treatment Group Cohort 2 at Time 0 Data
The manifest variables of the latent variable SES are generated through a multivariate normal distribution. The latent variable SES η_{SES} is associated with four manifest variables X_{SES}^{C2T0} through the measurement model

X_{SES}^{C2T0} = μ_{XSES}^{C2T0} + λ_{XSES} η_{SES} + e_{XSES}   (6)

with e_{XSES} ~ N(0, Θ_{XSES}) and η_{SES} ~ N(0, Φ_{SES}). X_{SES}^{C2T0} includes father’s and mother’s education level (YFEDUC, YMEDUC) and father’s and mother’s occupation national code (YFOCCN, YMOCCN). They are generated through a multivariate normal distribution, X_{SES}^{C2T0} ~ MN(μ_{XSES}^{C2T0}, Σ_{XSES}^{C2T0}).
The four means are μ_{XSES}^{C2T0} = [3.375, 3.349, 4.277, 4.128]. The covariance matrix Σ_{XSES}^{C2T0} is computed as λ_{XSES} Φ_{SES} λ'_{XSES} + Θ_{XSES}. The parameter values of λ_{XSES}, Φ_{SES} and Θ_{XSES} are available in the Appendix. The computed variances (the diagonal of Σ_{XSES}^{C2T0}) are 0.475, 0.405, 4.421, and 3.916. The reliability coefficient (Lord & Novick, 1968; Raykov, 1997) was computed as 0.25 in Cohort 2 at Time 0.
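The model-implied covariance λΦλ' + Θ can be computed as in the sketch below. All numeric values are hypothetical placeholders, since the Appendix estimates are not reproduced here. The reliability function uses one common composite-reliability formula for a unit-weighted sum with uncorrelated errors, which is an assumption about how such a coefficient might be computed rather than the study's documented procedure.

```python
def implied_covariance(loadings, factor_var, residual_vars):
    """Model-implied covariance of the indicators for a one-factor model:
    Sigma = lambda * Phi * lambda' + Theta, with a diagonal Theta."""
    k = len(loadings)
    return [
        [
            loadings[i] * factor_var * loadings[j]
            + (residual_vars[i] if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]


def composite_reliability(loadings, factor_var, residual_vars):
    """Share of the unit-weighted composite's variance due to the factor
    (assumes uncorrelated residuals)."""
    true_var = sum(loadings) ** 2 * factor_var
    return true_var / (true_var + sum(residual_vars))


# Hypothetical parameter values for four indicators (not the Appendix estimates).
loadings = [0.3, 0.3, 0.6, 0.6]
factor_var = 1.0
residual_vars = [0.4, 0.4, 4.0, 4.0]

sigma = implied_covariance(loadings, factor_var, residual_vars)
rho = composite_reliability(loadings, factor_var, residual_vars)
print([round(sigma[i][i], 3) for i in range(4)], round(rho, 3))
```

Large residual variances relative to the loadings, as in these placeholder values, are what drive a low reliability of the kind reported for Cohort 2 at Time 0.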
Generated Control Group Cohort 1 at Time 1 Data
Data generation of Cohort 1 at Time 1 involves manipulating random measurement errors and reliability values to simulate selection bias. The four manipulations are summarized in Table 3. Each manipulation represents one source of simulated selection bias, which causes noncomparability between Cohort 2 at Time 0 data and Cohort 1 at Time 1 data.
Simulation 1: C1T1’s manifest variable means differ from C2T0’s, with the same latent means and low reliability.
In this simulated Cohort 1 at Time 1, the four manifest variables of latent variable SES are generated through a multivariate normal distribution MN(μ_{XSES}^{C1T1}, Σ_{XSES}^{C1T1}), with μ_{XSES}^{C1T1} = c_{1} + μ_{XSES}^{C2T0}. Adding c_{1} indicates that each manifest variable mean of Cohort 1 at Time 1 is 0.5 standard deviations larger than that of Cohort 2 at Time 0. The four means are μ_{XSES}^{C1T1} = [3.720, 3.667, 5.328, 5.117]. The computed reliability coefficient for Cohort 1 at Time 1 is equal to 0.25, which is as low as that in Cohort 2 at Time 0. Reliability this low has not been examined in simulated propensity score analyses involving measurement error (reliability of .92 in McCaffrey et al., 2011; reliability from .5 to .9 in Steiner et al., 2011 and Rodriguez de Gil et al., 2015; reliability of .5, .7, .9 and .999 in Webb-Vargas et al., 2015).
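The shifted means follow directly from the Cohort 2 at Time 0 means and variances reported above. A short worked check, with variable names chosen here for illustration:

```python
import math

# Means and variances of the four SES manifest variables in Cohort 2 at Time 0,
# as reported in the text.
means_c2t0 = [3.375, 3.349, 4.277, 4.128]
variances_c2t0 = [0.475, 0.405, 4.421, 3.916]

# c1 adds half a standard deviation to each manifest variable mean.
c1 = [0.5 * math.sqrt(v) for v in variances_c2t0]
means_c1t1 = [round(m + c, 3) for m, c in zip(means_c2t0, c1)]
print(means_c1t1)
```

Rounded to three decimals, the result matches the four Cohort 1 at Time 1 means given in the text (3.720, 3.667, 5.328, 5.117).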
Simulation 2: C1T1’s manifest variables have higher reliability than C2T0’s, with the same manifest means and the same latent means.
In the second simulation, the manifest variables of Cohort 1 at Time 1 are generated through a multivariate normal distribution X_{SES}^{C1T1} ~ MN(μ_{XSES}^{C2T0}, Σ_{XSES}^{C1T1}). The residual variances of the four manifest variables are reduced by 90%. In turn, the computed reliability coefficient is increased to 0.78. A larger reliability coefficient indicates a stronger relationship between the four manifest variables and the latent variable η_{SES} in Cohort 1 at Time 1. This larger reliability is similar to values that have been studied in Steiner et al. (2011), Rodriguez de Gil et al. (2015) and Webb-Vargas et al. (2015).
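The link between the 90% reduction in residual variance and the higher reliability can be sketched as follows. This assumes reliability is the ratio of true-score variance to total variance and that only the error variance changes; the function name is hypothetical.

```python
def reliability_after_error_reduction(reliability, reduction):
    """New reliability after shrinking the error variance by `reduction`
    (e.g. 0.90 for 90%), holding the true-score variance fixed."""
    true_share = reliability          # true-score share of total variance
    error_share = 1.0 - reliability   # error share of total variance
    return true_share / (true_share + (1.0 - reduction) * error_share)


new_reliability = reliability_after_error_reduction(0.25, 0.90)
print(round(new_reliability, 2))
```

Starting from 0.25, this gives roughly 0.77, in line with the reported 0.78; the small difference may come from rounding in the reported 0.25 or from a different reliability formula.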
Simulation 3: C1T1’s manifest variables have higher reliability, with a different latent variable mean from C2T0’s.
In the third simulation, the manifest variables of Cohort 1 at Time 1 have a higher reliability of 0.78; the manifest variables of Cohort 2 at Time 0 have a reliability of 0.25. In addition, the latent variable η_{SES} mean in Cohort 1 at Time 1 is 0.68, which is half of the standard deviation of the latent variable η_{SES} in Cohort 2 at Time 0. Because of the latent mean difference, the manifest variable means of the two cohorts differ by a constant vector c_{2}; that is, c_{2} = 0.68 × λ_{XSES} based on Equation (6) above. In Simulations 1 and 2, by contrast, the latent variable η_{SES} mean of Cohort 1 at Time 1 is equal to 0 (see Simulations One and Two in Table 4).
Simulation 4: C1T1’s latent variable mean differs from C2T0’s, with the same high reliability.
In the fourth simulation, both Cohort 1 at Time 1 and Cohort 2 at Time 0 have a reliability of 0.78. This is achieved by using the manipulation discussed in Simulation 2. The mean of η_{SES} in Cohort 1 at Time 1 is manipulated in the same way as that discussed in Simulation 3.
Four Types of Matching
The R (R Development Core Team, 2007) package MatchIt (Ho, Imai, King & Stuart, 2011) carries out the four types of matching for each manipulation. The first matching, Propensity Score Matching Based on Manifest Variables (PSMMV), uses “naïve” propensity scores estimated through manifest variables. The second, Propensity Score Matching Based on Latent Variable (PSMLV), uses “true” propensity scores estimated from the latent variable. The third, Matching on Factor Score (MFS), treats estimated factor scores as propensity-score-like measures to reduce bias. The factor score is estimated through Mplus (Muthén & Muthén, 1998–2015). The last matching, Mahalanobis Distance Matching Based on Factor Score (MDMFS), uses the Mahalanobis distance of the estimated factor scores. If a unit can be reused in matching, it is called matching with replacement (Austin, 2014). Using simulated multilevel data, Wang et al. (2017) only conducted matching without replacement and suggested that future studies should examine matching with replacement. In our study, each of the four types of matching is conducted with and without replacement. Implementing the same settings as in Wang (2015) and Wang et al. (2017), we set the caliper (Stuart & Rubin, 2008) at 0.2 and 0.01. The simulation design is 4 (simulations) × 4 (types of matching) × 2 (with/without replacement) × 2 (calipers), which determines the structure of Table 4. Each condition is simulated with 1,000 replications. Each replication randomly draws 100 treatment classes and 100 controls, with an average class size of 27. The sample size of each replication on average is 5,400.
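To make the caliper and replacement options concrete, the following is a minimal sketch of greedy one-to-one nearest-neighbor matching on a scalar score (a propensity score or factor score). It is not MatchIt's implementation: the function name, the processing order, and the choice to express the caliper in standard deviations of the pooled score are assumptions for illustration.

```python
import statistics


def caliper_match(treated, control, caliper=0.2, replace=False):
    """Greedy 1:1 nearest-neighbor matching on a scalar score.

    Returns (treated_index, control_index) pairs. A treated unit stays
    unmatched when no eligible control lies within the caliper.
    """
    # Caliper width in score units: caliper x SD of the pooled scores (assumed convention).
    width = caliper * statistics.pstdev(list(treated) + list(control))
    used = set()
    pairs = []
    for i, t in enumerate(treated):
        best_j, best_d = None, None
        for j, c in enumerate(control):
            if not replace and j in used:
                continue  # without replacement, each control is used at most once
            d = abs(t - c)
            if d <= width and (best_d is None or d < best_d):
                best_j, best_d = j, d
        if best_j is not None:
            pairs.append((i, best_j))
            used.add(best_j)
    return pairs


treated_scores = [0.50, 0.52, 0.90]
control_scores = [0.51, 0.10, 0.89]
pairs_no_replace = caliper_match(treated_scores, control_scores, 0.2, replace=False)
pairs_replace = caliper_match(treated_scores, control_scores, 0.2, replace=True)
print(pairs_no_replace, pairs_replace)
```

In this toy example the second treated unit finds no partner without replacement, because its only close control is already taken, but it is matched once controls can be reused. A smaller caliper tightens the match quality at the cost of leaving more treated units unmatched.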
Simulation Evaluation
The estimation bias reduction rate (Cochran & Rubin, 1973; Stuart & Rubin, 2008) is computed as:
For each of the 1,000 replications, there are four matching methods. In a replication, if the initial bias of the two cohorts is less than 0.5 standard deviations of the 1,000 initial biases, then the two cohorts are comparable and that replication’s matching results will not be used to compute the bias reduction rate. Table 4 summarizes the results of the four manipulation studies.
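The computation and the exclusion rule can be sketched as below, assuming the usual percent form, 100 × (initial bias − bias after matching) / initial bias. The function names, the use of the sample standard deviation, and the comparison on absolute initial bias are assumptions for illustration.

```python
import statistics


def bias_reduction_rate(initial_bias, matched_bias):
    """Percent bias reduction; a negative value means matching increased the bias."""
    return 100.0 * (1.0 - matched_bias / initial_bias)


def mean_rate_over_replications(initial_biases, matched_biases, threshold_sd=0.5):
    """Average the rate over replications whose initial bias is large enough.

    Replications with |initial bias| below threshold_sd x SD of all initial
    biases are treated as already comparable and are skipped.
    """
    cutoff = threshold_sd * statistics.stdev(initial_biases)
    rates = [
        bias_reduction_rate(b0, b1)
        for b0, b1 in zip(initial_biases, matched_biases)
        if abs(b0) >= cutoff
    ]
    return statistics.mean(rates)


# Toy values for four replications (hypothetical, not simulation output).
initial = [2.0, 0.1, -2.0, 1.0]
matched = [1.0, 0.1, -0.5, 0.5]
average_rate = mean_rate_over_replications(initial, matched)
print(round(average_rate, 2))
```

Here the second replication is dropped because its initial bias is small relative to the spread of initial biases, and the remaining three rates are averaged.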
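Table 4: Matching Results of Simulation Design: Four (Simulations) × Four (Types of Matching) × Two (With/Without Replacement) × Two (Calipers)

| Simulation | C2T0: Observed X̄ | C2T0: Latent Mean | C2T0: Reliability | C1T1: Observed X̄ | C1T1: Latent Mean | C1T1: Reliability | Replacement | Caliper | PSMMV | PSMLV | MFS | MDMFS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| One | µ | 0 | Low | µ+c_{1} | 0 | Low | No | .2 | 60.77 | 2.83 | 2.15 | 0.60 |
| | | | | | | | No | .01 | 56.59 | -3.38 | -2.31 | -4.34 |
| | | | | | | | Yes | .2 | 58.66 | -4.59 | -4.59 | -9.30 |
| | | | | | | | Yes | .01 | 54.68 | -2.64 | -2.64 | -19.5 |
| Two | µ | 0 | Low | µ | 0 | High | No | .2 | -4.68 | 0.15 | 0.59 | 2.04 |
| | | | | | | | No | .01 | 2.62 | 8.35 | 9.26 | 5.40 |
| | | | | | | | Yes | .2 | 5.74 | -2.14 | 3.26 | -4.85 |
| | | | | | | | Yes | .01 | -4.85 | -6.50 | -6.50 | -16.05 |
| Three | µ | 0 | Low | µ+c_{2} | .68 | High | No | .2 | 50.37 | 49.46 | 46.70 | 4.70 |
| | | | | | | | No | .01 | 52.65 | 50.22 | 49.06 | 13.98 |
| | | | | | | | Yes | .2 | 54.47 | 47.43 | 47.43 | 48.87 |
| | | | | | | | Yes | .01 | 58.53 | 56.38 | 56.38 | 52.68 |
| Four | µ | 0 | High | µ+c_{2} | .68 | High | No | .2 | 55.93 | 55.10 | 56.83 | 3.43 |
| | | | | | | | No | .01 | 54.93 | 54.51 | 54.35 | 13.74 |
| | | | | | | | Yes | .2 | 57.11 | 56.34 | 56.34 | 52.22 |
| | | | | | | | Yes | .01 | 61.95 | 61.56 | 61.56 | 53.03 |

Note: Entries under PSMMV, PSMLV, MFS, and MDMFS are bias reduction rates (%); negative values indicate that matching increased the estimation bias. C2T0 = Cohort 2 at Time 0; C1T1 = Cohort 1 at Time 1. c_{1} represents that each manifest variable mean of Cohort 1 at Time 1 is 0.5 standard deviations larger than that of Cohort 2 at Time 0; c_{2} represents that the latent variable mean in Cohort 1 at Time 1 is increased by a half of the standard deviation of the latent variable η_{SES} in Cohort 2 at Time 0. PSMMV: propensity score matching based on manifest variables; PSMLV: propensity score matching based on latent variable; MFS: matching on factor score; MDMFS: Mahalanobis distance matching based on factor score.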
Simulation 1
In simulation 1, C1T1’s manifest variable means differ from C2T0’s, with the same latent means (0) and low reliability (.25).
Propensity Score Matching Based on Manifest Variables (PSMMV)
Matching on propensity scores estimated from the four manifest variables through a larger caliper (0.2) without replacement reduces the schooling effect estimation bias in SCD (shortened as “estimation bias”) by 60.77%, and with replacement by 58.66%. Matching on a smaller caliper (0.01) without replacement reduces estimation bias by 56.59%, and with replacement by 54.68%.
Propensity Score Matching Based on Latent Variable (PSMLV)
Matching on propensity scores estimated from factor scores through a larger caliper without replacement reduces estimation bias by 2.83%; however, matching with replacement through a larger caliper increases estimation bias by 4.59%. Matching through a smaller caliper without replacement increases estimation bias by 3.38%, and with replacement by 2.64%.
Matching on Factor Score (MFS)
When the estimated factor score is used as a propensity-score-like measure to match through a larger caliper without replacement, it reduces estimation bias by 2.15%, but with replacement increases estimation bias by 4.59%. Matching on a smaller caliper without replacement increases estimation bias by 2.31%, and with replacement by 2.64%.
Mahalanobis Distance Matching Based on Factor Score (MDMFS)
If the estimated Mahalanobis distance of the estimated factor score is used for matching on a larger caliper without replacement, it reduces estimation bias by 0.60%, but with replacement increases estimation bias by 9.30%. Matching on a smaller caliper without replacement increases estimation bias by 4.34%, and with replacement by 19.5%.
In summary, when the latent variable could not represent the manifest variables well (i.e., low reliability), matching based on manifest variables was optimal. Using a larger caliper reduced more bias than using a smaller caliper. Given the same caliper, matching without replacement reduced more bias than matching with replacement.
Simulation 2
In simulation 2, C1T1’s manifest variables have higher reliability (0.78) than C2T0’s (0.25), with the same manifest means and the same latent means.
Propensity Score Matching Based on Manifest Variables
Matching on propensity scores estimated from the four manifest variables through a larger caliper without replacement increases estimation bias by 4.68%, but with replacement reduces estimation bias by 5.74%. Matching through a smaller caliper without replacement reduces estimation bias by 2.62%, but with replacement increases estimation bias by 4.85%.
Propensity Score Matching Based on Latent Variable
Matching on propensity scores estimated from factor scores through a larger caliper without replacement reduces estimation bias by 0.15%, but with replacement increases estimation bias by 2.14%. Matching through a smaller caliper without replacement reduces estimation bias by 8.35%, but with replacement increases estimation bias by 6.50%.
Matching on Factor Score
Matching on factor scores through a larger caliper without replacement reduces estimation bias by 0.59%, and with replacement by 3.26%. Matching through a smaller caliper without replacement reduces estimation bias by 9.26%, but with replacement increases estimation bias by 6.50%.
Mahalanobis Distance Matching Based on Factor Score
Mahalanobis distance matching based on estimated factor scores through a larger caliper without replacement reduces estimation bias by 2.04%, but with replacement increases estimation bias by 4.85%. Matching through a smaller caliper without replacement reduces estimation bias by 5.40%, but with replacement increases estimation bias by 16.05%.
In summary, when there was no difference between the two groups, matching based on manifest or latent variable was not necessary because little bias was reduced.
Simulation 3
In simulation 3, C1T1’s manifest variables have higher reliability (0.78) than C2T0’s, and C1T1’s latent variable mean (0.68) differs from C2T0’s (0).
Propensity Score Matching Based on Manifest Variables
Matching on propensity scores estimated from the four manifest variables through a larger caliper without replacement reduces estimation bias by 50.37%, and with replacement by 54.47%. Matching through a smaller caliper without replacement reduces estimation bias by 52.65%, and with replacement by 58.53%.
Propensity Score Matching Based on Latent Variable
Matching on propensity scores estimated from factor scores through a larger caliper without replacement reduces estimation bias by 49.46%, and with replacement by 47.43%. Matching through a smaller caliper without replacement reduces estimation bias by 50.22%, and with replacement by 56.38%.
Matching on Factor Score
Matching on factor scores through a larger caliper without replacement reduces estimation bias by 46.70%, and with replacement by 47.43%. Matching through a smaller caliper without replacement reduces estimation bias by 49.06%, and with replacement by 56.38%.
Mahalanobis Distance Matching Based on Factor Score
Mahalanobis distance matching based on estimated factor scores through a larger caliper without replacement reduces estimation bias by 4.70%, and with replacement by 48.87%. Matching through a smaller caliper without replacement reduces estimation bias by 13.98%, and with replacement by 52.68%.
Simulation 4
In simulation 4, C1T1’s latent variable mean (0.68) differs from C2T0’s (0), with the same high reliability (0.78).
Propensity Score Matching Based on Manifest Variables
Matching on propensity scores estimated from the four manifest variables through a larger caliper without replacement reduces estimation bias by 55.93%, and with replacement by 57.11%. Matching through a smaller caliper without replacement reduces estimation bias by 54.93%, and with replacement by 61.95%.
Propensity Score Matching Based on Latent Variable
Matching on propensity scores estimated from factor scores through a larger caliper without replacement reduces estimation bias by 55.10%, and with replacement by 56.34%. Matching through a smaller caliper without replacement reduces estimation bias by 54.51%, and with replacement by 61.56%.
Matching on Factor Score
Matching on factor scores through a larger caliper without replacement reduces estimation bias by 56.83%, and with replacement by 56.34%. Matching through a smaller caliper without replacement reduces estimation bias by 54.35%, and with replacement by 61.56%.
Mahalanobis Distance Matching Based on Factor Score
Mahalanobis distance matching based on estimated factor scores through a larger caliper without replacement reduces estimation bias by 3.43%, and with replacement by 52.22%. Matching through a smaller caliper without replacement reduces estimation bias by 13.74%, and with replacement by 53.03%.
In summary, when the latent variable represented the manifest variables well in both the treatment and control groups, matching based on manifest variables and matching based on the latent variable were equally effective. Compared with Simulation 3, using a more reliable control group (C2T0) in Simulation 4 achieved the best results. Matching with replacement on a smaller caliper was the best choice. For matching without replacement, a larger caliper reduced more bias than a smaller caliper; for matching with replacement, a smaller caliper reduced more bias. Given the same caliper, matching with replacement reduced more bias than matching without replacement. Mahalanobis distance matching was the least efficient of the four matching methods: it was effective only when matching with replacement was used, and a smaller caliper reduced more bias than a larger one.
Matching on Factor Score Works Sufficiently and Better with Higher Reliability
Measurement error has a negative effect on bias reduction in a latent variable matching framework. If a latent variable, rather than measurement error, mainly accounts for the variation among manifest variables, then matching on the latent variable itself is equivalent to matching on the propensity score computed from the latent variable. This result supports the previous practice of using latent ability or academic proficiency estimates to match treatment and control cohort participants (e.g., Van der Linden & Hambleton, 1997). Latent variable matching achieves better bias reduction when both the treatment and control cohorts have highly reliable measures than when only one of them does. In cases that involve multiple manifest variables, latent variable matching is preferable because it is more efficient than matching on the respective manifest variables. It is worth noting that latent variable matching approaches are effective if the two cohorts' factor score means differ. If, however, the two cohorts are comparable in terms of the latent variable means, then matching through the latent variable is not necessary.
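The role of reliability can be illustrated with a small simulation. This is a rough sketch, not the study's data-generating model: it assumes a single latent factor with unit variance, unit loadings, four indicators, and that the reported reliabilities (0.78 and 0.25) apply to each indicator. All function names are hypothetical.

```python
import numpy as np


def simulate_indicators(n, reliability, n_items=4, seed=0):
    """Draw a latent factor and n_items manifest indicators (unit loadings).

    With a unit-variance factor and unit loadings, per-indicator reliability
    equals 1 / (1 + error_var), so error_var = (1 - rel) / rel.
    """
    rng = np.random.default_rng(seed)
    eta = rng.normal(0.0, 1.0, size=n)                    # latent factor
    error_var = (1.0 - reliability) / reliability
    errors = rng.normal(0.0, np.sqrt(error_var), size=(n, n_items))
    return eta, eta[:, None] + errors                     # manifest = factor + error


def squared_correlations(eta, x):
    """Squared correlation of each indicator with the latent factor."""
    return np.array([np.corrcoef(eta, x[:, j])[0, 1] ** 2 for j in range(x.shape[1])])


eta_hi, x_hi = simulate_indicators(20000, 0.78, seed=1)
eta_lo, x_lo = simulate_indicators(20000, 0.25, seed=2)
rel_hi = squared_correlations(eta_hi, x_hi).mean()  # close to 0.78
rel_lo = squared_correlations(eta_lo, x_lo).mean()  # close to 0.25
```

When reliability is low, most of the variation in each indicator is error, so a factor score estimated from those indicators carries little information about group differences that sit in the error terms.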
Matching Based on Manifest Variables with Measurement Error Is Sufficient
In practice, studies often use data of differing quality and type in terms of measurement reliability, which requires different matching options and leads to inconsistent results. Measurement error has a case-by-case effect on propensity score matching and does not necessarily attenuate the causal effect (Battistin & Chesher, 2014). The naïve propensity score can work as well as error-corrected propensity score methods (McCaffrey et al., 2011). Webb-Vargas et al. (2015) also found that the naïve covariate method and an imputation-based method worked equally well in propensity score analysis on real data. Findings in Jakubowski (2015) were inconsistent: matching based on manifest variables can work slightly worse (Table 3 Model B and C, p. 1302) than matching based on the latent variable, but it can work slightly better when reliability is the lowest due to measurement error variance and lack of common support (Table 3 Model D, p. 1302). Using manifest variables with measurement error for propensity score analysis can show an advantage in bias reduction for treatment effect estimation, specifically when the latent composite variable is underrepresentative in observational studies. This study focuses on bias reduction on the covariate rather than on treatment effect estimation. We found that propensity score matching based on manifest variables works comparably to propensity score matching based on the latent variable.
Measurement Complexity and Mixed Matching Effects Call for More Latent-Variable-Based Research on the Topic
This study demonstrates a complicated picture of matching through manifest variables and/or a latent variable because selection bias may be due to the intercept, the latent variable, and measurement error, as shown in Equation (6). Simulation 1 indicates that matching on manifest variables outperforms matching on the latent variable. That is, if the two cohorts differ only on poorly measured manifest variables with considerable error, then latent variable matching works inefficiently to balance the manifest variables; however, propensity score matching through these manifest variables is still optimal. In this situation, selection bias on the covariate is NOT due to the latent variable but to measurement error. Latent-variable-based propensity score matching is then not sufficient to reduce selection bias on the covariate. However, manifest-variable-based propensity score matching will work, because each manifest variable contains an extra latent variable, i.e., measurement error. In practice, if the treatment and control groups are hypothesized to have the same distribution of latent ability, matching on the propensity score estimated through manifest variables (e.g., observed performance measures) should be practical and optimal.
Matching Options (Replacement and Caliper) Interact with Measurement Reliability
Matching options include whether matching allows replacement and whether a smaller caliper is used. Austin (2011) found that optimal bias reduction rates are obtained when the caliper is in a range from 0 to 0.4 (p. 153). Lunt (2014) recommended using a tighter caliper because bias reduction performance worsens as the caliper becomes larger. Austin (2014) simulated nonhierarchical data and found that matching with replacement performed as well as caliper matching without replacement. Wang et al. (2017) found that caliper matching interacts with data structure. They recommended that “different sizes of caliper should be used for level-1 [i.e., student-level] and level-2 [i.e., class-level] matching” (p. 67). Our student-level matching results showed an interaction between matching options and data quality (i.e., measurement reliability). When measures for the two cohorts have equally low reliability, either using a larger caliper or matching without replacement will reduce more bias; matching without replacement on a larger caliper is optimal. However, the trend is reversed when the two cohorts have equally high reliability. That is, either using a smaller caliper or matching with replacement improves matching performance, and matching with replacement on a smaller caliper performs the best.
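The two matching options can be made concrete with a short sketch. This is a simplified greedy nearest-neighbor matcher on a single score (a propensity score or a factor score), not the procedure used in the study; the function name, the processing order, and the toy scores are assumptions made for illustration.

```python
import numpy as np


def caliper_match(treated, control, caliper, replace=False):
    """Greedy nearest-neighbor matching on a scalar score within a caliper.

    Returns (treated_index, control_index) pairs. Treated units are processed
    in the order given; with replace=False each control is used at most once.
    Assumes at least one control unit.
    """
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    available = np.ones(len(control), dtype=bool)
    pairs = []
    for i, score in enumerate(treated):
        dist = np.abs(control - score)
        dist[~available] = np.inf          # hide controls that are already used
        j = int(np.argmin(dist))
        if dist[j] <= caliper:             # accept only matches inside the caliper
            pairs.append((i, j))
            if not replace:
                available[j] = False
    return pairs


# Toy scores: two treated units compete for the same close control.
treated_scores = [0.50, 0.52, 0.90]
control_scores = [0.51, 0.70, 0.95]

with_repl = caliper_match(treated_scores, control_scores, caliper=0.1, replace=True)
without_repl = caliper_match(treated_scores, control_scores, caliper=0.1, replace=False)
wide_caliper = caliper_match(treated_scores, control_scores, caliper=0.2, replace=False)
```

In this toy case, matching with replacement keeps all three treated units under the small caliper, while matching without replacement either drops the second treated unit (small caliper) or pairs it with a more distant control (large caliper). That trade-off between retained sample and match closeness is one way to think about why the two options interact.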
Mahalanobis Distance Matching Is Comparatively Insufficient
Previous studies have found that post hoc matching depends on the summary measure that is a functional composite of covariates (Rubin, 1985). The most commonly used composites are the Mahalanobis distance (e.g., Rubin, 1980) and the propensity score (Rosenbaum & Rubin, 1983). Mahalanobis distance matching is less effective than propensity score matching on multilevel data (Wang, 2015). Similarly, this study found that Mahalanobis distance matching reduces bias less effectively than either propensity score matching or factor score matching. Mahalanobis distance matching with replacement was comparatively optimal only when the treatment group data are highly reliable. Propensity score matching generally performs better than Mahalanobis distance matching when the true propensity score model is known and the sample size is large (Sekhon & Diamond, 2008). The simulation settings of this study favor propensity score matching. Each simulated condition determines a “true” and known propensity score model. Each of the 1000 replications uses 100 classes, and the sample size is approximately 5,400.
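For reference, the Mahalanobis distance between a treated and a control unit can be computed as below. This is a minimal sketch, assuming the covariance matrix is estimated from the pooled sample of both groups (other choices, such as the control-group covariance, are also used in practice); the function name and the toy data are hypothetical.

```python
import numpy as np


def mahalanobis_distances(treated, control):
    """Pairwise Mahalanobis distances between treated and control units.

    Both inputs are 2-D arrays of shape (n_units, n_covariates). The covariance
    matrix is estimated from the pooled sample, which is an assumption here.
    Returns an array of shape (n_treated, n_control).
    """
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    pooled = np.vstack([treated, control])
    cov = np.atleast_2d(np.cov(pooled, rowvar=False))   # (p, p), also when p == 1
    cov_inv = np.linalg.inv(cov)
    diff = treated[:, None, :] - control[None, :, :]    # (n_t, n_c, p)
    d2 = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))                 # guard against tiny negatives


# Toy example with a single covariate (e.g., one estimated factor score).
t = np.array([[0.0], [2.0]])
c = np.array([[1.0], [3.0]])
d = mahalanobis_distances(t, c)
```

With a single covariate, the distance reduces to the absolute difference divided by the pooled standard deviation, so matching on it orders candidate controls the same way as matching on the covariate itself.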
Need to Examine Both Level-1 and Level-2 Data With Measurement Errors
In educational studies, researchers often sample larger units from a hierarchically structured population (Cochran, 1963; Scott & Smith, 1969). Due to the hierarchical structure of the experimental design and data collection (Raudenbush & Sadoff, 2008), treatment units are generally classes or schools rather than individual students (Hedges, 2007). When clusters are assigned to interventions, noncomparable treatment-control groups can arise from either level-1 or level-2 covariates (Raab & Butcher, 2001), resulting in selection bias. This study focuses on manifest variables and the latent variable only at level 1, although the simulated model is based on a two-level latent variable framework. Future research should explore how level-2 measurement errors affect the accuracy of matching in terms of the bias reduction rate when level-2 matching and dual matching (Wang, 2015) are needed.
This study is based on work supported by the National Science Foundation (NSF) under Grant No. DUE0831581. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.
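Two-Level Structural Equation Model Estimates (a.k.a. True Pseudopopulation Parameter Values)

| Variable | Label | Loading Coef. | Loading SE | Loading p | Regression on PRETEST Coef. | Regression on PRETEST SE | Regression on PRETEST p | Regression on POSTTEST Coef. | Regression on POSTTEST SE | Regression on POSTTEST p | Residual Est. | Residual SE | Residual p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Level One Parameters | | | | | | | | | | | | | |
| PreTest Score | PRETEST | | | | | | | .72 | .03 | .00 | 31.87 | 1.94 | 0.00 |
| PostTest Score | POSTTEST | | | | | | | | | | 25.64 | 1.27 | 0.00 |
| Educational | YPWANT | 1.00 | | | .87 | 1.56 | .58 | | | | .21 | .01 | 0.00 |
| | YPWWELL | 1.05 | .08 | .00 | | | | | | | .37 | .03 | 0.00 |
| | YPENC | 1.82 | .11 | .00 | | | | | | | .66 | .05 | 0.00 |
| Self | YIWANT | 1.00 | | | 1.97 | .56 | .00 | | | | .58 | .04 | 0.00 |
| | YMORMTH | 1.98 | .18 | .00 | | | | | | | .67 | .05 | 0.00 |
| | YNOMORE | 1.67 | .13 | .00 | | | | | | | .77 | .05 | 0.00 |
| Family | YPINT | 1.00 | | | .04 | .25 | .88 | | | | .62 | .04 | 0.00 |
| | YFLIKES | .77 | .05 | .00 | | | | | | | .73 | .03 | 0.00 |
| | YMLIKES | .46 | .04 | .00 | | | | | | | 1.05 | .04 | 0.00 |
| | YFABLE | 1.00 | .06 | .00 | | | | | | | .85 | .05 | 0.00 |
| | YMABLE | .60 | .05 | .00 | | | | | | | 1.27 | .05 | 0.00 |
| Math Importance | YMIMPT | 1.00 | | | .89 | .76 | .25 | | | | .17 | .02 | 0.00 |
| | YFIMPT | 1.06 | .05 | .00 | | | | | | | .24 | .03 | 0.00 |
| Socioeconomic | YFEDUC | 1.00 | | | 1.55 | .30 | .00 | | | | .17 | .01 | 0.00 |
| | YMEDUC | .72 | .04 | .00 | | | | | | | .24 | .01 | 0.00 |
| | YFOCCN | 1.94 | .13 | .00 | | | | | | | 3.26 | .13 | 0.00 |
| | YMOCCN | 1.54 | .14 | .00 | | | | | | | 3.18 | .13 | 0.00 |
| Age | XAGE | | | | .06 | .02 | .00 | | | | | | |
| Parental Help | YFAMILY | | | | 1.44 | .16 | .00 | | | | | | |
| Ed. Expectation | EDUECPT | | | | 1.28 | .17 | .00 | | | | | | |
| Homework | YMHWKT | | | | .03 | .01 | .01 | | | | | | |
| Level Two Parameters | | | | | | | | | | | | | |
| Class Size | CLASSIZE | | | | .20 | .06 | .00 | | | | | | |
| Opportunity | OLDARITH | | | | .65 | .36 | .07 | | | | | | |
| | OLDGEOM | | | | .79 | .94 | .41 | | | | | | |
| | NEWALG | | | | | | | .27 | .13 | .03 | | | |
| | NEWGEOM | | | | | | | .37 | .14 | .01 | | | |
| Instruction | TPPWEEK | | | | | | | .08 | .02 | .00 | | | |
| Qualified Math | MTHONLY | | | | 4.51 | 2.11 | .03 | | | | | | |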
Footnotes
^{1} Selection bias, also called “sample selection bias” (Heckman, 1979), refers to the bias that is due to the use of nonrandom samples in estimating relationships among variables of interest. It can occur in two situations: 1) self-selection by objects being studied, and 2) sample selection by researchers or data analysts. Using selection-biased samples results in a biased estimate of the effect of an intervention that should have been randomly assigned. The intervention can refer to “treatment of migration, manpower training, or unionism” (Heckman, 1979, p. 154).
^{2} The estimated factor scores can be derived using SEM software packages such as Mplus (Muthén & Muthén, 1998-2015).
© 2019. Rivera Publications, Inc. All rights reserved.