Bias Reduction Rates for Latent Variable Matching versus Matching through Manifest Variables with Measurement Errors

Abstract
Based upon a two-level structural equation model, this simulation study compares latent variable matching and matching through manifest variables. Selection bias is simulated on the latent variable and/or manifest variables along with different magnitudes of reliability. Besides factor score matching and Mahalanobis distance matching, we examined two types of propensity score matching: matching on a "naïve" propensity score derived from manifest covariates, and matching on a "true" propensity score derived from the latent factor. Results suggest that 1) Mahalanobis distance matching works less effectively than propensity/factor score matching; 2) propensity score and factor score matching perform best if both treatment and control groups have high reliability; 3) matching through manifest variables is optimal and preferable if the latent composite variable is under-representative; 4) when the latent variable represents the manifest variables well, latent variable matching is preferable and more efficient than matching on the respective manifest variables; and 5) matching options such as caliper matching and replacement matching interact with the magnitude of reliability, and matching with replacement on a smaller caliper performs the best for more reliable measures.

Introduction
There is an increasing need for studies on how educational interventions affect student performance (Raudenbush & Sadoff, 2008; Spybrook, 2007). A study's ability to assess the efficacy and efficiency of educational interventions depends on its hypothesis development, experimental design, controlled experimental trials, identification of the population of interest, and implementation (Sloane, 2008). The most challenging task is to obtain valid measurements of interventions in order to assess the effect of intervention (Raudenbush & Sadoff, 2008; Sloane, 2008). Measures of the classroom interventions that students receive can be subject to measurement errors (Raudenbush & Sadoff, 2008) in data collection through large-scale surveys and observational studies (Cochran, 1963, 1965, 1969, 1972; Rosenbaum, 2002).
Structural equation modeling (SEM; Bollen, 1989) incorporates the latent variable to account for measurement errors on manifest variables. Kaplan (1999) applied propensity score stratification to the Multiple Indicators Multiple Causes (MIMIC) model to deal with measurement error on the dependent variable for group difference estimation. However, propensity score matching was not conducted. Furthermore, the propensity score was estimated through observed covariates that may have measurement error. This Monte Carlo study uses an SEM framework (Jöreskog & Sörbom, 1996) to examine the effectiveness of matching through the latent variable and through manifest variables with measurement errors.
Measurement issues can have serious impact on the findings of a study because they reduce the efficiency of adjustment (Cochran, 1965, 1968a). While the literature is replete with guidelines on how to use propensity score analysis (Pan & Bai, 2015) to estimate treatment effect, there is little research on how to adjust for measurement errors to examine bias reduction on covariates after matching (Jakubowski, 2015). Most researchers simply analyze and estimate propensity scores by treating the covariates as perfect measures (Cochran, 1957; Cochran & Rubin, 1973). Recent propensity score analyses mainly examine how covariates with measurement errors affect treatment effect estimation through bias correction or imputation (Battistin & Chesher, 2014; Webb-Vargas, Rudolph, Lenis, Murakami & Stuart, 2015), or inverse probability weighting (McCaffrey, Lockwood, & Setodji, 2011; Steiner, Cook & Shadish, 2011). Propensity score matching was rarely conducted in these studies for bias reduction analysis.

Measurement Errors and Bias
Bias occurs when the estimate (test score or observed treatment effect) differs from the value being estimated (true score or true effect) through sampling (Särndal, Swensson & Wretman, 2003). Bias due to measurement errors (Fuller, 1987) can occur in outcome Y and/or covariates X (Carroll et al., 2006; Cochran & Rubin, 1973). Thus, the outcome in an intervention effect model is a sum of three parts (Wooldridge, 2002): 1) the effect of intervention variety, 2) the effect of initial bias due to covariates X, and 3) the random measurement error. If participants are not randomly assigned to intervention groups (Cochran, 1969, 1972), then a study often has problems of selection bias (Heckman, 1979), indicating that the treatment and control groups are initially unbalanced in terms of covariates X. Selection bias attenuates the treatment effect estimate and misleads one's conclusions (Campbell & Stanley, 1966).
Post-hoc matching depends on the summary measure, a functional composite of covariates (Rubin, 1985). The most commonly used composites in matching include the Mahalanobis distance (e.g., Rubin, 1980) and the propensity score (Rosenbaum & Rubin, 1983). Propensity score matching is a post-hoc bias reduction method, which has been commonly used on observational data to approximate individual-randomized trials to study a treatment effect of interest (Cochran, 1953, 1968a; Cochran & Rubin, 1973; Rosenbaum & Rubin, 1983; Rubin, 1973a, 1973b). It is the most commonly used bias reduction technique of post-hoc sampling (Cochran, 1953; Rubin, 1973a, 1973b, 1976a, 1976b, 1979, 1980) in causal inference and program evaluation. A propensity score (P) is a conditional probability that an individual belongs to the treatment group (Rosenbaum & Rubin, 1983). It is generally estimated by using the logit model $\ln[P/(1-P)] = \beta'X$, indicating that the natural logarithm of the odds (i.e., the ratio of P to 1 - P) is functionally related to the background covariates (X, in a vector format). The propensity score, estimated as a function of X, summarizes the distribution information of all potential covariates (Rosenbaum & Rubin, 1983; Rubin, 1985). Using the propensity score, a researcher can match participants from the treatment group with participants from the control group, so that the treatment group and control group can be balanced. This approach can significantly reduce bias in observational studies (Rosenbaum, 2005; Rosenbaum & Rubin, 1985; Rubin & Thomas, 1992; Rubin & Waterman, 2006). It also improves the accuracy of the average treatment effect estimate (Abadie & Imbens, 2006), and facilitates causal inference (Greenland, 2004).
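The logit model and the matching step described above can be illustrated with a minimal sketch. It is not the procedure used in this study: the function names (`fit_logit`, `propensity`, `match_nearest`), the simulated covariates, the greedy matching order, and the choice to express the caliper in standard deviation units of the propensity score are all assumptions made for illustration.

```python
import numpy as np

def fit_logit(X, d, n_iter=25):
    """Fit ln[P/(1-P)] = b0 + b'X by Newton-Raphson; returns coefficients, intercept first."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (d - p)                          # score vector
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])   # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def propensity(X, beta):
    """Estimated probability of belonging to the treatment group."""
    Z = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Z @ beta))

def match_nearest(ps, d, caliper_width):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Returns (treated_index, control_index) pairs whose propensity
    scores differ by at most caliper_width.
    """
    controls = list(np.flatnonzero(d == 0))
    pairs = []
    for t in np.flatnonzero(d == 1):
        if not controls:
            break
        dist = np.abs(ps[controls] - ps[t])
        j = int(np.argmin(dist))
        if dist[j] <= caliper_width:
            pairs.append((int(t), int(controls.pop(j))))
    return pairs

# Hypothetical data: two covariates drive selection into treatment.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
true_logit = -0.5 + 0.8 * X[:, 0] + 0.4 * X[:, 1]
d = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

ps = propensity(X, fit_logit(X, d))
caliper_width = 0.2 * ps.std()          # assumed: caliper of 0.2 SD units
pairs = match_nearest(ps, d, caliper_width)

t_idx = np.array([t for t, _ in pairs])
c_idx = np.array([c for _, c in pairs])
before = abs(X[d == 1, 0].mean() - X[d == 0, 0].mean())
after = abs(X[t_idx, 0].mean() - X[c_idx, 0].mean())
print(f"pairs={len(pairs)}, mean difference before={before:.3f}, after={after:.3f}")
```

Comparing `before` and `after` gives a rough check of covariate balance on the first covariate; treated units with no control inside the caliper are simply left unmatched in this sketch.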

Attenuated Bias Reduction Rate Due to Measurement Errors
Measurement errors attenuate the regression coefficient $\beta$ of covariates X on outcome Y (Fuller, 1987; Jöreskog & Sörbom, 1996). Let $\tilde{\beta}$ be the attenuated regression coefficient. It satisfies $|\tilde{\beta}| < |\beta|$ and $\tilde{\beta} = \beta \times R$ in the bivariate regression (Cochran & Rubin, 1973), where $R$ is the attenuation rate due to measurement errors in the covariate $x$. The bias reduction rate on covariate $x$ is attenuated by $R = |\tilde{\beta}| / |\beta|$ due to measurement errors in covariate $x$ (Cochran & Rubin, 1973). The estimation bias reduction rate (Cochran & Rubin, 1973) is computed as $100 \times (1 - \text{treatment effect estimation bias after matching} / \text{treatment effect estimation bias before matching})\%$.
Cochran (Rubin, 2006, p. 20) found that under a simple linear regression, the measurement error on $x$ attenuates the bias reduction rate by a ratio of $1/(1+h)$, where $h$ "is the ratio of the variance of the errors of measurement to the variance of the correct measurements" (Cochran, 1968b, p. 295). In other words, $1/(1+h)$ equals the reliability $r$.
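As a worked check of the attenuation ratio, the following sketch simulates a bivariate regression with classical measurement error. The true slope of 2.0, the unit variances, the sample size, and the helper names are hypothetical choices; only the relation $r = 1/(1+h)$ comes from the text.

```python
import numpy as np

def reliability_from_h(h):
    """r = 1 / (1 + h), with h = Var(measurement error) / Var(true score)."""
    return 1.0 / (1.0 + h)

def h_from_reliability(r):
    """Inverse relation: h = (1 - r) / r."""
    return (1.0 - r) / r

rng = np.random.default_rng(1)
n = 200_000
beta = 2.0                      # hypothetical true slope
h = 3.0                         # error variance is three times the true variance
x_true = rng.normal(0.0, 1.0, n)
y = beta * x_true + rng.normal(0.0, 1.0, n)
x_obs = x_true + rng.normal(0.0, np.sqrt(h), n)   # covariate observed with error

# Ordinary least squares slope of y on the error-prone covariate.
slope_obs = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)

print("reliability r:", reliability_from_h(h))
print("expected attenuated slope:", beta * reliability_from_h(h))
print("simulated slope:", round(slope_obs, 3))
```

With $h = 3$ the reliability is 0.25, so the slope on the observed covariate should be close to one quarter of the true slope.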

Measurement-error-adjusted Propensity Scores
When the true covariates (X*) are measured by X with errors, matching needs to be based upon the propensity scores Pr(D=1|X*) rather than Pr(D=1|X). There are two ways (Carroll et al., 2006) to adjust for measurement errors in the logit model used to estimate propensity scores.
The first method assumes that the true covariates have not been observed and the naïve parameter estimates are obtained using the observed covariates. An approximately consistent estimator of the parameters is provided through a functional adjustment on the naïve estimator (for details, see Rosner, Spiegelman & Willett, 1990; Rosner, Willett & Spiegelman, 1989).
The second method of adjusting for measurement errors in logistic regression is through structural modeling, in which the distribution of the true covariates is parametrically modeled (Sörbom, 1978; Jakubowski, 2015). For example, the maximum likelihood or Bayesian-approach-based SEM (Carroll et al., 2006; Lee, 2007, Chapter 9) can be used to deal with measurement errors. This two-step adjusted method requires that X and X* have equal dimensions; however, the measurement-error-adjusted propensity scores cannot be obtained directly using this approach because the integral in the subsequent propensity score function does not have a closed-form solution (Carroll et al., 2006, p. 91). An approximate approach has been developed in Weller, Milton, Eisen and Spiegelman (2007) using a multivariate normal conditional distribution of (X*|X, H), where H represents the covariates without measurement errors. Such or similar adjustments have been used in propensity analyses (e.g., Battistin & Chesher, 2014; McCaffrey et al., 2011). For example, the best linear unbiased predictor (BLUP) corrects measurement error to estimate the propensity score using another error-free covariate (McCaffrey et al., 2011). However, when a second sample having both X and X* observed is not available, one cannot estimate the measurement-error-adjusted propensity scores. In this situation, an alternative method such as SEM can be used to estimate the measurement-error-adjusted propensity scores for matching.
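The idea behind predictor-based corrections can be sketched in the simplest case: one error-prone covariate, classical measurement error, and a reliability that is assumed known. The example uses a linear outcome rather than a logit model, so it only illustrates the direction of the correction; the function names and all numeric values are hypothetical, and this is not the estimator of any of the cited papers.

```python
import numpy as np

def calibrate(x_obs, reliability):
    """Best linear predictor of X* from X under classical error:
    E[X* | X] = mean(X) + reliability * (X - mean(X))."""
    m = x_obs.mean()
    return m + reliability * (x_obs - m)

def ols_slope(x, y):
    """Slope of the simple regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(2)
n = 100_000
x_true = rng.normal(0.0, 1.0, n)
x_obs = x_true + rng.normal(0.0, 1.0, n)     # reliability = 1 / (1 + 1) = 0.5
y = 1.5 * x_true + rng.normal(0.0, 1.0, n)   # hypothetical true slope of 1.5

naive_slope = ols_slope(x_obs, y)                       # attenuated toward zero
calibrated_slope = ols_slope(calibrate(x_obs, 0.5), y)  # rescaled by 1 / reliability

print("naive:", round(naive_slope, 3), "calibrated:", round(calibrated_slope, 3))
```

In practice the reliability is rarely known, which is one motivation for the SEM-based alternative discussed in the text.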

Theoretical Framework
Structural equation modeling (Bollen, 1989; Jöreskog & Sörbom, 1996) uses the latent variable to account for measurement errors on manifest variables. Using a two-level SEM (Muthén, 1994), this study manipulates the reliability of the manifest variables to compare latent variable matching with matching through manifest variables.

Structural Equation Modeling as an Alternative
An SEM-based propensity score framework incorporates the latent variable to adjust for measurement errors on manifest variables. Propensity scores can be estimated through a hybrid SEM model consisting of two equations. The first equation, a measurement model, captures the linear relationship between the latent X* and observed X in both the treatment (D = 1) and control (D = 0) group. The second equation, a structural model, is equivalent to the latent variable propensity score model in Equation (2) of Jakubowski (2015, p. 1291); it captures the nonlinear relationship between the latent X* and a latent propensity score Pr(D=1|X*).
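One way such a two-equation model could be written, for a single latent variable with $J$ manifest indicators, is sketched below. The symbols $\nu_j$, $\lambda_j$, $\varepsilon_{ij}$, $\alpha$, and $\gamma$ are assumptions introduced here for illustration and may differ from the notation of the original equations.

```latex
\begin{aligned}
X_{ij} &= \nu_j + \lambda_j X^{*}_{i} + \varepsilon_{ij}, \qquad j = 1, \dots, J
  && \text{(measurement model)} \\
\ln \frac{\Pr(D_i = 1 \mid X^{*}_{i})}{1 - \Pr(D_i = 1 \mid X^{*}_{i})}
  &= \alpha + \gamma X^{*}_{i}
  && \text{(structural model)}
\end{aligned}
```

Under this form, an observed indicator can differ between groups through its intercept, through the latent variable, or through the measurement error term.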
Adopting a latent variable approach circumvents the post-hoc coefficient adjustment (e.g., Weller et al., 2007) discussed above. The SEM-based propensity scores can be used in matching (Jakubowski, 2015). Note that the latent propensity score Pr(D=1|X*) and latent X* have a one-to-one functional relationship in the unidimensional case. Matching on estimated propensity scores is mathematically equivalent to matching on the estimated factor scores of the latent X*.
Factor scores of the latent X* such as academic proficiency measures and ability constructs have been used to match individuals in order to achieve comparable groups (e.g., classical true score in Van der Linden & Hambleton, 1997). The latent construct is measured by multiple manifest items. In the most commonly used item response theory, individual ability is calibrated through a set of test items with presumptive difficulty and discrimination (Lord & Novick, 1968). The calibrated ability estimate represents an examinee's academic propensity that the set of test items is designed to measure. However, matching only on latent variables may fail to remove bias due to other omitted covariates H that are free of measurement error. A composite measure such as a propensity score that summarizes both the latent X* and covariates H becomes necessary in matching. Wang, Maier and Houang (2017) simulated multi-level data and examined how omitted covariates attenuate the bias reduction rate in SEM-based propensity score matching. However, the issue of measurement error was not taken into account in Wang et al. (2017; see pp. 72-73). Using the same multi-level SEM-based simulation (Wang, 2015; Wang et al., 2017) and the measurement-error-adjusted propensity score model of Equation (1), this study compares latent variable matching and matching through manifest variables to examine the effect of measurement error.

Longitudinal Design vs. Synthetic Cohort Design
Simulated quasi-experimental synthetic cohort design (SCD; Wiley & Wolfe, 1992) data with measurement errors are generated based on the Second International Mathematics Study (SIMS; International Association for the Evaluation of Educational Achievement, 1977). SIMS used a longitudinal design to study the effects of curriculum and classroom instruction targeted at the 8th grade (Cohort 2). Two waves of mathematics achievement data were collected, at the beginning (Time 0) and end of the school year (Time 1), respectively. Cohort 2 at Time 0 was in the control condition. After the "treatment" of one year of schooling, Cohort 2 at Time 1 data were collected to assess the schooling effect ($\delta_{C2T1-C2T0}$), defined as the average of "changes in mathematics achievement over the time-span of one school year at the particular grade level" (Wiley & Wolfe, 1992, p. 299).
In practice, Cohort 2 at Time 0 data may not be collected in a study. In that case, an alternative cohort, Cohort 1 at Time 1 (e.g., grade 7 at Year i), can serve as the "control" group to estimate the schooling effect, denoted now as $\delta_{C2T1-C1T1}$. This design is called the synthetic cohort design (SCD; Wiley & Wolfe, 1992), which has been widely used in aging and epidemiological studies (e.g., Heimberg, Stein, Hiripi, & Kessler, 2000; Kessler, Stein, & Berglund, 1998). The SCD was also used for cross-national comparisons of schooling (Wiley & Wolfe, 1992) in the Third International Mathematics and Science Study 1995 (TIMSS 1995). In this design, the schooling effect is determined by comparing data of two adjacent grades: 7th (Cohort 1) and 8th grade (Cohort 2, the focal cohort). The two cohorts are measured at the same time point (Time 1). The SCD by nature is a quasi-longitudinal design, where Cohort 1 at Time 1 data are treated as the "replacement" of Cohort 2 at Time 0 data to estimate the schooling effect ($\delta_{C2T1-C1T1}$). The schooling effect estimation bias in SCD is the difference between the SCD estimate and the longitudinal estimate, $\delta_{C2T1-C1T1} - \delta_{C2T1-C2T0}$.
In order to create a quasi-experimental SCD, it is necessary to generate Cohort 1 at Time 1 (7th grade in year i+1) data that are not comparable with Cohort 2 at Time 0 (8th grade in year i) due to measurement errors, so that matching can be used to reduce the simulated selection bias and to decrease the estimation bias of the schooling effect.
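The relation between the longitudinal estimate, the SCD estimate, and the SCD estimation bias can be made concrete with a small arithmetic sketch. It assumes the bias is the SCD estimate minus the longitudinal estimate; all score distributions below are hypothetical and are not SIMS values.

```python
import numpy as np

def schooling_effect(post_scores, baseline_scores):
    """Difference between the mean post score and the mean baseline score."""
    return float(np.mean(post_scores) - np.mean(baseline_scores))

rng = np.random.default_rng(3)
c2t0 = rng.normal(50.0, 10.0, 500)          # Cohort 2 at Time 0 (8th grade, start of year)
c2t1 = c2t0 + rng.normal(5.0, 3.0, 500)     # Cohort 2 at Time 1 (after one year of schooling)
c1t1 = rng.normal(48.0, 10.0, 500)          # Cohort 1 at Time 1 (7th grade, the stand-in baseline)

delta_longitudinal = schooling_effect(c2t1, c2t0)   # baseline actually observed
delta_scd = schooling_effect(c2t1, c1t1)            # baseline replaced by Cohort 1
scd_bias = delta_scd - delta_longitudinal

# The post scores cancel, so the bias is the mean gap between the two baselines.
print(round(scd_bias, 3), round(float(c2t0.mean() - c1t1.mean()), 3))
```

Because the Cohort 2 at Time 1 mean cancels, the estimation bias depends only on how well Cohort 1 at Time 1 stands in for Cohort 2 at Time 0, which is what matching is meant to improve.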
The simulation design model is based on data collected in the United States (SIMS-USA; Wolfe, 1987), one of the seven countries that collected longitudinal data in SIMS. The final data set includes 126 regular classes and 2,296 students. The average class size is about 27. Tables 1 and 2 list the descriptive statistics of the outcome variables, covariates, and manifest variables. The selection of variables for the model is based on previous studies (Schmidt & Burstein, 1992).

Simulated Two-level Structural Equation Model
The proposed two-level SEM (Muthén, 1994) is shown in Figure 1. In the level-1 equation, the post-test score is predicted by the pre-test score, which is predicted by four student characteristics and five latent variables. The latent constructs and their manifest variables are listed in Table 1. In the level-2 model, the intercept of pre-test ($\beta_0$) is predicted by four class-level variables.
Mplus (Muthén & Muthén, 1998-2015) is used to estimate factor loadings, regression coefficients, and residual variances (see Appendix). These model-based estimates are treated as known parameter values to generate longitudinal data of Cohort 2 at Time 0 and Time 1. The population-level SES variables are manipulated using Equation (6) in the simulation descriptions below. Our data-driven approach borrows the "sampling study" metric from MacCallum, Roznowski and Necowitz (1992), who treated the observed data as the "population", from which random samples were drawn for simulation. Our approach differs from theirs in that we created a hypothetical population from which to draw data for simulation. The SIMS-USA data were collected to represent 3,681,939 8th graders nested in 136,368 classes across the seven strata in the United States (Wolfe, 1987). The simulated pseudo-population includes 12,600 classes and 345,000 students with an average class size of 27.13.

Table 3. Four Simulations of Matching on Latent and Manifest Variables
Note: ICC: intraclass correlation, which varied from 0.023 to 0.322 in the most commonly used large-scale surveys on hierarchically structured data (Hedges & Hedberg, 2007).
Due to its practical importance in education studies, only the latent variable SES and its four manifest variables are manipulated to simulate selection bias. In this study, selection bias is indicated by the non-comparability between Cohort 2 at Time 0 (the treatment group) and Cohort 1 at Time 1 (the control group). The reliability coefficient (Lord & Novick, 1968; Raykov, 1997) was computed as 0.25 in Cohort 2 at Time 0.

Generated Control Group Cohort 1 at Time 1 Data
Data generation of Cohort 1 at Time 1 involves manipulating random measurement errors and reliability values to simulate selection bias. The four manipulations are summarized in Table 3. Each manipulation represents one source of simulated selection bias, which causes non-comparability between Cohort 2 at Time 0 data and Cohort 1 at Time 1 data. In the second simulation, the manifest variables of Cohort 1 at Time 1 are generated through a multivariate normal distribution. The residual variances of the four manifest variables are reduced by 90%. In turn, the computed reliability coefficient is increased to 0.78. A larger reliability coefficient indicates a stronger relationship between the four manifest variables and the latent variable in Cohort 1 at Time 1. This larger reliability is similar to a value that has been studied in Steiner et al. (2011), Rodriguez de Gil et al. (2015), and Webb-Vargas et al. (2015).
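The link between residual variances and reliability can be checked with a short calculation. The sketch below assumes a composite reliability of the form $(\sum\lambda)^2\phi / [(\sum\lambda)^2\phi + \sum\theta]$ for a unit-weighted composite; the unit loadings, unit factor variance, and residual variances of 12 are hypothetical values chosen only so that the starting reliability is 0.25, and they are not the estimates reported in the Appendix.

```python
import numpy as np

def composite_reliability(loadings, factor_var, residual_vars):
    """Share of the composite's variance that is due to the latent variable."""
    true_var = np.sum(loadings) ** 2 * factor_var
    error_var = np.sum(residual_vars)
    return float(true_var / (true_var + error_var))

loadings = np.ones(4)               # hypothetical loadings of the four manifest variables
factor_var = 1.0                    # hypothetical latent variance
residual_vars = np.full(4, 12.0)    # hypothetical residual variances giving reliability 0.25

low = composite_reliability(loadings, factor_var, residual_vars)
# Reduce every residual variance by 90%, as in the second simulation.
high = composite_reliability(loadings, factor_var, 0.1 * residual_vars)

print(round(low, 3), round(high, 3))
```

Under these assumed values the reliability rises from 0.25 to roughly 0.77, which is in line with the reported 0.78; the exact figure depends on the actual loadings and on rounding of the starting reliability.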
Simulation 3: C1T1's manifest variables have higher reliability, with a different latent variable mean from C2T0's.
In the third simulation, the manifest variables of Cohort 1 at Time 1 have a higher reliability of 0.78; the manifest variables of Cohort 2 at Time 0 have a reliability of 0.25. In addition, the latent variable mean in Cohort 1 at Time 1 is 0.68, which is half of the standard deviation of the latent variable in Cohort 2 at Time 0. Because of the latent mean difference, the manifest variable means of the two cohorts differ by a constant vector c, based on Equation (6) above. In Simulation 1 or 2, the latent variable mean of Cohort 1 at Time 1 is equal to 0 (see Simulations One and Two in Table 4).
Simulation 4: C1T1's latent variable mean differs from C2T0's, with the same high reliability.
In the fourth simulation, both Cohort 1 at Time 1 and Cohort 2 at Time 0 have a reliability of 0.78. This is achieved by using the manipulation discussed in Simulation 2. The mean of the latent variable in Cohort 1 at Time 1 is manipulated in the same way as that discussed in Simulation 3.
The R (R Development Core Team, 2007) package MatchIt (Ho, Imai, King & Stuart, 2011) carries out the four types of matching for each manipulation. The first matching, Propensity Score Matching Based on Manifest Variables (PSMMV), uses "naïve" propensity scores estimated through manifest variables. The second, Propensity Score Matching Based on Latent Variable (PSMLV), uses "true" propensity scores estimated from the latent variable. The third, Matching on Factor Score (MFS), treats estimated factor scores as propensity-score-like measures to reduce bias. The factor score is estimated through Mplus (Muthén & Muthén, 1998-2015). The last matching, Mahalanobis Distance Matching Based on Factor Score (MDMFS), uses the Mahalanobis distance of the estimated factor scores.
If a unit can be re-used in matching, it is called matching with replacement (Austin, 2014). Using simulated multilevel data, Wang et al. (2017) only conducted matching without replacement and suggested that future studies should examine matching with replacement. In our study, each of the four types of matching is conducted with and without replacement. Implementing the same settings as in Wang (2015) and Wang et al. (2017), we set the caliper (Stuart & Rubin, 2008) at 0.2 and 0.01. The simulation design is 4 (simulations) × 4 (types of matching) × 2 (with/without replacement) × 2 (calipers), which determines the structure of Table 4. Each condition is simulated with 1,000 replications. Each replication randomly draws 100 treatment classes and 100 control classes, with an average class size of 27. The sample size of each replication is on average 5,400.
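The factorial structure of the design can be enumerated directly. The sketch below only counts conditions and reproduces the average sample size per replication; the variable names and labels are illustrative and are not taken from the authors' scripts.

```python
from itertools import product

simulations = [1, 2, 3, 4]
matching_methods = ["PSMMV", "PSMLV", "MFS", "MDMFS"]
with_replacement = [False, True]
calipers = [0.2, 0.01]

# Every cell of the 4 x 4 x 2 x 2 design.
conditions = list(product(simulations, matching_methods, with_replacement, calipers))

replications_per_condition = 1000
classes_per_replication = 100 + 100          # treatment classes plus control classes
average_class_size = 27
students_per_replication = classes_per_replication * average_class_size

print(len(conditions), "conditions;", students_per_replication, "students per replication on average")
```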

Simulation Evaluation
The estimation bias reduction rate (Cochran & Rubin, 1973; Stuart & Rubin, 2008) is computed as $100 \times (1 - \text{estimation bias after matching} / \text{estimation bias before matching})\%$. A larger bias reduction rate indicates a better performance of matching. For each of the 1,000 replications, there are four matching methods. If, in a replication, the initial bias of the two cohorts is less than 0.5 standard deviations of the 1,000 initial biases, then the two cohorts are considered comparable and that replication's matching results are not used to compute the bias reduction rate. Table 4 summarizes the results of the four manipulation studies.
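The evaluation rule can be sketched as follows. The function names are hypothetical, absolute values of the biases are assumed, and the screening step assumes that "less than 0.5 standard deviations" refers to the absolute initial bias compared with half the standard deviation of all initial biases; the four example biases are made up.

```python
import numpy as np

def bias_reduction_rate(bias_before, bias_after):
    """100 * (1 - |bias after matching| / |bias before matching|), in percent.

    A negative value means that matching increased the estimation bias.
    """
    return 100.0 * (1.0 - abs(bias_after) / abs(bias_before))

def screen_replications(initial_biases, threshold_sd=0.5):
    """Boolean mask of replications whose initial bias is large enough to keep."""
    initial_biases = np.asarray(initial_biases, dtype=float)
    cutoff = threshold_sd * initial_biases.std(ddof=1)
    return np.abs(initial_biases) >= cutoff

# Hypothetical initial biases from four replications.
initial = [0.0, 0.1, 2.0, -2.0]
keep = screen_replications(initial)

print(keep)
print(bias_reduction_rate(0.5, 0.2))    # bias shrinks after matching
print(bias_reduction_rate(0.5, 0.55))   # bias grows after matching
```

A negative rate corresponds to the cases reported below in which matching "increases estimation bias".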

Simulation 1

Propensity Score Matching Based on Manifest Variables (PSMMV)
Matching on propensity scores estimated from the four manifest variables through a larger caliper (0.2) without replacement reduces the schooling effect estimation bias in SCD (shortened as "estimation bias") by 60.77%, and with replacement by 58.66%. Matching on a smaller caliper (0.01) without replacement reduces estimation bias by 56.59%, and with replacement by 54.68%.

Propensity Score Matching Based on Latent Variable (PSMLV)
Matching on propensity scores estimated from factor scores through a larger caliper without replacement reduces estimation bias by 2.83%; however, matching with replacement through a larger caliper increases estimation bias by 4.59%. Matching through a smaller caliper without replacement increases estimation bias by 3.38%, and with replacement by 2.64%.

Matching on Factor Score (MFS)
When the estimated factor score is used as a propensity-score-like measure to match through a larger caliper without replacement, it reduces estimation bias by 2.15%, but with replacement increases estimation bias by 4.59%. Matching on a smaller caliper without replacement increases estimation bias by 2.31%, and with replacement by 2.64%.

Mahalanobis Distance Matching Based on Factor Score (MDMFS)
If the Mahalanobis distance of the estimated factor score is used for matching on a larger caliper without replacement, it reduces estimation bias by 0.60%, but with replacement increases estimation bias by 9.30%. Matching on a smaller caliper without replacement increases estimation bias by 4.34%, and with replacement by 19.5%.
In summary, when the latent variable could not represent the manifest variables well (i.e., low reliability), matching based on manifest variables was optimal. Using a larger caliper reduced more bias than using a smaller caliper. Given the same caliper, matching without replacement reduced more bias than matching with replacement.

Simulation 2
In Simulation 2, C1T1's manifest variables have higher reliability (0.78) than C2T0's (0.25), with the same manifest means and the same latent means.

Propensity Score Matching Based on Manifest Variables
Matching on propensity scores estimated from the four manifest variables through a larger caliper without replacement increases estimation bias by 4.68%, but with replacement reduces estimation bias by 5.74%. Matching through a smaller caliper without replacement reduces estimation bias by 2.62%, but with replacement increases estimation bias by 4.85%.

Propensity Score Matching Based on Latent Variable
Matching on propensity scores estimated from factor scores through a larger caliper without replacement reduces estimation bias by 0.15%, but with replacement increases estimation bias by 2.14%. Matching through a smaller caliper without replacement reduces estimation bias by 8.35%, but with replacement increases estimation bias by 6.50%.

Matching on Factor Score
Matching on factor scores through a larger caliper without replacement reduces estimation bias by 0.59%, and with replacement by 3.26%. Matching through a smaller caliper without replacement reduces estimation bias by 9.26%, but with replacement increases estimation bias by 6.50%.

Mahalanobis Distance Matching Based on Factor Score
Mahalanobis distance matching based on estimated factor scores through a larger caliper without replacement reduces estimation bias by 2.04%, but with replacement increases estimation bias by 4.85%. Matching through a smaller caliper without replacement reduces estimation bias by 5.40%, but with replacement increases estimation bias by 16.05%.
In summary, when there was no difference between the two groups, matching based on manifest variables or the latent variable was not necessary because little bias was reduced.

Simulation 3

Propensity Score Matching Based on Manifest Variables
Matching on propensity scores estimated from the four manifest variables through a larger caliper without replacement reduces estimation bias by 50.37%, and with replacement by 54.47%. Matching through a smaller caliper without replacement reduces estimation bias by 52.65%, and with replacement by 58.53%.

Propensity Score Matching Based on Latent Variable
Matching on propensity scores estimated from factor scores through a larger caliper without replacement reduces estimation bias by 49.46%, and with replacement by 47.43%. Matching through a smaller caliper without replacement reduces estimation bias by 50.22%, and with replacement by 56.38%.

Matching on Factor Score
Matching on factor scores through a larger caliper without replacement reduces estimation bias by 46.70%, and with replacement by 47.43%. Matching through a smaller caliper without replacement reduces estimation bias by 49.06%, and with replacement by 56.38%.

Mahalanobis Distance Matching Based on Factor Score
Mahalanobis distance matching based on estimated factor scores through a larger caliper without replacement reduces estimation bias by 4.70%, and with replacement by 48.87%. Matching through a smaller caliper without replacement reduces estimation bias by 13.98%, and with replacement by 52.68%.
In summary, when the latent variable represented the manifest variables well in the treatment group (i.e., high reliability), matching based on manifest variables and matching based on the latent variable were equally optimal. In contrast to Simulation 1's results, using a smaller caliper reduced more bias than using a larger caliper. Matching with replacement and a smaller caliper produced the best results. Given the same caliper, matching with replacement reduced more bias than matching without replacement. Mahalanobis matching without replacement was the least efficient among the four matching methods. Mahalanobis matching with replacement was comparatively optimal, and using a smaller caliper reduced more bias than using a larger caliper.

Simulation 4

Propensity Score Matching Based on Manifest Variables
Matching on propensity scores estimated from the four manifest variables through a larger caliper without replacement reduces estimation bias by 55.93%, and with replacement by 57.11%. Matching through a smaller caliper without replacement reduces estimation bias by 54.93%, and with replacement by 61.95%.

Propensity Score Matching Based on Latent Variable
Matching on propensity scores estimated from factor scores through a larger caliper without replacement reduces estimation bias by 55.10%, and with replacement by 56.34%. Matching through a smaller caliper without replacement reduces estimation bias by 54.51%, and with replacement by 61.56%.

Matching on Factor Score
Matching on factor scores through a larger caliper without replacement reduces estimation bias by 56.83%, and with replacement by 56.34%. Matching through a smaller caliper without replacement reduces estimation bias by 54.35%, and with replacement by 61.56%.

Mahalanobis Distance Matching Based on Factor Score
Mahalanobis distance matching based on estimated factor scores through a larger caliper without replacement reduces estimation bias by 3.43%, and with replacement by 52.22%. Matching through a smaller caliper without replacement reduces estimation bias by 13.74%, and with replacement by 53.03%.
In summary, when the latent variable represented the manifest variables well in both treatment and control groups, matching based on manifest variables and matching based on the latent variable were equally optimal. Compared with Simulation 3's results, using a more reliable control group (C2T0) in Simulation 4's matching achieved the best results. Matching with replacement and a smaller caliper was the best choice. Matching with replacement outperformed matching without replacement. In matching without replacement, using a larger caliper reduced more bias than using a smaller caliper. For matching with replacement, using a smaller caliper reduced more bias. Mahalanobis matching was the least efficient among the four matching methods. It performed well only when matching with replacement was used, in which case using a smaller caliper reduced more bias than using a larger caliper.

Discussion and Conclusion
Matching on Factor Score Works Sufficiently and Better with Higher Reliability
Measurement error has a negative effect on bias reduction through a latent variable matching framework. If a latent variable, rather than measurement error, mainly accounts for the variation among manifest variables, then matching through the latent variable itself will be equivalent to matching through the propensity score that was computed from the latent variable. This result supports the previous practice of using latent ability or academic proficiency estimates to match treatment and control cohort participants (e.g., Van der Linden & Hambleton, 1997). Latent variable matching will achieve a better bias reduction result if the treatment and control cohorts both have high-reliability measures than if only one of the two groups has a high-reliability measure. In cases that involve multiple manifest variables, latent variable matching will be preferable because it is more efficient than matching on the respective manifest variables. It is worth noting that latent variable matching approaches are effective if the two cohorts' factor score means are different. If, however, the two cohorts are comparable in terms of the latent variable means, then matching through the latent variable is not necessary.

Matching Based on Manifest Variables with Measurement Error Is Sufficient
In practice, studies often use data that differ in quality and type in terms of measurement reliability, which requires different matching options and leads to inconsistent results. Measurement error has a case-by-case effect on propensity matching and does not necessarily attenuate the causal effect (Battistin & Chesher, 2014). A naïve propensity score can work as well as error-corrected propensity score methods (McCaffrey et al., 2011). Webb-Vargas et al. (2015) also found that a naïve covariate method and an imputation-based method worked equally well in propensity score analysis on real data. Findings in Jakubowski (2015) were inconsistent. Matching based on manifest variables can work slightly worse (Table 3 Model B and C, p. 1302) than matching based on the latent variable. However, matching based on manifest variables can work slightly better when reliability is the lowest due to measurement error variance and lack of common support (Table 3 Model D, p. 1302). Using manifest variables with measurement error for propensity score analysis can show an advantage in bias reduction on treatment effect estimation, specifically when the latent composite variable is under-representative in observational studies. This study focuses on bias reduction on the covariate rather than on treatment effect estimation. We found that manifest variable based propensity score matching works comparably to latent variable based propensity score matching.

Measurement Complexity and Mixed Matching Effects Request More Latent Variable Based Research on the Topic
This study demonstrates a complicated picture of matching through manifest variables and/or a latent variable, because selection bias may be due to the intercept, the latent variable, and measurement error, as shown in Equation (6). Simulation One indicates that matching on manifest variables outperforms matching on the latent variable. That is, if the two cohorts differ only on poorly measured manifest variables with considerable error, then latent variable matching works inefficiently to balance the manifest variables; however, propensity score matching through these manifest variables is still optimal. In this situation, selection bias on the covariate is NOT due to the latent variable but to measurement error. Latent variable based propensity score matching is then not sufficient to reduce selection bias on the covariate. However, manifest variable based propensity score matching will work, because a manifest variable contains an extra latent variable, i.e., measurement error. In practice, if the treatment and control groups are hypothesized to have the same distribution of latent ability, matching on a propensity score estimated through manifest variables (e.g., observed performance measures) should be practical and optimal.
If the two cohorts differ in terms of the latent variable, as in Simulations Three and Four, then matching on factor scores or on propensity scores estimated from the latent variable works as well as matching on propensity scores estimated from manifest variables. In this situation, because the manifest variable is a linear function of the latent variable and measurement error, increasing reliability will improve the performance of both latent variable matching and manifest variable matching. This implies that, if the latent score does exist, it can be used directly for matching, and there is no need to estimate the propensity score through multiple manifest variables that possibly have measurement error. In Simulation Two, the bias reduction rates of all the proposed matching methods are small and close to zero. It simulates a situation where the treatment and control groups are comparable. In this situation, matching is NOT necessary, because it does nothing to improve the comparability of the treatment and control groups.
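The role of reliability in this argument can be made explicit with a one-indicator sketch. The notation below (intercept \nu, loading \lambda > 0, latent mean \kappa, latent variance \psi, error variance \theta, reliability \rho) is assumed here for illustration and is not necessarily the notation of Equation (6).

```latex
\begin{align*}
x_g &= \nu_g + \lambda\,\eta_g + \varepsilon_g,
  \qquad \eta_g \sim (\kappa_g,\ \psi), \quad \varepsilon_g \sim (0,\ \theta),
  \quad g \in \{\mathrm{C1T1},\ \mathrm{C2T0}\} \\[4pt]
\rho &= \frac{\lambda^{2}\psi}{\lambda^{2}\psi + \theta} \\[4pt]
E[x_{\mathrm{C1T1}}] - E[x_{\mathrm{C2T0}}]
  &= \underbrace{(\nu_{\mathrm{C1T1}} - \nu_{\mathrm{C2T0}})}_{\text{intercept part}}
   + \underbrace{\lambda\,(\kappa_{\mathrm{C1T1}} - \kappa_{\mathrm{C2T0}})}_{\text{latent part}} \\[4pt]
\text{equal intercepts:}\quad
\frac{E[x_{\mathrm{C1T1}}] - E[x_{\mathrm{C2T0}}]}{\sqrt{\lambda^{2}\psi + \theta}}
  &= \sqrt{\rho}\;\frac{\kappa_{\mathrm{C1T1}} - \kappa_{\mathrm{C2T0}}}{\sqrt{\psi}}
\end{align*}
```

Under these assumptions, a latent mean shift of half a latent standard deviation appears in the manifest variable as a standardized difference of 0.5\sqrt{\rho}, that is, 0.25 when \rho = 0.25. When the latent means are equal and only the intercepts differ, the whole manifest difference sits in the intercept part, which a score based on \eta alone cannot see.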

Matching Options (Replacement and Caliper) Interact with Measurement Reliability
Matching options include whether matching allows replacement and whether a smaller caliper is used. Austin (2011) found that optimal bias reduction rates are obtained when the caliper is in a range from 0 to 0.4 (p. 153). Lunt (2014) recommended using a tighter caliper because bias reduction performance worsens as the caliper becomes larger. Austin (2014), simulating non-hierarchical data, found that matching with replacement performed as well as caliper matching without replacement. Wang et al. (2017) found that caliper matching interacts with data structure. They recommended that "different sizes of caliper should be used for level-1 [i.e., student-level] and level-2 [i.e., class-level] matching" (p. 67). Our student-level matching results showed an interaction between matching options and data quality (i.e., measurement reliability). When measures for the two cohorts have equally low reliability, either using a larger caliper or matching without replacement will reduce more bias; matching without replacement on a larger caliper is optimal. However, the trend is reversed when the two cohorts have equally high reliability. That is, either using a smaller caliper or matching with replacement improves matching performance, and matching with replacement on a smaller caliper performs the best.
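The two matching options discussed above can be illustrated with a minimal Python sketch. It is not the matching routine used in this study: it assumes greedy 1:1 nearest-neighbor matching on a single score (a propensity score or a factor score), a caliper expressed in the units of that score, and the common definition of the bias reduction rate as 100 × (1 − |difference after matching| / |difference before matching|). The function names and the toy scores are hypothetical.

```python
from statistics import mean


def caliper_match(treated, control, caliper, replace):
    """Greedy 1:1 nearest-neighbor matching on a scalar score.

    Returns (treated_index, control_index) pairs whose score
    distance is within the caliper.
    """
    available = list(range(len(control)))
    pairs = []
    for i, score in enumerate(treated):
        if not available:
            break
        # Closest control unit that is still eligible
        best = min(available, key=lambda j: abs(control[j] - score))
        if abs(control[best] - score) <= caliper:
            pairs.append((i, best))
            if not replace:
                available.remove(best)  # each control is used at most once
    return pairs


def bias_reduction_rate(treated, control, pairs):
    """Percent reduction in the absolute mean difference of the score.

    Assumes at least one matched pair and a nonzero difference before matching.
    """
    before = mean(treated) - mean(control)
    after = mean(treated[i] for i, _ in pairs) - mean(control[j] for _, j in pairs)
    return 100.0 * (1.0 - abs(after) / abs(before))


# Hypothetical scores chosen only to show the mechanics
treated = [0.50, 0.52]
control = [0.51, 0.80]

with_replacement = caliper_match(treated, control, caliper=0.1, replace=True)
without_small = caliper_match(treated, control, caliper=0.1, replace=False)
without_large = caliper_match(treated, control, caliper=0.3, replace=False)

for label, pairs in [("with replacement, caliper 0.1", with_replacement),
                     ("without replacement, caliper 0.1", without_small),
                     ("without replacement, caliper 0.3", without_large)]:
    print(label, pairs, round(bias_reduction_rate(treated, control, pairs), 1))
```

In this toy example, matching with replacement lets both treated units share the one close control, whereas matching without replacement either drops a treated unit (small caliper) or accepts a distant control (large caliper). The example shows only the mechanics; which combination reduces more bias in practice depends on measurement reliability, as described above.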

Mahalanobis Distance Matching Is Comparatively Nonsufficient
Previous studies have found that post-hoc matching depends on the summary measure that is a functional composite of covariates (Rubin, 1985). The most commonly used composites are the Mahalanobis distance (e.g., Rubin, 1980) and the propensity score (Rosenbaum & Rubin, 1983). Mahalanobis distance matching is less effective than propensity matching on multilevel data (Wang, 2015). Similarly, this study found that Mahalanobis distance matching reduces bias less effectively than either propensity score matching or factor score matching. Mahalanobis matching with replacement was comparatively optimal only when the treatment group data are highly reliable. Propensity score matching generally performs better than Mahalanobis distance matching when the true propensity score model is known and the sample size is large (Sekhon & Diamond, 2008). The simulation settings of this study favor propensity score matching. Each simulated condition determines a "true" and known propensity score model. Each of the 1000 replications uses 100 classes, and the sample size is approximately 5,400.
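For readers less familiar with the Mahalanobis composite, the sketch below computes it for two covariates using only the Python standard library. It is a generic illustration of the distance itself, not the MDMFS procedure of this study; the function names and the example points are hypothetical, and the covariance is passed as the triple (variance of the first covariate, covariance, variance of the second covariate).

```python
import math


def covariance_2d(points):
    """Sample variances and covariance of two covariates: (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(u, v, cov):
    """Mahalanobis distance between two units on two covariates."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2  # must be positive (non-singular covariance)
    dx, dy = u[0] - v[0], u[1] - v[1]
    # Quadratic form d' S^{-1} d with the closed-form 2x2 inverse
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


# With an identity covariance the distance reduces to Euclidean distance
identity_distance = mahalanobis_2d((0.0, 0.0), (3.0, 4.0), (1.0, 0.0, 1.0))

# A larger variance on the first covariate shrinks differences on it
scaled_distance = mahalanobis_2d((0.0, 0.0), (2.0, 0.0), (4.0, 0.0, 1.0))

pooled = covariance_2d([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
```

Unlike a propensity score, which collapses the covariates into one number through a fitted assignment model, this distance weights every covariate by its variance and covariance only, so noisy covariates contribute their error directly to the distance.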

Needs to Examine Both Level-1 and Level-2 Data With Measurement Errors
In educational studies, researchers often sample larger units from a hierarchically structured population (Cochran, 1963; Scott & Smith, 1969). Due to the hierarchical structure of the experimental design and data collection (Raudenbush & Sadoff, 2008), treatment units are generally classes or schools rather than individual students (Hedges, 2007). When clusters are assigned to interventions, non-comparable treatment-control groups can arise from either level-1 or level-2 covariates (Raab & Butcher, 2001), resulting in selection bias. This study focuses on manifest variables and the latent variable only at level-1, although the simulated model is based on a two-level latent variable framework. Future research should explore how level-2 measurement errors will affect the accuracy of matching in terms of the bias reduction rate when level-2 matching and dual matching (Wang, 2015) are needed.
Wang et al. Interdisciplinary Education and Psychology. 2017, 1:9.
Note: ᵃ 1 = not at all like, 2 = somehow unlike, 3 = unsure, 4 = somehow like, 5 = exactly like. ᵇ 1 = little schooling, 2 = primary school, 3 = secondary school, 4 = college or university or tertiary education. ᶜ 1 = unskilled worker, 2 = semi-unskilled worker, 3 = skilled worker lower, 4 = skilled worker higher, 5 = clerk sales and related lower, 6 = clerk sales and related higher, 7 = professional and managerial lower, 8
learn more math (inverse code, 1-5ᵃ)
YPWWELL: Parents want me to do well (1-5ᵃ)
YPENC: Parents encourage me to do well in math (inverse code, 1-5
do well in math (1-5)
YMORMTH: Looking forward to taking more math (1-5ᵃ)
YNOMOR: Will take no more math if possible (inverse code, 1
doing math (inverse code, 1-5ᵃ)
YFABLE: Father is able to do math home work (inverse code, 1-5ᵃ)
YMABLE: Mother is able to do math home work (inverse code, 1-5ᵃ)
Years of education parents expected (1-4ᵉ)
Homework: YMHWKT: Typical hours of math home work per week

/school-level variables. The intercept of the post-test (α₀) is predicted by β₀₀ and three class-level variables. The level-1 and level-2 residuals are mutually independent of one another.

Figure 1. Two-level structural equation model

Simulation 1: C1T1's manifest variable means differ from C2T0's, with the same latent means and low reliability.
In this simulated Cohort 1 at Time 1, the four manifest variables of the latent variable SES are generated through a multivariate normal distribution. Adding c₁ indicates that each manifest variable mean of Cohort 1 at Time 1 is 0.5 standard deviations larger than that of Cohort 2 at Time 0. The four means are denoted as µ = [3.720, 3.667, 5.328, 5.117]. The computed reliability coefficient for Cohort 1 at Time 1 is equal to 0.25, which is as low as that in Cohort 2 at Time 0. Reliability this low has not been examined in simulated propensity score analyses involving measurement error (reliability of .92 in McCaffrey et al., 2011; reliability from .5 to .9 in Steiner et al., 2011 and Rodriguez de Gil et al., 2015; reliability of .5, .7, .9 and .999 in Webb-Vargas et al., 2015).
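The kind of data this simulation describes can be sketched in a few lines of Python. The sketch is a simplified stand-in, not the study's generating model: it assumes a one-factor model with unit latent variance, applies the reliability of 0.25 to each indicator separately (the paper reports a single computed coefficient), centers the baseline cohort's manifest means at zero as a placeholder, and uses the four manifest variances quoted in the text (0.475, 0.405, 4.421, 3.916). All function and variable names are hypothetical.

```python
import math
import random

VARIANCES = [0.475, 0.405, 4.421, 3.916]  # manifest variances quoted in the text
RELIABILITY = 0.25  # assumption: applied to each indicator separately


def simulate_cohort(n, intercept_shift_sd, latent_mean, rng):
    """Draw n cases of (eta, [x1..x4]) from a one-factor model."""
    cases = []
    for _ in range(n):
        eta = rng.gauss(latent_mean, 1.0)  # latent SES, unit variance
        xs = []
        for v in VARIANCES:
            loading = math.sqrt(RELIABILITY * v)           # true-score share of v
            error_sd = math.sqrt((1.0 - RELIABILITY) * v)  # error share of v
            intercept = intercept_shift_sd * math.sqrt(v)  # shift in SD units
            xs.append(intercept + loading * eta + rng.gauss(0.0, error_sd))
        cases.append((eta, xs))
    return cases


def average(values):
    values = list(values)
    return sum(values) / len(values)


rng = random.Random(2017)
# Cohort 1 at Time 1: manifest means shifted by 0.5 SD, same latent mean
c1t1 = simulate_cohort(20000, 0.5, 0.0, rng)
# Cohort 2 at Time 0: baseline
c2t0 = simulate_cohort(20000, 0.0, 0.0, rng)

latent_gap = average(e for e, _ in c1t1) - average(e for e, _ in c2t0)
manifest_gaps = [
    (average(x[j] for _, x in c1t1) - average(x[j] for _, x in c2t0)) / math.sqrt(v)
    for j, v in enumerate(VARIANCES)
]
print(round(latent_gap, 3), [round(g, 3) for g in manifest_gaps])
```

Under these assumptions the cohorts differ by about half a standard deviation on every manifest variable while their latent means coincide, which is the condition under which a score built from the latent variable alone has nothing to balance.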

Simulation 2: C1T1's manifest variables have higher reliability than C2T0's, with the same manifest means and the same latent means.

Table 1. Level-1 Descriptive Statistics of the Final Two-level Structural Equation Model

Table 2. Level-2 Descriptive Statistics of the Final Two-level Structural Equation Model
Note. N = 126.

Treatment Group Cohort 2 at Time 0 Data
The manifest variables of the latent variable SES are generated through a multivariate normal distribution. The latent variable SES is associated with four manifest variables through the measurement model and includes Father's / Mother's education level (YFEDUC / YMEDUC) and Father's / Mother's occupation national code (YFOCCN / YMOCCN). The mean vector and the variance matrix of this distribution are computed from model parameters whose values are available in the Appendix. The computed variances of the four manifest variables are 0.475, 0.405, 4.421, and 3.916.

Table 4. Matching Results of Simulation Design: Four (Simulations) × Four (Types of Matching) × Two (with/out Replacement) × Two (Calipers)
Note: c₁ represents that each manifest variable mean of Cohort 1 at Time 1 is 0.5 standard deviations larger than that of Cohort 2 at Time 0; c₂ represents that the latent variable mean in Cohort 1 at Time 1 is increased by a half of the standard deviation of the latent variable η in Cohort 2 at Time 0. PSMMV: SES propensity score matching based on manifest variables; PSMLV: propensity score matching based on latent variable; MFS: matching on factor score; MDMFS: Mahalanobis distance matching based on factor score.