Design Effect in Multilevel Settings: A Commentary on a Latent Variable Modeling Procedure for Its Evaluation.
Pub Date: 2022-10-01. Epub Date: 2021-06-04. DOI: 10.1177/00131644211019447
Tenko Raykov, Christine DiStefano
A latent variable modeling-based procedure is discussed that permits researchers to readily obtain point and interval estimates of the design effect index in multilevel settings using widely circulated software. The method provides useful information about how the standard errors of important parameters change when clustering effects are accounted for, relative to conducting single-level analyses. The approach can also be employed as an addendum to point and interval estimation of the intraclass correlation coefficient in empirical research. The discussed procedure makes it easy to evaluate the design effect in two-level studies using the popular latent variable modeling methodology and is illustrated with an example.
{"title":"Design Effect in Multilevel Settings: A Commentary on a Latent Variable Modeling Procedure for Its Evaluation.","authors":"Tenko Raykov, Christine DiStefano","doi":"10.1177/00131644211019447","DOIUrl":"10.1177/00131644211019447","url":null,"abstract":"<p><p>A latent variable modeling-based procedure is discussed that permits to readily point and interval estimate the design effect index in multilevel settings using widely circulated software. The method provides useful information about the relationship of important parameter standard errors when accounting for clustering effects relative to conducting single-level analyses. The approach can also be employed as an addendum to point and interval estimation of the intraclass correlation coefficient in empirical research. The discussed procedure makes it easily possible to evaluate the design effect in two-level studies by utilizing the popular latent variable modeling methodology and is illustrated with an example.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 5","pages":"1020-1030"},"PeriodicalIF":2.7,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/00131644211019447","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40626319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Examining the Robustness of the Graded Response and 2-Parameter Logistic Models to Violations of Construct Normality.
Pub Date: 2022-10-01. Epub Date: 2022-01-07. DOI: 10.1177/00131644211063453
Patrick D Manapat, Michael C Edwards
When fitting unidimensional item response theory (IRT) models, the population distribution of the latent trait (θ) is often assumed to be normally distributed. However, some psychological theories would suggest a nonnormal θ. For example, some clinical traits (e.g., alcoholism, depression) are believed to follow a positively skewed distribution where the construct is low for most people, medium for some, and high for few. Failure to account for nonnormality may compromise the validity of inferences and conclusions. Although corrections have been developed to account for nonnormality, these methods can be computationally intensive and have not yet been widely adopted. Previous research has recommended implementing nonnormality corrections when θ is not "approximately normal." This research focused on examining how far θ can deviate from normal before the normality assumption becomes untenable. Specifically, our goal was to identify the type(s) and degree(s) of nonnormality that result in unacceptable parameter recovery for the graded response model (GRM) and 2-parameter logistic model (2PLM).
{"title":"Examining the Robustness of the Graded Response and 2-Parameter Logistic Models to Violations of Construct Normality.","authors":"Patrick D Manapat, Michael C Edwards","doi":"10.1177/00131644211063453","DOIUrl":"10.1177/00131644211063453","url":null,"abstract":"<p><p>When fitting unidimensional item response theory (IRT) models, the population distribution of the latent trait (θ) is often assumed to be normally distributed. However, some psychological theories would suggest a nonnormal θ. For example, some clinical traits (e.g., alcoholism, depression) are believed to follow a positively skewed distribution where the construct is low for most people, medium for some, and high for few. Failure to account for nonnormality may compromise the validity of inferences and conclusions. Although corrections have been developed to account for nonnormality, these methods can be computationally intensive and have not yet been widely adopted. Previous research has recommended implementing nonnormality corrections when θ is not \"approximately normal.\" This research focused on examining how far θ can deviate from normal before the normality assumption becomes untenable. Specifically, our goal was to identify the type(s) and degree(s) of nonnormality that result in unacceptable parameter recovery for the graded response model (GRM) and 2-parameter logistic model (2PLM).</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 5","pages":"967-988"},"PeriodicalIF":2.7,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9386882/pdf/10.1177_00131644211063453.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40626322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying Problematic Item Characteristics With Small Samples Using Mokken Scale Analysis.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211045347
Stefanie A Wind
Researchers frequently use Mokken scale analysis (MSA), a nonparametric approach to item response theory, when they have relatively small samples of examinees. Previous research has provided some guidance regarding the minimum sample size for applications of MSA under various conditions. However, these studies have not focused on item-level measurement problems, such as violations of monotonicity or invariant item ordering (IIO). Moreover, these studies have focused on problems that occur for a complete sample of examinees. The current study uses a simulation to examine the sensitivity of MSA item analysis procedures to problematic item characteristics that occur within limited ranges of the latent variable. Results generally support the use of MSA with small samples (N around 100 examinees) as long as multiple indicators of item quality are considered.
{"title":"Identifying Problematic Item Characteristics With Small Samples Using Mokken Scale Analysis.","authors":"Stefanie A Wind","doi":"10.1177/00131644211045347","DOIUrl":"https://doi.org/10.1177/00131644211045347","url":null,"abstract":"<p><p>Researchers frequently use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, when they have relatively small samples of examinees. Researchers have provided some guidance regarding the minimum sample size for applications of MSA under various conditions. However, these studies have not focused on item-level measurement problems, such as violations of monotonicity or invariant item ordering (IIO). Moreover, these studies have focused on problems that occur for a complete sample of examinees. The current study uses a simulation study to consider the sensitivity of MSA item analysis procedures to problematic item characteristics that occur within limited ranges of the latent variable. Results generally support the use of MSA with small samples (<i>N</i> around 100 examinees) as long as multiple indicators of item quality are considered.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"747-756"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228692/pdf/10.1177_00131644211045347.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Differential Rater Functioning in Severity and Centrality: The Dual DRF Facets Model.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211043207
Kuan-Yu Jin, Thomas Eckes
Performance assessments heavily rely on human ratings. These ratings are typically subject to various forms of error and bias, threatening the validity and fairness of the assessment outcomes. Differential rater functioning (DRF) is a special kind of threat to fairness that manifests itself in unwanted interactions between raters and performance- or construct-irrelevant factors (e.g., examinee gender, rater experience, or time of rating). Most DRF studies have focused on whether raters show differential severity toward known groups of examinees. This study expands the DRF framework and investigates the more complex case of dual DRF effects, where DRF is simultaneously present in rater severity and centrality. Adopting a facets modeling approach, we propose the dual DRF model (DDRFM) for detecting and measuring these effects. In two simulation studies, we found that dual DRF effects (a) negatively affected measurement quality and (b) can reliably be detected and compensated for under the DDRFM. Using sample data from a large-scale writing assessment (N = 1,323), we demonstrate the practical measurement consequences of the dual DRF effects. Findings have implications for researchers and practitioners assessing the psychometric quality of ratings.
{"title":"Detecting Differential Rater Functioning in Severity and Centrality: The Dual DRF Facets Model.","authors":"Kuan-Yu Jin, Thomas Eckes","doi":"10.1177/00131644211043207","DOIUrl":"https://doi.org/10.1177/00131644211043207","url":null,"abstract":"<p><p>Performance assessments heavily rely on human ratings. These ratings are typically subject to various forms of error and bias, threatening the assessment outcomes' validity and fairness. Differential rater functioning (DRF) is a special kind of threat to fairness manifesting itself in unwanted interactions between raters and performance- or construct-irrelevant factors (e.g., examinee gender, rater experience, or time of rating). Most DRF studies have focused on whether raters show differential severity toward known groups of examinees. This study expands the DRF framework and investigates the more complex case of dual DRF effects, where DRF is simultaneously present in rater severity and centrality. Adopting a facets modeling approach, we propose the dual DRF model (DDRFM) for detecting and measuring these effects. In two simulation studies, we found that dual DRF effects (a) negatively affected measurement quality and (b) can reliably be detected and compensated under the DDRFM. Using sample data from a large-scale writing assessment (<i>N</i> = 1,323), we demonstrate the practical measurement consequences of the dual DRF effects. Findings have implications for researchers and practitioners assessing the psychometric quality of ratings.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"757-781"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228693/pdf/10.1177_00131644211043207.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10271624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Response Vector for Mastery Method of Standard Setting.
Pub Date: 2022-08-01. Epub Date: 2021-07-21. DOI: 10.1177/00131644211032388
Dimiter M Dimitrov
Proposed is a new method of standard setting, referred to as the response vector for mastery (RVM) method. Under the RVM method, the task of the panelists who participate in the standard setting process involves neither conceptualizing a borderline examinee nor making probability judgments, as is the case with the Angoff and bookmark methods. Also, the RVM-based computation of a cut-score is not based on a single item (e.g., one marked in an ordered item booklet) but, instead, on a response vector (1/0 scores) on items whose parameters are calibrated in item response theory or under the recently developed D-scoring method. Illustrations with hypothetical and real-data scenarios of standard setting are provided, and methodological aspects of the RVM method are discussed.
{"title":"The Response Vector for Mastery Method of Standard Setting.","authors":"Dimiter M Dimitrov","doi":"10.1177/00131644211032388","DOIUrl":"10.1177/00131644211032388","url":null,"abstract":"<p><p>Proposed is a new method of standard setting referred to as response vector for mastery (RVM) method. Under the RVM method, the task of panelists that participate in the standard setting process does not involve conceptualization of a borderline examinee and probability judgments as it is the case with the Angoff and bookmark methods. Also, the RVM-based computation of a cut-score is not based on a single item (e.g., marked in an ordered item booklet) but, instead, on a response vector (1/0 scores) on items and their parameters calibrated in item response theory or under the recently developed <i>D</i>-scoring method. Illustrations with hypothetical and real-data scenarios of standard setting are provided and methodological aspects of the RVM method are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"719-746"},"PeriodicalIF":2.1,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228699/pdf/10.1177_00131644211032388.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10271623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DIF Detection With Zero-Inflation Under the Factor Mixture Modeling Framework.
Pub Date: 2022-08-01. Epub Date: 2021-07-26. DOI: 10.1177/00131644211028995
Sooyong Lee, Suhwa Han, Seung W Choi
Response data containing an excessive number of zeros are referred to as zero-inflated data. When differential item functioning (DIF) detection is of interest, zero-inflation can attenuate DIF effects in the total sample and lead to underdetection of DIF items. The current study presents a DIF detection procedure for response data with excess zeros due to the existence of unobserved heterogeneous subgroups. The suggested procedure utilizes factor mixture modeling (FMM) with multiple-indicator multiple-cause (MIMIC) models to address the compromised DIF detection power via the estimation of latent classes. A Monte Carlo simulation was conducted to evaluate the suggested procedure in comparison to the well-known likelihood ratio (LR) DIF test. Our simulation study results indicated the superiority of FMM over the LR DIF test in terms of detection power and illustrated the importance of accounting for latent heterogeneity in zero-inflated data. The empirical data analysis results further supported the use of FMM by flagging additional DIF items over and above the LR test.
{"title":"DIF Detection With Zero-Inflation Under the Factor Mixture Modeling Framework.","authors":"Sooyong Lee, Suhwa Han, Seung W Choi","doi":"10.1177/00131644211028995","DOIUrl":"10.1177/00131644211028995","url":null,"abstract":"<p><p>Response data containing an excessive number of zeros are referred to as zero-inflated data. When differential item functioning (DIF) detection is of interest, zero-inflation can attenuate DIF effects in the total sample and lead to underdetection of DIF items. The current study presents a DIF detection procedure for response data with excess zeros due to the existence of unobserved heterogeneous subgroups. The suggested procedure utilizes the factor mixture modeling (FMM) with MIMIC (multiple-indicator multiple-cause) to address the compromised DIF detection power via the estimation of latent classes. A Monte Carlo simulation was conducted to evaluate the suggested procedure in comparison to the well-known likelihood ratio (LR) DIF test. Our simulation study results indicated the superiority of FMM over the LR DIF test in terms of detection power and illustrated the importance of accounting for latent heterogeneity in zero-inflated data. The empirical data analysis results further supported the use of FMM by flagging additional DIF items over and above the LR test.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"678-704"},"PeriodicalIF":2.1,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228697/pdf/10.1177_00131644211028995.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extended Multivariate Generalizability Theory With Complex Design Structures.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211049746
Robert L Brennan, Stella Y Kim, Won-Chan Lee
This article extends multivariate generalizability theory (MGT) to tests with different random-effects designs for each level of a fixed facet. There are numerous situations in which the design of a test and the resulting data structure are not definable by a single design. One example is mixed-format tests that are composed of multiple-choice and free-response items, with the latter involving variability attributable to both items and raters. In this case, two distinct designs are needed to fully characterize the design and capture potential sources of error associated with each item format. Another example involves tests containing both testlets and one or more stand-alone sets of items. Testlet effects need to be taken into account for the testlet-based items, but not the stand-alone sets of items. This article presents an extension of MGT that faithfully models such complex test designs, along with two real-data examples. Among other things, these examples illustrate that estimates of error variance, error-tolerance ratios, and reliability-like coefficients can be biased if there is a mismatch between the user-specified universe of generalization and the complex nature of the test.
{"title":"Extended Multivariate Generalizability Theory With Complex Design Structures.","authors":"Robert L Brennan, Stella Y Kim, Won-Chan Lee","doi":"10.1177/00131644211049746","DOIUrl":"https://doi.org/10.1177/00131644211049746","url":null,"abstract":"<p><p>This article extends multivariate generalizability theory (MGT) to tests with different random-effects designs for each level of a fixed facet. There are numerous situations in which the design of a test and the resulting data structure are not definable by a single design. One example is mixed-format tests that are composed of multiple-choice and free-response items, with the latter involving variability attributable to both items and raters. In this case, two distinct designs are needed to fully characterize the design and capture potential sources of error associated with each item format. Another example involves tests containing both testlets and one or more stand-alone sets of items. Testlet effects need to be taken into account for the testlet-based items, but not the stand-alone sets of items. This article presents an extension of MGT that faithfully models such complex test designs, along with two real-data examples. Among other things, these examples illustrate that estimates of error variance, error-tolerance ratios, and reliability-like coefficients can be biased if there is a mismatch between the user-specified universe of generalization and the complex nature of the test.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"617-642"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228696/pdf/10.1177_00131644211049746.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hybrid Threshold-Based Sequential Procedures for Detecting Compromised Items in a Computerized Adaptive Testing Licensure Exam.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211023868
Chansoon Lee, Hong Qian
Using classical test theory and item response theory, this study applied sequential procedures to a real operational item pool in a variable-length computerized adaptive testing (CAT) environment to detect items whose security may have been compromised. Moreover, this study proposed a hybrid threshold approach to improve the detection power of the sequential procedure while controlling the Type I error rate. The hybrid threshold approach uses a local threshold for each item in an early stage of the CAT administration and then uses the global threshold in the decision-making stage. Applying various simulation factors, a series of simulation studies examined which factors contribute significantly to the power rate and lag time of the procedure. In addition to the simulation study, a case study investigated whether the procedures are applicable to the real item pool administered in CAT and can identify potentially compromised items in the pool. This research found that the increment in the probability of a correct answer (p-increment) was the simulation factor most important to the sequential procedures' ability to detect compromised items. This study also found that the local threshold approach improved power rates and shortened lag times when the p-increment was small. The findings of this study could help practitioners implement the sequential procedures using the hybrid threshold approach in real-time CAT administration.
{"title":"Hybrid Threshold-Based Sequential Procedures for Detecting Compromised Items in a Computerized Adaptive Testing Licensure Exam.","authors":"Chansoon Lee, Hong Qian","doi":"10.1177/00131644211023868","DOIUrl":"https://doi.org/10.1177/00131644211023868","url":null,"abstract":"<p><p>Using classical test theory and item response theory, this study applied sequential procedures to a real operational item pool in a variable-length computerized adaptive testing (CAT) to detect items whose security may be compromised. Moreover, this study proposed a hybrid threshold approach to improve the detection power of the sequential procedure while controlling the Type I error rate. The hybrid threshold approach uses a local threshold for each item in an early stage of the CAT administration, and then it uses the global threshold in the decision-making stage. Applying various simulation factors, a series of simulation studies examined which factors contribute significantly to the power rate and lag time of the procedure. In addition to the simulation study, a case study investigated whether the procedures are applicable to the real item pool administered in CAT and can identify potentially compromised items in the pool. This research found that the increment of probability of a correct answer (<i>p</i>-increment) was the simulation factor most important to the sequential procedures' ability to detect compromised items. This study also found that the local threshold approach improved power rates and shortened lag times when the <i>p</i>-increment was small. The findings of this study could help practitioners implement the sequential procedures using the hybrid threshold approach in real-time CAT administration.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"782-810"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/00131644211023868","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10272075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Monte Carlo Study of Confidence Interval Methods for Generalizability Coefficient.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211033899
Zhehan Jiang, Mark Raymond, Christine DiStefano, Dexin Shi, Ren Liu, Junhua Sun
Computing confidence intervals around generalizability coefficients has long been a challenging task in generalizability theory. This is a serious practical problem because generalizability coefficients are often computed from designs where some facets have small sample sizes, and researchers have little guidance regarding the trustworthiness of the coefficients. As generalizability theory can be framed as a linear mixed-effects model (LMM), bootstrap and simulation techniques from the LMM paradigm can be used to construct the confidence intervals. The purpose of this research is to examine four LMM-based methods that have been proposed for computing the confidence intervals and to determine their accuracy under six simulated conditions based on the type of test scores (normal, dichotomous, and polytomous data) and the measurement design (p×i×r and p×[i:r]). A bootstrap technique called "parametric methods with spherical random effects" consistently produced more accurate confidence intervals than the three other LMM-based methods. Furthermore, the selected technique was compared with a model-based approach to investigate performance at the level of variance components via a second simulation study, in which the numbers of examinees, raters, and items were varied. We conclude with the recommendation that, when reporting generalizability coefficients, the confidence interval should accompany the point estimate.
{"title":"A Monte Carlo Study of Confidence Interval Methods for Generalizability Coefficient.","authors":"Zhehan Jiang, Mark Raymond, Christine DiStefano, Dexin Shi, Ren Liu, Junhua Sun","doi":"10.1177/00131644211033899","DOIUrl":"https://doi.org/10.1177/00131644211033899","url":null,"abstract":"<p><p>Computing confidence intervals around generalizability coefficients has long been a challenging task in generalizability theory. This is a serious practical problem because generalizability coefficients are often computed from designs where some facets have small sample sizes, and researchers have little guide regarding the trustworthiness of the coefficients. As generalizability theory can be framed to a linear mixed-effect model (LMM), bootstrap and simulation techniques from LMM paradigm can be used to construct the confidence intervals. The purpose of this research is to examine four different LMM-based methods for computing the confidence intervals that have been proposed and to determine their accuracy under six simulated conditions based on the type of test scores (normal, dichotomous, and polytomous data) and data measurement design (<i>p</i>×<i>i</i>×<i>r</i> and <i>p</i>× [<i>i:r</i>]). A bootstrap technique called \"parametric methods with spherical random effects\" consistently produced more accurate confidence intervals than the three other LMM-based methods. Furthermore, the selected technique was compared with model-based approach to investigate the performance at the levels of variance components via the second simulation study, where the numbers of examines, raters, and items were varied. We conclude with the recommendation generalizability coefficients, the confidence interval should accompany the point estimate.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"705-718"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228698/pdf/10.1177_00131644211033899.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robustness of Adaptive Measurement of Change to Item Parameter Estimation Error.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211033902
Allison W Cooperman, David J Weiss, Chun Wang
Adaptive measurement of change (AMC) is a psychometric method for measuring intra-individual change on one or more latent traits across testing occasions. Three hypothesis tests-a Z test, likelihood ratio test, and score ratio index-have demonstrated desirable statistical properties in this context, including low false positive rates and high true positive rates. However, the extant AMC research has assumed that the item parameter values in the simulated item banks were devoid of estimation error. This assumption is unrealistic for applied testing settings, where item parameters are estimated from a calibration sample before test administration. Using Monte Carlo simulation, this study evaluated the robustness of the common AMC hypothesis tests to the presence of item parameter estimation error when measuring omnibus change across four testing occasions. Results indicated that item parameter estimation error had at most a small effect on false positive rates and latent trait change recovery, and these effects were largely explained by the computerized adaptive testing item bank information functions. Differences in AMC performance as a function of item parameter estimation error and choice of hypothesis test were generally limited to simulees with particularly low or high latent trait values, where the item bank provided relatively lower information. These simulations highlight how AMC can accurately measure intra-individual change in the presence of item parameter estimation error when paired with an informative item bank. Limitations and future directions for AMC research are discussed.
{"title":"Robustness of Adaptive Measurement of Change to Item Parameter Estimation Error.","authors":"Allison W Cooperman, David J Weiss, Chun Wang","doi":"10.1177/00131644211033902","DOIUrl":"https://doi.org/10.1177/00131644211033902","url":null,"abstract":"<p><p>Adaptive measurement of change (AMC) is a psychometric method for measuring intra-individual change on one or more latent traits across testing occasions. Three hypothesis tests-a <i>Z</i> test, likelihood ratio test, and score ratio index-have demonstrated desirable statistical properties in this context, including low false positive rates and high true positive rates. However, the extant AMC research has assumed that the item parameter values in the simulated item banks were devoid of estimation error. This assumption is unrealistic for applied testing settings, where item parameters are estimated from a calibration sample before test administration. Using Monte Carlo simulation, this study evaluated the robustness of the common AMC hypothesis tests to the presence of item parameter estimation error when measuring omnibus change across four testing occasions. Results indicated that item parameter estimation error had at most a small effect on false positive rates and latent trait change recovery, and these effects were largely explained by the computerized adaptive testing item bank information functions. Differences in AMC performance as a function of item parameter estimation error and choice of hypothesis test were generally limited to simulees with particularly low or high latent trait values, where the item bank provided relatively lower information. These simulations highlight how AMC can accurately measure intra-individual change in the presence of item parameter estimation error when paired with an informative item bank. Limitations and future directions for AMC research are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"643-677"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228691/pdf/10.1177_00131644211033902.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}