Missingness due to not-reached and omitted items has received much attention in the recent psychometric literature. If not handled properly, such missingness can bias parameter estimation, distort inferences about examinees, and ultimately erode the validity of the test. This paper reviews several commonly used IRT-based models that accommodate missingness, followed by three popular examinee scoring methods: maximum likelihood estimation, maximum a posteriori, and expected a posteriori. Simulation studies compared these scoring methods across the reviewed models in the presence of missingness. Results showed that all methods recovered examinees' abilities accurately when the missingness was ignorable. When the missingness was nonignorable, incorporating the missing responses improved the precision of ability estimates for examinees with missing data, especially when the test was short. Among the scoring methods, expected a posteriori performed best for evaluating latent traits under models that accommodate missingness. An empirical study based on the PISA 2015 Science Test is also presented.
{"title":"A Note on Latent Traits Estimates under IRT Models with Missingness","authors":"Jinxin Guo, Xin Xu, Tao Xin","doi":"10.1111/jedm.12365","DOIUrl":"10.1111/jedm.12365","url":null,"abstract":"<p>Missingness due to not-reached items and omitted items has received much attention in the recent psychometric literature. Such missingness, if not handled properly, would lead to biased parameter estimation, as well as inaccurate inference of examinees, and further erode the validity of the test. This paper reviews some commonly used IRT based models allowing missingness, followed by three popular examinee scoring methods, including maximum likelihood estimation, maximum a posteriori, and expected a posteriori. Simulation studies were conducted to compare these examinee scoring methods across these commonly used models in the presence of missingness. Results showed that all the methods could infer examinees' ability accurately when the missingness is ignorable. If the missingness is nonignorable, incorporating those missing responses would improve the precision in estimating abilities for examinees with missingness, especially when the test length is short. In terms of examinee scoring methods, expected a posteriori method performed better for evaluating latent traits under models allowing missingness. An empirical study based on the PISA 2015 Science Test was further performed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"575-625"},"PeriodicalIF":1.3,"publicationDate":"2023-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44924100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The study presents multivariate sequential monitoring procedures for examining test-taking behaviors online. The procedures monitor an examinee's responses and response times and signal aberrancy as soon as a significant change is detected in the test-taking behavior. In particular, the study proposes three schemes that track different indicators of the test-taking mode: the observable manifest variables, the latent trait variables, and the measurement likelihood. For each procedure, sequential sampling strategies are presented to implement online monitoring. Numerical experimentation based on simulated data suggests that the proposed procedures perform adequately, identifying examinees with aberrant behaviors with high detection power and timeliness while keeping error rates reasonably small. Application to real data further suggests that the procedures have practical relevance to operational assessments. Based on observations from the empirical analysis, the study discusses implications and guidelines for practical use.
{"title":"Online Monitoring of Test-Taking Behavior Based on Item Responses and Response Times","authors":"Suhwa Han, Hyeon-Ah Kang","doi":"10.1111/jedm.12367","DOIUrl":"10.1111/jedm.12367","url":null,"abstract":"<p>The study presents multivariate sequential monitoring procedures for examining test-taking behaviors online. The procedures monitor examinee's responses and response times and signal aberrancy as soon as significant change is identifieddetected in the test-taking behavior. The study in particular proposes three schemes to track different indicators of a test-taking mode—the observable manifest variables, latent trait variables, and measurement likelihood. For each procedure, sequential sampling strategies are presented to implement online monitoring. Numerical experimentation based on simulated data suggests that the proposed procedures demonstrate adequate performance. The procedures identified examinees with aberrant behaviors with high detection power and timeliness, while maintaining error rates reasonably small. Experimental application to real data also suggested that the procedures have practical relevance to real assessments. Based on the observations from the experiential analysis, the study discusses implications and guidelines for practical use.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"651-675"},"PeriodicalIF":1.3,"publicationDate":"2023-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46552013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Test takers wishing to gain an unfair advantage often share answers with other test takers, either sharing all answers (a full key) or only some (a partial key). Detecting key sharing within a tight testing window requires an efficient, easily interpretable, and rich form of analysis that is both descriptive and inferential. We introduce a detection method based on multiple correspondence analysis (MCA) that identifies test takers with unusual response similarities. The method simultaneously detects multiple shared keys (partial or full), plots the results, and is computationally efficient because it requires only matrix operations. We describe the method, evaluate its detection accuracy under various simulation conditions, and demonstrate the procedure on a real data set with known test-taking misbehavior. The simulation results showed that the MCA method had reasonably high power under realistic conditions and maintained the nominal false-positive level, except when the group size was very large or the partial shared keys covered more than 50% of the items. The real data analysis illustrated the visual detection procedures and inference about the item responses possibly included in the shared key, which was likely shared among 91 test takers, many of whom were confirmed by nonstatistical investigation to have engaged in test-taking misconduct.
{"title":"Detecting Group Collaboration Using Multiple Correspondence Analysis","authors":"Joseph H. Grochowalski, Amy Hendrickson","doi":"10.1111/jedm.12363","DOIUrl":"10.1111/jedm.12363","url":null,"abstract":"<p>Test takers wishing to gain an unfair advantage often share answers with other test takers, either sharing all answers (a full key) or some (a partial key). Detecting key sharing during a tight testing window requires an efficient, easily interpretable, and rich form of analysis that is descriptive and inferential. We introduce a detection method based on multiple correspondence analysis (MCA) that identifies test takers with unusual response similarities. The method simultaneously detects multiple shared keys (partial or full), plots results, and is computationally efficient as it requires only matrix operations. We describe the method, evaluate its detection accuracy under various simulation conditions, and demonstrate the procedure on a real data set with known test-taking misbehavior. The simulation results showed that the MCA method had reasonably high power under realistic conditions and maintained the nominal false-positive level, except when the group size was very large or partial shared keys had more than 50% of the items. The real data analysis illustrated visual detection procedures and inference about the item responses possibly shared in the key, which was likely shared among 91 test takers, many of whom were confirmed by nonstatistical investigation to have engaged in test-taking misconduct.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"402-427"},"PeriodicalIF":1.3,"publicationDate":"2023-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44728675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The purpose of this study was to compare calibration and linking methods, in terms of item parameter recovery, for placing pretest item parameter estimates on the item pool scale in a 1-3 computerized multistage adaptive testing design. Two pretest item designs were used: embedded-section, in which pretest items were administered in a separate module, and embedded-items, in which pretest items were distributed across operational modules. The calibration methods were separate calibration with linking (SC) and fixed calibration (FC), with three parallel approaches under each (FC-1 and SC-1; FC-2 and SC-2; FC-3 and SC-3). FC-1 and SC-1 used only the operational items in the routing module to link the pretest items. FC-2 and SC-2 also used only the operational items in the routing module for linking, but additionally allowed the operational items in the second-stage modules to be freely estimated. FC-3 and SC-3 used the operational items in all modules to link the pretest items. The third approach (FC-3 and SC-3) yielded the best results. For all three approaches, SC outperformed FC across all study conditions, which varied module length, sample size, and examinee distribution.
{"title":"Pretest Item Calibration in Computerized Multistage Adaptive Testing","authors":"Rabia Karatoprak Ersen, Won-Chan Lee","doi":"10.1111/jedm.12361","DOIUrl":"10.1111/jedm.12361","url":null,"abstract":"<p>The purpose of this study was to compare calibration and linking methods for placing pretest item parameter estimates on the item pool scale in a 1-3 computerized multistage adaptive testing design in terms of item parameter recovery. Two models were used: embedded-section, in which pretest items were administered within a separate module, and embedded-items, in which pretest items were distributed across operational modules. The calibration methods were separate calibration with linking (SC) and fixed calibration (FC) with three parallel approaches under each (FC-1 and SC-1; FC-2 and SC-2; and FC-3 and SC-3). The FC-1 and SC-1 used only operational items in the routing module to link pretest items. The FC-2 and SC-2 also used only operational items in the routing module for linking, but in addition, the operational items in second stage modules were freely estimated. The FC-3 and SC-3 used operational items in all modules to link pretest items. The third calibration approach (i.e., FC-3 and SC-3) yielded the best results. For all three approaches, SC outperformed FC in all study conditions which were module length, sample size and examinee distributions.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"379-401"},"PeriodicalIF":1.3,"publicationDate":"2023-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12361","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48133014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A conceptualization of multiple-choice exams in terms of signal detection theory (SDT) leads to simple measures of item difficulty and item discrimination that are closely related to, but also distinct from, those used in classical item analysis (CIA). The theory defines a “true split,” depending on whether or not examinees know an item, and so provides a basis for using total scores to split item tables, as done in CIA, while also clarifying the benefits and limitations of that approach. The SDT item difficulty and discrimination measures differ from those used in CIA in that they explicitly consider the role of distractors and avoid limitations due to range restriction. A new screening measure is also introduced. The measures are theoretically well grounded and simple to compute by hand or with standard software for choice models; simulations show that they offer advantages over traditional measures.
{"title":"Classical Item Analysis from a Signal Detection Perspective","authors":"Lawrence T. DeCarlo","doi":"10.1111/jedm.12358","DOIUrl":"10.1111/jedm.12358","url":null,"abstract":"<p>A conceptualization of multiple-choice exams in terms of signal detection theory (SDT) leads to simple measures of item difficulty and item discrimination that are closely related to, but also distinct from, those used in classical item analysis (CIA). The theory defines a “true split,” depending on whether or not examinees know an item, and so it provides a basis for using total scores to split item tables, as done in CIA, while also clarifying benefits and limitations of the approach. The SDT item difficulty and discrimination measures differ from those used in CIA in that they explicitly consider the role of distractors and avoid limitations due to range restrictions. A new screening measure is also introduced. The measures are theoretically well-grounded and are simple to compute by hand calculations or with standard software for choice models; simulations show that they offer advantages over traditional measures.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"520-547"},"PeriodicalIF":1.3,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42654295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the original article (https://doi.org/10.1111/jedm.12313), it was written that “Then the MLE scoring and DIF analysis with RDIF statistics were performed using the est_score and rdif functions, respectively, in the R (R Core Team, 2019) package irtplay (p. 90).” However, the irtplay package has been removed from the CRAN repository due to intellectual property (IP) violation issues. A new R package called irtQ (Lim & Wells, 2023) has been released as a successor to irtplay, in which all IP issues have been resolved.
We would also like to note that the est_score and rdif functions used in the original study are included in irtQ, so irtQ can be used as a replacement for irtplay. We apologize for any confusion caused by the previous version of the article.
{"title":"Corrigendum: A Residual-Based Differential Item Functioning Detection Framework in Item Response Theory","authors":"Hwanggyu Lim, Edison M. Choe, Kyung T. Han","doi":"10.1111/jedm.12362","DOIUrl":"10.1111/jedm.12362","url":null,"abstract":"<p>In the original article, it was written that “Then the MLE scoring and DIF analysis with RDIF statistics were performed using the <i>est_score</i> and <i>rdif</i> functions, respectively, in the R (R Core Team, 2019) package irtplay (p.90).” However, the irtplay package has been removed from the CRAN repository due to intellectual property (IP) violation issues. Instead, a new R package called irtQ (Lim & Wells, <span>2023</span>) has been released as a successor to irtplay. All IP issues have been resolved in irtQ, ensuring that the package is compliant with industry standards. https://doi.org/10.1111/jedm.12313</p><p>We would like to inform that the same functions of <i>est_score</i> and <i>rdif</i> used in the original study are also included in irtQ. Thus, it can be used as a replacement for irtplay. We apologize for any confusion caused by the previous version of the article.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 1","pages":"170"},"PeriodicalIF":1.3,"publicationDate":"2023-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12362","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44304771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using item response theory to model rater effects provides an alternative to standard performance metrics for rater monitoring and diagnosis. To fit such models, however, the ratings data must be sufficiently connected to allow estimation of the rater effects. The rating designs popular in large-scale testing tend to produce a large proportion of missing data, yielding sparse matrices and estimation problems. In this article, we explore the impact of different types of connectedness, or linkage, brought about by using a linkage set, that is, a collection of responses scored by most or all raters. We also explore the impact of the properties and composition of the linkage set, the different connectedness yielded by different rating designs, and the role of scores from automated scoring engines. For designing monitoring systems based on the rater response version of the generalized partial credit model, the results support the use of a linkage set, especially a large one composed of responses representing the full score scale. The results also show that a double-human-scoring design provides more connectedness than a design pairing one human with an automated scoring engine, and that scores from automated scoring engines alone do not provide adequate connectedness. We discuss considerations for operational implementation and further study.
{"title":"Using Linkage Sets to Improve Connectedness in Rater Response Model Estimation","authors":"Jodi M. Casabianca, John R. Donoghue, Hyo Jeong Shin, Szu-Fu Chao, Ikkyu Choi","doi":"10.1111/jedm.12360","DOIUrl":"10.1111/jedm.12360","url":null,"abstract":"<p>Using item-response theory to model rater effects provides an alternative solution for rater monitoring and diagnosis, compared to using standard performance metrics. In order to fit such models, the ratings data must be sufficiently connected in order to estimate rater effects. Due to popular rating designs used in large-scale testing scenarios, there tends to be a large proportion of missing data, yielding sparse matrices and estimation issues. In this article, we explore the impact of different types of connectedness, or linkage, brought about by using a linkage set—a collection of responses scored by most or all raters. We also explore the impact of the properties and composition of the linkage set, the different connectedness yielded from different rating designs, and the role of scores from automated scoring engines. In designing monitoring systems using the rater response version of the generalized partial credit model, the study results suggest use of a linkage set, especially a large one that is comprised of responses representing the full score scale. Results also show that a double-human-scoring design provides more connectedness than a design with one human and an automated scoring engine. Furthermore, scores from automated scoring engines do not provide adequate connectedness. We discuss considerations for operational implementation and further study.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"428-454"},"PeriodicalIF":1.3,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47979944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As diagnostic classification models become more widely used in large-scale operational assessments, we must consider how reliability is estimated and reported. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment systems. In this article, we describe and evaluate a method of simulating retests to summarize reliability evidence at multiple reporting levels. We evaluate how reliability estimates from simulated retests compare to other measures of classification consistency and accuracy for diagnostic assessments that have previously been described in the literature but that limit the level at which reliability can be reported. Overall, the findings show that reliability estimates from simulated retests are accurate and consistent with other reliability measures for diagnostic assessments. We then apply the method to real data from the Examination for the Certificate of Proficiency in English to demonstrate it in practice and to compare the estimates with those based on observed data. Finally, we discuss implications for the field and possible next directions.
{"title":"Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems","authors":"W. Jake Thompson, Brooke Nash, Amy K. Clark, Jeffrey C. Hoover","doi":"10.1111/jedm.12359","DOIUrl":"10.1111/jedm.12359","url":null,"abstract":"<p>As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment systems. In this article, we describe and evaluate a method for simulating retests to summarize reliability evidence at multiple reporting levels. We evaluate how the performance of reliability estimates from simulated retests compares to other measures of classification consistency and accuracy for diagnostic assessments that have previously been described in the literature, but which limit the level at which reliability can be reported. Overall, the findings show that reliability estimates from simulated retests are an accurate measure of reliability and are consistent with other measures of reliability for diagnostic assessments. We then apply this method to real data from the Examination for the Certificate of Proficiency in English to demonstrate the method in practice and compare reliability estimates from observed data. Finally, we discuss implications for the field and possible next directions.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"455-475"},"PeriodicalIF":1.3,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47801652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The simple average of student growth scores is often used in accountability systems, but it can be problematic for decision making. When computed from a small or moderate number of students, it can be sensitive to the particular sample, resulting in inaccurate representations of student growth, low year-to-year stability, and inequities for low-incidence groups. An alternative designed to address these issues is the Empirical Best Linear Prediction (EBLP), a weighted average that incorporates growth score data from other years and/or subjects. We apply both approaches to two statewide datasets to answer empirical questions about their performance. The EBLP outperforms the simple average in accuracy and cross-year stability, with the exception that accuracy was not necessarily improved for very large districts in one of the states. In such cases, we show that a beneficial alternative may be a hybrid approach in which very large districts receive the simple average and all others receive the EBLP. We find that adding more growth score data to the computation of the EBLP can improve accuracy, but not necessarily for larger schools or districts. We conclude by reviewing key decision points in aggregate growth reporting and in specifying an EBLP weighted average in practice.
{"title":"An Exploration of an Improved Aggregate Student Growth Measure Using Data from Two States","authors":"Katherine E. Castellano, Daniel F. McCaffrey, J. R. Lockwood","doi":"10.1111/jedm.12354","DOIUrl":"10.1111/jedm.12354","url":null,"abstract":"<p>The simple average of student growth scores is often used in accountability systems, but it can be problematic for decision making. When computed using a small/moderate number of students, it can be sensitive to the sample, resulting in inaccurate representations of growth of the students, low year-to-year stability, and inequities for low-incidence groups. An alternative designed to address these issues is to use an Empirical Best Linear Prediction (EBLP), which is a weighted average of growth score data from other years and/or subjects. We apply both approaches to two statewide datasets to answer empirical questions about their performance. The EBLP outperforms the simple average in accuracy and cross-year stability with the exception that accuracy was not necessarily improved for very large districts in one of the states. In such exceptions, we show a beneficial alternative may be to use a hybrid approach in which very large districts receive the simple average and all others receive the EBLP. We find that adding more growth score data to the computation of the EBLP can improve accuracy, but not necessarily for larger schools/districts. We review key decision points in aggregate growth reporting and in specifying an EBLP weighted average in practice.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 2","pages":"173-201"},"PeriodicalIF":1.3,"publicationDate":"2023-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41556588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Test scores are often used to make decisions about examinees, such as in licensure and certification testing as well as in many educational contexts. In some cases, these decisions are based on compensatory scores, such as those combining multiple sections or components of an exam. Classification accuracy and classification consistency are two psychometric characteristics that are often reported when decisions are based on test scores, and several techniques exist for estimating both. However, research on classification accuracy and consistency for compensatory test scores is scarce. This study demonstrates two techniques that can be used to estimate classification accuracy and consistency when test scores are used in a compensatory manner. First, a simulation study demonstrates that both methods provide very similar results under the studied conditions. Second, we demonstrate how the two methods could be used with a high-stakes licensure exam.
{"title":"Classification Accuracy and Consistency of Compensatory Composite Test Scores","authors":"J. Carl Setzer, Ying Cheng, Cheng Liu","doi":"10.1111/jedm.12357","DOIUrl":"10.1111/jedm.12357","url":null,"abstract":"<p>Test scores are often used to make decisions about examinees, such as in licensure and certification testing, as well as in many educational contexts. In some cases, these decisions are based upon compensatory scores, such as those from multiple sections or components of an exam. Classification accuracy and classification consistency are two psychometric characteristics of test scores that are often reported when decisions are based on those scores, and several techniques currently exist for estimating both accuracy and consistency. However, research on classification accuracy and consistency on compensatory test scores is scarce. This study demonstrates two techniques that can be used to estimate classification accuracy and consistency when test scores are used in a compensatory manner. First, a simulation study demonstrates that both methods provide very similar results under the studied conditions. Second, we demonstrate how the two methods could be used with a high-stakes licensure exam.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"501-519"},"PeriodicalIF":1.3,"publicationDate":"2023-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46918652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}