Pub Date: 2023-03-01. Epub Date: 2023-01-12. DOI: 10.1177/01466216231151705
Wenjing Guo, Stefanie A Wind
In standalone performance assessments, researchers have explored the influence of different rating designs on the sensitivity of latent trait model indicators to different rater effects, as well as the impacts of those designs on student achievement estimates. However, the literature provides little guidance on the degree to which different rating designs might affect rater classification accuracy (severe/lenient) and rater measurement precision in either standalone performance assessments or mixed-format assessments. Using results from an analysis of National Assessment of Educational Progress (NAEP) data, we conducted simulation studies to systematically explore the impacts of different rating designs on rater measurement precision and rater classification accuracy (severe/lenient) in mixed-format assessments. The results suggest that the complete rating design produced the highest rater classification accuracy and the greatest rater measurement precision, followed by the multiple-choice (MC) + spiral link design and the MC link design. Considering that complete rating designs are not practical in most testing situations, the MC + spiral link design may be a useful choice because it balances cost and performance. We consider the implications of our findings for research and practice.
{"title":"The Effects of Rating Designs on Rater Classification Accuracy and Rater Measurement Precision in Large-Scale Mixed-Format Assessments.","authors":"Wenjing Guo, Stefanie A Wind","doi":"10.1177/01466216231151705","DOIUrl":"10.1177/01466216231151705","url":null,"abstract":"<p><p>In standalone performance assessments, researchers have explored the influence of different rating designs on the sensitivity of latent trait model indicators to different rater effects as well as the impacts of different rating designs on student achievement estimates. However, the literature provides little guidance on the degree to which different rating designs might affect rater classification accuracy (severe/lenient) and rater measurement precision in both standalone performance assessments and mixed-format assessments. Using results from an analysis of National Assessment of Educational Progress (NAEP) data, we conducted simulation studies to systematically explore the impacts of different rating designs on rater measurement precision and rater classification accuracy (severe/lenient) in mixed-format assessments. The results suggest that the complete rating design produced the highest rater classification accuracy and greatest rater measurement precision, followed by the multiple-choice (MC) + spiral link design and the MC link design. Considering that complete rating designs are not practical in most testing situations, the MC + spiral link design may be a useful choice because it balances cost and performance. We consider the implications of our findings for research and practice.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 2","pages":"91-105"},"PeriodicalIF":1.2,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9979195/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10846015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-03-01. Epub Date: 2022-10-04. DOI: 10.1177/01466216221124087
Waldir Leôncio, Marie Wiberg, Michela Battauz
Test equating is a statistical procedure for ensuring that scores from different test forms can be used interchangeably. Several methodologies are available to perform equating, some based on the Classical Test Theory (CTT) framework and others on the Item Response Theory (IRT) framework. This article compares equating transformations that originate from three different frameworks, namely IRT Observed-Score Equating (IRTOSE), Kernel Equating (KE), and IRT Kernel Equating (IRTKE). The comparisons were made under different data-generating scenarios, including a novel data-generation procedure that simulates test data without relying on IRT parameters while still providing control over test score properties such as distribution skewness and item difficulty. Our results suggest that IRT methods tend to provide better results than KE even when the data are not generated from IRT processes. KE might provide satisfactory results if a proper pre-smoothing solution can be found, while also being much faster than the IRT methods. For daily applications, we recommend examining the sensitivity of the results to the equating method, keeping in mind the importance of good model fit and of meeting the assumptions of the chosen framework.
{"title":"Evaluating Equating Transformations in IRT Observed-Score and Kernel Equating Methods.","authors":"Waldir Leôncio, Marie Wiberg, Michela Battauz","doi":"10.1177/01466216221124087","DOIUrl":"10.1177/01466216221124087","url":null,"abstract":"<p><p>Test equating is a statistical procedure to ensure that scores from different test forms can be used interchangeably. There are several methodologies available to perform equating, some of which are based on the Classical Test Theory (CTT) framework and others are based on the Item Response Theory (IRT) framework. This article compares equating transformations originated from three different frameworks, namely IRT Observed-Score Equating (IRTOSE), Kernel Equating (KE), and IRT Kernel Equating (IRTKE). The comparisons were made under different data-generating scenarios, which include the development of a novel data-generation procedure that allows the simulation of test data without relying on IRT parameters while still providing control over some test score properties such as distribution skewness and item difficulty. Our results suggest that IRT methods tend to provide better results than KE even when the data are not generated from IRT processes. KE might be able to provide satisfactory results if a proper pre-smoothing solution can be found, while also being much faster than IRT methods. For daily applications, we recommend observing the sensibility of the results to the equating method, minding the importance of good model fit and meeting the assumptions of the framework.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 2","pages":"123-140"},"PeriodicalIF":1.0,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/74/30/10.1177_01466216221124087.PMC9979196.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10846018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-03-01. Epub Date: 2023-01-29. DOI: 10.1177/01466216231151701
Selena Wang, Paul De Boeck, Marcel Yotebieng
Heywood cases are known from the linear factor analysis literature as variables with communalities larger than 1.00; in present-day factor models, the problem also shows up as negative residual variances. For binary data, factor models for ordinal data can be applied with either the delta parameterization or the theta parameterization. The former is more common than the latter and can yield Heywood cases when limited-information estimation is used. The same problem shows up as nonconvergence in theta-parameterized factor models and as extremely large discriminations in item response theory (IRT) models. In this study, we explain why the same problem appears in different forms depending on the method of analysis. We first discuss the issue using equations and then illustrate our conclusions with a small simulation study in which all three methods, delta- and theta-parameterized ordinal factor models (with estimation based on polychoric correlations and thresholds) and an IRT model (with full-information estimation), are used to analyze the same datasets. The results generalize across the WLS, WLSMV, and ULS estimators for the factor models for ordinal data. Finally, we analyze real data with the same three approaches. The results of the simulation study and the analysis of real data confirm the theoretical conclusions.
{"title":"Heywood Cases in Unidimensional Factor Models and Item Response Models for Binary Data.","authors":"Selena Wang, Paul De Boeck, Marcel Yotebieng","doi":"10.1177/01466216231151701","DOIUrl":"10.1177/01466216231151701","url":null,"abstract":"<p><p>Heywood cases are known from linear factor analysis literature as variables with communalities larger than 1.00, and in present day factor models, the problem also shows in negative residual variances. For binary data, factor models for ordinal data can be applied with either delta parameterization or theta parametrization. The former is more common than the latter and can yield Heywood cases when limited information estimation is used. The same problem shows up as non convergence cases in theta parameterized factor models and as extremely large discriminations in item response theory (IRT) models. In this study, we explain why the same problem appears in different forms depending on the method of analysis. We first discuss this issue using equations and then illustrate our conclusions using a small simulation study, where all three methods, delta and theta parameterized ordinal factor models (with estimation based on polychoric correlations and thresholds) and an IRT model (with full information estimation), are used to analyze the same datasets. The results generalize across WLS, WLSMV, and ULS estimators for the factor models for ordinal data. Finally, we analyze real data with the same three approaches. The results of the simulation study and the analysis of real data confirm the theoretical conclusions.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 2","pages":"141-154"},"PeriodicalIF":1.2,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9979198/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10846019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-03-01. Epub Date: 2022-09-23. DOI: 10.1177/01466216221129271
Sandip Sinharay, Matthew S Johnson, Wei Wang, Jing Miao
Targeted double scoring, or double scoring of only some (but not all) responses, is used to reduce the burden of scoring performance tasks for several mastery tests (Finkelman, Darby, & Nering, 2008). An approach based on statistical decision theory (e.g., Berger, 1989; Ferguson, 1967; Rudner, 2009) is suggested to evaluate and potentially improve upon the existing strategies in targeted double scoring for mastery tests. An application of the approach to data from an operational mastery test shows that a refinement of the currently used strategy would lead to substantial cost savings.
{"title":"Targeted Double Scoring of Performance Tasks Using a Decision-Theoretic Approach.","authors":"Sandip Sinharay, Matthew S Johnson, Wei Wang, Jing Miao","doi":"10.1177/01466216221129271","DOIUrl":"10.1177/01466216221129271","url":null,"abstract":"<p><p>Targeted double scoring, or, double scoring of only some (but not all) responses, is used to reduce the burden of scoring performance tasks for several mastery tests (Finkelman, Darby, & Nering, 2008). An approach based on statistical decision theory (e.g., Berger, 1989; Ferguson, 1967; Rudner, 2009) is suggested to evaluate and potentially improve upon the existing strategies in targeted double scoring for mastery tests. An application of the approach to data from an operational mastery test shows that a refinement of the currently used strategy would lead to substantial cost savings.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 2","pages":"155-163"},"PeriodicalIF":1.2,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9979197/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9393345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-01-01. Epub Date: 2022-09-30. DOI: 10.1177/01466216221124091
Niek Frans, Johan Braeken, Bernard P Veldkamp, Muirne C S Paap
The use of empirical prior information about participants has been shown to substantially improve the efficiency of computerized adaptive tests (CATs) in educational settings. However, it is unclear how these results translate to clinical settings, where small item banks with highly informative polytomous items often lead to very short CATs. We explored the risks and rewards of using prior information in CAT in two simulation studies rooted in applied clinical examples. In the first simulation, prior precision and bias in the prior location were manipulated independently. Our results show that a precise personalized prior can meaningfully increase CAT efficiency. However, this reward comes with the potential risk of overconfidence in wrong empirical information (i.e., using a precise but severely biased prior), which can lead to unnecessarily long tests or severely biased estimates. The latter risk can be mitigated by setting a minimum number of items to be administered during the CAT or by setting a less precise prior, albeit at the expense of canceling out any efficiency gains. The second simulation, with more realistic combinations of bias and precision in the empirical prior, places the prevalence of these potential risks in context. With similar estimation bias, an empirical prior reduced CAT test length relative to a standard normal prior in 68% of cases, by a median of 20%, while test length increased in only 3% of cases. The use of prior information in CAT appears to be a feasible and simple method to reduce test burden for patients and clinical practitioners alike.
{"title":"Empirical Priors in Polytomous Computerized Adaptive Tests: Risks and Rewards in Clinical Settings.","authors":"Niek Frans, Johan Braeken, Bernard P Veldkamp, Muirne C S Paap","doi":"10.1177/01466216221124091","DOIUrl":"https://doi.org/10.1177/01466216221124091","url":null,"abstract":"<p><p>The use of empirical prior information about participants has been shown to substantially improve the efficiency of computerized adaptive tests (CATs) in educational settings. However, it is unclear how these results translate to clinical settings, where small item banks with highly informative polytomous items often lead to very short CATs. We explored the risks and rewards of using prior information in CAT in two simulation studies, rooted in applied clinical examples. In the first simulation, prior precision and bias in the prior location were manipulated independently. Our results show that a precise personalized prior can meaningfully increase CAT efficiency. However, this reward comes with the potential risk of overconfidence in wrong empirical information (i.e., using a precise severely biased prior), which can lead to unnecessarily long tests, or severely biased estimates. The latter risk can be mitigated by setting a minimum number of items that are to be administered during the CAT, or by setting a less precise prior; be it at the expense of canceling out any efficiency gains. The second simulation, with more realistic bias and precision combinations in the empirical prior, places the prevalence of the potential risks in context. With similar estimation bias, an empirical prior reduced CAT test length, compared to a standard normal prior, in 68% of cases, by a median of 20%; while test length increased in only 3% of cases. The use of prior information in CAT seems to be a feasible and simple method to reduce test burden for patients and clinical practitioners alike.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 1","pages":"48-63"},"PeriodicalIF":1.2,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/57/79/10.1177_01466216221124091.PMC9679926.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40494727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-01-01. Epub Date: 2022-09-17. DOI: 10.1177/01466216221108077
Zhuangzhuang Han, Sandip Sinharay, Matthew S Johnson, Xiang Liu
The S-X2 statistic (Orlando & Thissen, 2000) is popular among researchers and practitioners interested in the assessment of item fit. However, the statistic suffers from the Chernoff-Lehmann problem (Chernoff & Lehmann, 1954) and hence does not have a known asymptotic null distribution. This paper suggests a modified version of the S-X2 statistic based on the modified Rao-Robson χ2 statistic (Rao & Robson, 1974). A simulation study and a real data analysis demonstrate that using the modified statistic instead of the S-X2 statistic would lead to fewer items being flagged for misfit.
{"title":"The Standardized S-<i>X</i> <sup>2</sup> Statistic for Assessing Item Fit.","authors":"Zhuangzhuang Han, Sandip Sinharay, Matthew S Johnson, Xiang Liu","doi":"10.1177/01466216221108077","DOIUrl":"10.1177/01466216221108077","url":null,"abstract":"<p><p>The S-<i>X</i> <sup>2</sup> statistic (Orlando & Thissen, 2000) is popular among researchers and practitioners who are interested in the assessment of item fit. However, the statistic suffers from the Chernoff-Lehmann problem (Chernoff & Lehmann, 1954) and hence does not have a known asymptotic null distribution. This paper suggests a modified version of the S-<i>X</i> <sup>2</sup> statistic that is based on the modified Rao-Robson <i>χ</i> <sup>2</sup> statistic (Rao & Robson, 1974). A simulation study and a real data analyses demonstrate that the use of the modified statistic instead of the S-<i>X</i> <sup>2</sup> statistic would lead to fewer items being flagged for misfit.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 1","pages":"3-18"},"PeriodicalIF":1.2,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9679924/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40494731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-01-01. Epub Date: 2022-09-20. DOI: 10.1177/01466216221128011
Katherine E Castellano, Sandip Sinharay, Jiangang Hao, Chen Li
In response to the closures of test centers worldwide due to the COVID-19 pandemic, several testing programs offered large-scale standardized assessments to examinees remotely. However, because of the varying quality of personal devices and internet connections, at-home examinees were more likely to suffer "disruptions," that is, interruptions in the connectivity of their testing session, than examinees in typical test-center administrations. Disruptions have the potential to adversely affect examinees and lead to fairness or validity issues. The goal of this study was to investigate the extent to which disruptions affected the performance of at-home examinees, using data from a large-scale admissions test. Specifically, the study compared the average test scores of the disrupted examinees with those of the non-disrupted examinees after weighting the non-disrupted examinees to resemble the disrupted examinees on baseline characteristics. The results show that disruptions had a small negative impact on test scores on average. However, there was little difference in performance between the disrupted and non-disrupted examinees after removing the records of disrupted examinees who were unable to complete the test.
{"title":"An Investigation Into the Impact of Test Session Disruptions for At-Home Test Administrations.","authors":"Katherine E Castellano, Sandip Sinharay, Jiangang Hao, Chen Li","doi":"10.1177/01466216221128011","DOIUrl":"10.1177/01466216221128011","url":null,"abstract":"<p><p>In response to the closures of test centers worldwide due to the COVID-19 pandemic, several testing programs offered large-scale standardized assessments to examinees remotely. However, due to the varying quality of the performance of personal devices and internet connections, more at-home examinees likely suffered \"disruptions\" or an interruption in the connectivity to their testing session compared to typical test-center administrations. Disruptions have the potential to adversely affect examinees and lead to fairness or validity issues. The goal of this study was to investigate the extent to which disruptions impacted performance of at-home examinees using data from a large-scale admissions test. Specifically, the study involved comparing the average test scores of the disrupted examinees with those of the non-disrupted examinees after weighting the non-disrupted examinees to resemble the disrupted examinees along baseline characteristics. The results show that disruptions had a small negative impact on test scores on average. However, there was little difference in performance between the disrupted and non-disrupted examinees after removing records of the disrupted examinees who were unable to complete the test.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 1","pages":"76-82"},"PeriodicalIF":1.2,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9679922/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40494729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-01-01. DOI: 10.1177/01466216221124604
Ren Liu, Ihnwhi Heo, Haiyan Liu, Dexin Shi, Zhehan Jiang
Diagnostic classification models (DCMs) have been used to classify examinees into groups based on their possession status of a set of latent traits. In addition to traditional item-based scoring approaches, examinees may be scored based on their completion of a series of small and similar tasks. Such scores are usually treated as count variables. To model count scores, this study proposes a new class of DCMs that uses the negative binomial distribution at its core. We explain the proposed model framework and demonstrate its use through an operational example. Simulation studies were conducted to evaluate the performance of the proposed model and compare it with the Poisson-based DCM.
{"title":"Applying Negative Binomial Distribution in Diagnostic Classification Models for Analyzing Count Data.","authors":"Ren Liu, Ihnwhi Heo, Haiyan Liu, Dexin Shi, Zhehan Jiang","doi":"10.1177/01466216221124604","DOIUrl":"10.1177/01466216221124604","url":null,"abstract":"<p><p>Diagnostic classification models (DCMs) have been used to classify examinees into groups based on their possession status of a set of latent traits. In addition to traditional item-based scoring approaches, examinees may be scored based on their completion of a series of small and similar tasks. Those scores are usually considered as count variables. To model count scores, this study proposes a new class of DCMs that uses the negative binomial distribution at its core. We explained the proposed model framework and demonstrated its use through an operational example. Simulation studies were conducted to evaluate the performance of the proposed model and compare it with the Poisson-based DCM.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 1","pages":"64-75"},"PeriodicalIF":1.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/07/94/10.1177_01466216221124604.PMC9679925.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40494728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-01-01. Epub Date: 2022-10-10. DOI: 10.1177/01466216221125178
Feri Wijayanto, Ioan Gabriel Bucur, Perry Groot, Tom Heskes
The R package autoRasch has been developed to perform Rasch analysis in a (semi-)automated way. The automated part of the analysis is achieved by optimizing the so-called in-plus-out-of-questionnaire log-likelihood (IPOQ-LL) or, when differential item functioning (DIF) is included, the IPOQ-LL-DIF. These criteria measure the quality of fit on a pre-collected survey, depending on which items are included in the final instrument. To compute these criteria, autoRasch fits the generalized partial credit model (GPCM) or the generalized partial credit model with differential item functioning (GPCM-DIF) using penalized joint maximum likelihood estimation (PJMLE). The package further allows the user to reevaluate the output of the automated method and to use it as a basis for a manual Rasch analysis, and it provides standard statistics of Rasch analyses (e.g., outfit, infit, person separation reliability, and residual correlations) to support the model reevaluation.
{"title":"autoRasch: An R Package to Do Semi-Automated Rasch Analysis.","authors":"Feri Wijayanto, Ioan Gabriel Bucur, Perry Groot, Tom Heskes","doi":"10.1177/01466216221125178","DOIUrl":"https://doi.org/10.1177/01466216221125178","url":null,"abstract":"<p><p>The R package autoRasch has been developed to perform a Rasch analysis in a (semi-)automated way. The automated part of the analysis is achieved by optimizing the so-called <i>in-plus-out-of-questionnaire log-likelihood</i> (IPOQ-LL) or IPOQ-LL-DIF when differential item functioning (DIF) is included. These criteria measure the quality of fit on a pre-collected survey, depending on which items are included in the final instrument. To compute these criteria, autoRasch fits the generalized partial credit model (GPCM) or the generalized partial credit model with differential item functioning (GPCM-DIF) using penalized joint maximum likelihood estimation (PJMLE). The package further allows the user to reevaluate the output of the automated method and use it as a basis for performing a manual Rasch analysis and provides standard statistics of Rasch analyses (e.g., outfit, infit, person separation reliability, and residual correlation) to support the model reevaluation.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 1","pages":"83-85"},"PeriodicalIF":1.2,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9679921/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40494732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-01-01. Epub Date: 2022-09-06. DOI: 10.1177/01466216221124045
Chunyan Liu, Daniel Jurich
In equating practice, outliers among the anchor items can degrade equating accuracy and threaten the validity of test scores. Therefore, the stability of anchor item performance should be evaluated before equating is conducted. This study used simulation to investigate the performance of the t-test method in detecting outliers and compared it with other outlier detection methods, including the logit difference method with 0.5 and 0.3 as cutoff values and the robust z statistic with 2.7 as the cutoff value. The investigated factors included sample size, proportion of outliers, direction of item difficulty drift, and group difference. Across all simulated conditions, the t-test method outperformed the other methods in terms of sensitivity in flagging true outliers, bias of the estimated translation constant, and root mean square error of examinee ability estimates.
{"title":"Outlier Detection Using t-test in Rasch IRT Equating under NEAT Design.","authors":"Chunyan Liu, Daniel Jurich","doi":"10.1177/01466216221124045","DOIUrl":"10.1177/01466216221124045","url":null,"abstract":"<p><p>In equating practice, the existence of outliers in the anchor items may deteriorate the equating accuracy and threaten the validity of test scores. Therefore, stability of the anchor item performance should be evaluated before conducting equating. This study used simulation to investigate the performance of the <i>t</i>-test method in detecting outliers and compared its performance with other outlier detection methods, including the logit difference method with 0.5 and 0.3 as the cutoff values and the robust <i>z</i> statistic with 2.7 as the cutoff value. The investigated factors included sample size, proportion of outliers, item difficulty drift direction, and group difference. Across all simulated conditions, the <i>t</i>-test method outperformed the other methods in terms of sensitivity of flagging true outliers, bias of the estimated translation constant, and the root mean square error of examinee ability estimates.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"47 1","pages":"34-47"},"PeriodicalIF":1.2,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9679927/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40494730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}