
Applied Psychological Measurement: Latest Articles

The Effects of Rating Designs on Rater Classification Accuracy and Rater Measurement Precision in Large-Scale Mixed-Format Assessments.
IF 1.2, Zone 4 (Psychology), Q2 Social Sciences. Pub Date: 2023-03-01. Epub Date: 2023-01-12. DOI: 10.1177/01466216231151705
Wenjing Guo, Stefanie A Wind

In standalone performance assessments, researchers have explored the influence of different rating designs on the sensitivity of latent trait model indicators to different rater effects as well as the impacts of different rating designs on student achievement estimates. However, the literature provides little guidance on the degree to which different rating designs might affect rater classification accuracy (severe/lenient) and rater measurement precision in both standalone performance assessments and mixed-format assessments. Using results from an analysis of National Assessment of Educational Progress (NAEP) data, we conducted simulation studies to systematically explore the impacts of different rating designs on rater measurement precision and rater classification accuracy (severe/lenient) in mixed-format assessments. The results suggest that the complete rating design produced the highest rater classification accuracy and greatest rater measurement precision, followed by the multiple-choice (MC) + spiral link design and the MC link design. Considering that complete rating designs are not practical in most testing situations, the MC + spiral link design may be a useful choice because it balances cost and performance. We consider the implications of our findings for research and practice.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9979195/pdf/
Citations: 0
Evaluating Equating Transformations in IRT Observed-Score and Kernel Equating Methods.
IF 1, Zone 4 (Psychology), Q4 Psychology, Mathematical. Pub Date: 2023-03-01. Epub Date: 2022-10-04. DOI: 10.1177/01466216221124087
Waldir Leôncio, Marie Wiberg, Michela Battauz

Test equating is a statistical procedure to ensure that scores from different test forms can be used interchangeably. Several methodologies are available to perform equating, some based on the Classical Test Theory (CTT) framework and others on the Item Response Theory (IRT) framework. This article compares equating transformations originating from three different frameworks, namely IRT Observed-Score Equating (IRTOSE), Kernel Equating (KE), and IRT Kernel Equating (IRTKE). The comparisons were made under different data-generating scenarios, which include the development of a novel data-generation procedure that allows the simulation of test data without relying on IRT parameters while still providing control over some test score properties such as distribution skewness and item difficulty. Our results suggest that IRT methods tend to provide better results than KE even when the data are not generated from IRT processes. KE might be able to provide satisfactory results if a proper pre-smoothing solution can be found, while also being much faster than IRT methods. For daily applications, we recommend observing the sensitivity of the results to the equating method, minding the importance of good model fit and meeting the assumptions of the framework.

Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/74/30/10.1177_01466216221124087.PMC9979196.pdf
Citations: 0
Heywood Cases in Unidimensional Factor Models and Item Response Models for Binary Data.
IF 1.2, Zone 4 (Psychology), Q2 Social Sciences. Pub Date: 2023-03-01. Epub Date: 2023-01-29. DOI: 10.1177/01466216231151701
Selena Wang, Paul De Boeck, Marcel Yotebieng

Heywood cases are known from the linear factor analysis literature as variables with communalities larger than 1.00, and in present-day factor models, the problem also shows up as negative residual variances. For binary data, factor models for ordinal data can be applied with either delta parameterization or theta parameterization. The former is more common than the latter and can yield Heywood cases when limited information estimation is used. The same problem shows up as nonconvergence cases in theta-parameterized factor models and as extremely large discriminations in item response theory (IRT) models. In this study, we explain why the same problem appears in different forms depending on the method of analysis. We first discuss this issue using equations and then illustrate our conclusions using a small simulation study, where all three methods, delta- and theta-parameterized ordinal factor models (with estimation based on polychoric correlations and thresholds) and an IRT model (with full information estimation), are used to analyze the same datasets. The results generalize across WLS, WLSMV, and ULS estimators for the factor models for ordinal data. Finally, we analyze real data with the same three approaches. The results of the simulation study and the analysis of real data confirm the theoretical conclusions.
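
As a rough illustration of why one underlying problem can surface in these different forms, the following Python sketch assumes a one-factor probit model for binary items with a standardized latent response. Under that assumption the delta-parameterized residual variance is 1 − λ² and the matching normal-ogive discrimination is λ/√(1 − λ²). These are standard textbook conversions rather than formulas quoted from the article, and the function name `delta_to_irt` is hypothetical.

```python
import math

def delta_to_irt(loading):
    """Map a standardized (delta-parameterization) loading to
    (residual variance, normal-ogive discrimination).
    Assumes a one-factor probit model; illustrative only."""
    residual_var = 1.0 - loading ** 2
    if residual_var <= 0.0:
        # Heywood case: communality >= 1, so no finite discrimination exists.
        return residual_var, math.inf
    return residual_var, loading / math.sqrt(residual_var)

# As the loading approaches 1, the discrimination grows without bound;
# past 1, the residual variance turns negative.
for lam in (0.6, 0.9, 0.99, 1.05):
    resid, a = delta_to_irt(lam)
    print(f"loading={lam:.2f}  residual variance={resid:+.4f}  discrimination={a:.2f}")
```

Read this way, a loading slightly above 1 (a negative residual variance) and an "extremely large" discrimination describe the same boundary of the parameter space from two sides.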

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9979198/pdf/
Citations: 0
Targeted Double Scoring of Performance Tasks Using a Decision-Theoretic Approach.
IF 1.2, Zone 4 (Psychology), Q2 Social Sciences. Pub Date: 2023-03-01. Epub Date: 2022-09-23. DOI: 10.1177/01466216221129271
Sandip Sinharay, Matthew S Johnson, Wei Wang, Jing Miao

Targeted double scoring, that is, double scoring of only some (but not all) responses, is used to reduce the burden of scoring performance tasks for several mastery tests (Finkelman, Darby, & Nering, 2008). An approach based on statistical decision theory (e.g., Berger, 1989; Ferguson, 1967; Rudner, 2009) is suggested to evaluate and potentially improve upon the existing strategies in targeted double scoring for mastery tests. An application of the approach to data from an operational mastery test shows that a refinement of the currently used strategy would lead to substantial cost savings.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9979197/pdf/
Citations: 0
Empirical Priors in Polytomous Computerized Adaptive Tests: Risks and Rewards in Clinical Settings.
IF 1.2, Zone 4 (Psychology), Q2 Social Sciences. Pub Date: 2023-01-01. Epub Date: 2022-09-30. DOI: 10.1177/01466216221124091
Niek Frans, Johan Braeken, Bernard P Veldkamp, Muirne C S Paap

The use of empirical prior information about participants has been shown to substantially improve the efficiency of computerized adaptive tests (CATs) in educational settings. However, it is unclear how these results translate to clinical settings, where small item banks with highly informative polytomous items often lead to very short CATs. We explored the risks and rewards of using prior information in CAT in two simulation studies, rooted in applied clinical examples. In the first simulation, prior precision and bias in the prior location were manipulated independently. Our results show that a precise personalized prior can meaningfully increase CAT efficiency. However, this reward comes with the potential risk of overconfidence in wrong empirical information (i.e., using a precise severely biased prior), which can lead to unnecessarily long tests, or severely biased estimates. The latter risk can be mitigated by setting a minimum number of items that are to be administered during the CAT, or by setting a less precise prior; be it at the expense of canceling out any efficiency gains. The second simulation, with more realistic bias and precision combinations in the empirical prior, places the prevalence of the potential risks in context. With similar estimation bias, an empirical prior reduced CAT test length, compared to a standard normal prior, in 68% of cases, by a median of 20%; while test length increased in only 3% of cases. The use of prior information in CAT seems to be a feasible and simple method to reduce test burden for patients and clinical practitioners alike.
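
The trade-off between prior precision and prior bias can be sketched with a deliberately simplified model. The Python example below is not the authors' simulation: it assumes each administered item behaves like a normal observation of the latent trait with a known standard deviation, so that the posterior follows the conjugate normal-normal update, and it stops once the posterior standard deviation falls below a target. The function name, the stopping threshold of 0.4, and the noise-free responses are all assumptions chosen for illustration.

```python
import math

def items_until_precise(prior_mean, prior_sd, observations,
                        obs_sd=1.0, target_se=0.4):
    """Normal-normal updating as a stand-in for CAT scoring.
    Returns (items used, posterior mean, posterior SD)."""
    precision = 1.0 / prior_sd ** 2
    mean = prior_mean
    used = 0
    for y in observations:
        # Stop as soon as the posterior SD reaches the target.
        if math.sqrt(1.0 / precision) <= target_se:
            break
        obs_precision = 1.0 / obs_sd ** 2
        mean = (precision * mean + obs_precision * y) / (precision + obs_precision)
        precision += obs_precision
        used += 1
    return used, mean, math.sqrt(1.0 / precision)

true_theta = 1.0
responses = [true_theta] * 20  # noise-free "responses" keep the sketch deterministic

for label, m, s in [("standard normal prior", 0.0, 1.0),
                    ("precise, accurate prior", 1.0, 0.5),
                    ("precise, biased prior", -1.0, 0.5)]:
    n, est, se = items_until_precise(m, s, responses)
    print(f"{label:24s} items={n}  estimate={est:+.3f}  SE={se:.3f}")
```

In this toy setting the precise prior halves the number of items needed, but when it is centered in the wrong place the test stops just as early with an estimate still pulled far toward the prior, which mirrors the overconfidence risk described above.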

Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/57/79/10.1177_01466216221124091.PMC9679926.pdf
Citations: 1
The Standardized S-X² Statistic for Assessing Item Fit.
IF 1.2, Zone 4 (Psychology), Q2 Social Sciences. Pub Date: 2023-01-01. Epub Date: 2022-09-17. DOI: 10.1177/01466216221108077
Zhuangzhuang Han, Sandip Sinharay, Matthew S Johnson, Xiang Liu

The S-X² statistic (Orlando & Thissen, 2000) is popular among researchers and practitioners who are interested in the assessment of item fit. However, the statistic suffers from the Chernoff-Lehmann problem (Chernoff & Lehmann, 1954) and hence does not have a known asymptotic null distribution. This paper suggests a modified version of the S-X² statistic that is based on the modified Rao-Robson χ² statistic (Rao & Robson, 1974). A simulation study and a real data analysis demonstrate that the use of the modified statistic instead of the S-X² statistic would lead to fewer items being flagged for misfit.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9679924/pdf/
Citations: 1
An Investigation Into the Impact of Test Session Disruptions for At-Home Test Administrations.
IF 1.2, Zone 4 (Psychology), Q2 Social Sciences. Pub Date: 2023-01-01. Epub Date: 2022-09-20. DOI: 10.1177/01466216221128011
Katherine E Castellano, Sandip Sinharay, Jiangang Hao, Chen Li

In response to the closures of test centers worldwide due to the COVID-19 pandemic, several testing programs offered large-scale standardized assessments to examinees remotely. However, due to the varying quality of the performance of personal devices and internet connections, more at-home examinees likely suffered "disruptions" or an interruption in the connectivity to their testing session compared to typical test-center administrations. Disruptions have the potential to adversely affect examinees and lead to fairness or validity issues. The goal of this study was to investigate the extent to which disruptions impacted performance of at-home examinees using data from a large-scale admissions test. Specifically, the study involved comparing the average test scores of the disrupted examinees with those of the non-disrupted examinees after weighting the non-disrupted examinees to resemble the disrupted examinees along baseline characteristics. The results show that disruptions had a small negative impact on test scores on average. However, there was little difference in performance between the disrupted and non-disrupted examinees after removing records of the disrupted examinees who were unable to complete the test.
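
The comparison "after weighting the non-disrupted examinees to resemble the disrupted examinees" can be illustrated with a small Python sketch. The abstract does not say which weighting method was used, so the example below swaps in simple post-stratification on a single categorical baseline variable as an assumption; the function name, the strata labels, and the scores are all made up for illustration.

```python
from collections import Counter

def weighted_comparison_mean(disrupted, comparison):
    """Each record is (stratum, score). Reweight the comparison group so its
    stratum shares match the disrupted group, then return its weighted mean.
    Post-stratification is an assumed stand-in for the study's weighting."""
    target = Counter(stratum for stratum, _ in disrupted)
    observed = Counter(stratum for stratum, _ in comparison)
    n_target, n_obs = len(disrupted), len(comparison)
    total_weight = 0.0
    weighted_sum = 0.0
    for stratum, score in comparison:
        # Weight = target share of the stratum / observed share of the stratum.
        w = (target[stratum] / n_target) / (observed[stratum] / n_obs)
        total_weight += w
        weighted_sum += w * score
    return weighted_sum / total_weight

# Hypothetical data: disrupted examinees are mostly from the "low" baseline stratum.
disrupted = [("low", 48), ("low", 50), ("low", 52), ("high", 70)]
non_disrupted = [("low", 51), ("high", 71), ("high", 73), ("high", 72)]

raw_mean = sum(score for _, score in non_disrupted) / len(non_disrupted)
adj_mean = weighted_comparison_mean(disrupted, non_disrupted)
dis_mean = sum(score for _, score in disrupted) / len(disrupted)
print(f"disrupted mean={dis_mean:.2f}  raw comparison={raw_mean:.2f}  weighted comparison={adj_mean:.2f}")
```

With these invented numbers, most of the raw gap between the groups comes from their different baseline composition, and only a small difference remains once the comparison group is reweighted.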

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9679922/pdf/
Citations: 0
Applying Negative Binomial Distribution in Diagnostic Classification Models for Analyzing Count Data.
IF 1, Zone 4 (Psychology), Q4 Psychology, Mathematical. Pub Date: 2023-01-01. Epub Date: 2022-09-06. DOI: 10.1177/01466216221124604
Ren Liu, Ihnwhi Heo, Haiyan Liu, Dexin Shi, Zhehan Jiang

Diagnostic classification models (DCMs) have been used to classify examinees into groups based on their possession status of a set of latent traits. In addition to traditional item-based scoring approaches, examinees may be scored based on their completion of a series of small and similar tasks. Those scores are usually considered as count variables. To model count scores, this study proposes a new class of DCMs that uses the negative binomial distribution at its core. We explained the proposed model framework and demonstrated its use through an operational example. Simulation studies were conducted to evaluate the performance of the proposed model and compare it with the Poisson-based DCM.
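
To show why a negative binomial core can be preferable to a Poisson one for count scores, the Python sketch below compares the two log-likelihoods on a small overdispersed sample. It is not the proposed DCM: in the model described above the mean would additionally depend on the examinee's latent attribute profile, which is omitted here. The mean/size parameterization (variance = mu + mu²/size), the function names, and the data are assumptions for illustration.

```python
import math

def poisson_logpmf(k, mu):
    """Log-probability of count k under a Poisson with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def negbin_logpmf(k, mu, size):
    """Log-probability of count k under a negative binomial with
    mean mu and variance mu + mu**2 / size."""
    p = size / (size + mu)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log1p(-p))

counts = [0, 1, 2, 9, 12]          # hypothetical, clearly overdispersed scores
mu = sum(counts) / len(counts)     # both models share the same mean

pois_ll = sum(poisson_logpmf(k, mu) for k in counts)
nb_ll = sum(negbin_logpmf(k, mu, size=1.0) for k in counts)
print(f"Poisson log-likelihood:           {pois_ll:.3f}")
print(f"Negative binomial log-likelihood: {nb_ll:.3f}")
```

Because the Poisson forces the variance to equal the mean, it assigns very low probability to the large counts in this sample, whereas the extra dispersion parameter lets the negative binomial accommodate them.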

Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/07/94/10.1177_01466216221124604.PMC9679925.pdf
Citations: 0
autoRasch: An R Package to Do Semi-Automated Rasch Analysis.
IF 1.2, Zone 4 (Psychology), Q2 Social Sciences. Pub Date: 2023-01-01. Epub Date: 2022-10-10. DOI: 10.1177/01466216221125178
Feri Wijayanto, Ioan Gabriel Bucur, Perry Groot, Tom Heskes

The R package autoRasch has been developed to perform a Rasch analysis in a (semi-)automated way. The automated part of the analysis is achieved by optimizing the so-called in-plus-out-of-questionnaire log-likelihood (IPOQ-LL) or IPOQ-LL-DIF when differential item functioning (DIF) is included. These criteria measure the quality of fit on a pre-collected survey, depending on which items are included in the final instrument. To compute these criteria, autoRasch fits the generalized partial credit model (GPCM) or the generalized partial credit model with differential item functioning (GPCM-DIF) using penalized joint maximum likelihood estimation (PJMLE). The package further allows the user to reevaluate the output of the automated method and use it as a basis for performing a manual Rasch analysis and provides standard statistics of Rasch analyses (e.g., outfit, infit, person separation reliability, and residual correlation) to support the model reevaluation.

R包autoRasch已经被开发出来以一种(半)自动化的方式执行Rasch分析。分析的自动化部分是通过优化所谓的问卷内加外对数似然(IPOQ-LL)或包含差异项目功能(DIF)的IPOQ-LL-DIF来实现的。这些标准根据最终工具中包含的项目来衡量预先收集的调查的匹配质量。为了计算这些准则,autoRasch使用惩罚联合最大似然估计(PJMLE)拟合广义部分信用模型(GPCM)或带微分项目函数的广义部分信用模型(GPCM- dif)。该软件包进一步允许用户重新评估自动化方法的输出,并将其用作执行手动Rasch分析的基础,并提供Rasch分析的标准统计数据(例如,装备,infit,人员分离可靠性和残差相关性),以支持模型重新评估。
{"title":"autoRasch: An R Package to Do Semi-Automated Rasch Analysis.","authors":"Feri Wijayanto,&nbsp;Ioan Gabriel Bucur,&nbsp;Perry Groot,&nbsp;Tom Heskes","doi":"10.1177/01466216221125178","DOIUrl":"https://doi.org/10.1177/01466216221125178","url":null,"abstract":"<p><p>The R package autoRasch has been developed to perform a Rasch analysis in a (semi-)automated way. The automated part of the analysis is achieved by optimizing the so-called <i>in-plus-out-of-questionnaire log-likelihood</i> (IPOQ-LL) or IPOQ-LL-DIF when differential item functioning (DIF) is included. These criteria measure the quality of fit on a pre-collected survey, depending on which items are included in the final instrument. To compute these criteria, autoRasch fits the generalized partial credit model (GPCM) or the generalized partial credit model with differential item functioning (GPCM-DIF) using penalized joint maximum likelihood estimation (PJMLE). The package further allows the user to reevaluate the output of the automated method and use it as a basis for performing a manual Rasch analysis and provides standard statistics of Rasch analyses (e.g., outfit, infit, person separation reliability, and residual correlation) to support the model reevaluation.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9679921/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40494732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Outlier Detection Using t-test in Rasch IRT Equating under NEAT Design.
IF 1.2 Zone 4 (Psychology) Q2 Social Sciences Pub Date: 2023-01-01 Epub Date: 2022-09-06 DOI: 10.1177/01466216221124045
Chunyan Liu, Daniel Jurich

In equating practice, outliers among the anchor items can degrade equating accuracy and threaten the validity of test scores. Therefore, the stability of anchor item performance should be evaluated before equating. This study used simulation to investigate how well the t-test method detects outliers and compared it with other outlier detection methods, including the logit difference method with cutoff values of 0.5 and 0.3 and the robust z statistic with a cutoff value of 2.7. The investigated factors included sample size, proportion of outliers, direction of item difficulty drift, and group difference. Across all simulated conditions, the t-test method outperformed the other methods in sensitivity for flagging true outliers, bias of the estimated translation constant, and root mean square error of examinee ability estimates.
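
To make the two comparison methods named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each anchor item has a difficulty difference (new form minus old form, in logits), that the robust z statistic takes its commonly cited form (difference minus the median, divided by 0.74 times the interquartile range), and that the logit difference method compares each difference, centered on the mean difference, against the cutoff. The function names and the example values are hypothetical and purely illustrative. The t-test method is not sketched because the abstract does not specify it.

```python
from statistics import mean, median, quantiles


def flag_robust_z(diffs, cutoff=2.7):
    """Return indices of anchor items whose |robust z| exceeds the cutoff.

    Assumed form: z_i = (d_i - median(d)) / (0.74 * IQR(d)).
    """
    med = median(diffs)
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust z is undefined for these data")
    return [i for i, d in enumerate(diffs)
            if abs((d - med) / (0.74 * iqr)) > cutoff]


def flag_logit_difference(diffs, cutoff=0.5):
    """Return indices of items whose centered difference exceeds the cutoff.

    Assumption: differences are centered on their mean before comparison.
    """
    center = mean(diffs)
    return [i for i, d in enumerate(diffs) if abs(d - center) > cutoff]


# Hypothetical difficulty differences (logits) for eight anchor items;
# the item at index 6 has drifted noticeably.
example_diffs = [0.05, -0.10, 0.02, 0.08, -0.04, 0.00, 1.20, -0.06]

if __name__ == "__main__":
    print("robust z (2.7):", flag_robust_z(example_diffs))
    print("logit difference (0.5):", flag_logit_difference(example_diffs, 0.5))
    print("logit difference (0.3):", flag_logit_difference(example_diffs, 0.3))
```

Note that the robust z statistic centers on the median and scales by the interquartile range, so a single drifting item has little influence on the reference point, whereas the mean used here for the logit difference is pulled toward the outlier.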

Citations: 0