Journal of Educational Measurement最新文献

英文中文

Detecting Group Collaboration Using Multiple Correspondence Analysis 利用多重对应分析检测群体协作

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2023-03-23 DOI: 10.1111/jedm.12363

Joseph H. Grochowalski, Amy Hendrickson

Test takers wishing to gain an unfair advantage often share answers with other test takers, either sharing all answers (a full key) or some (a partial key). Detecting key sharing during a tight testing window requires an efficient, easily interpretable, and rich form of analysis that is descriptive and inferential. We introduce a detection method based on multiple correspondence analysis (MCA) that identifies test takers with unusual response similarities. The method simultaneously detects multiple shared keys (partial or full), plots results, and is computationally efficient as it requires only matrix operations. We describe the method, evaluate its detection accuracy under various simulation conditions, and demonstrate the procedure on a real data set with known test-taking misbehavior. The simulation results showed that the MCA method had reasonably high power under realistic conditions and maintained the nominal false-positive level, except when the group size was very large or partial shared keys had more than 50% of the items. The real data analysis illustrated visual detection procedures and inference about the item responses possibly shared in the key, which was likely shared among 91 test takers, many of whom were confirmed by nonstatistical investigation to have engaged in test-taking misconduct.

希望获得不公平优势的考生经常与其他考生共享答案，要么共享所有答案(完整答案)，要么共享部分答案(部分答案)。在严格的测试窗口中检测密钥共享需要一种高效、易于解释和丰富的分析形式，这种形式是描述性和推断性的。我们介绍了一种基于多重对应分析(MCA)的检测方法，该方法可以识别具有异常反应相似性的考生。该方法同时检测多个共享密钥(部分或全部)，绘制结果，并且计算效率高，因为它只需要矩阵操作。我们描述了该方法，评估了其在各种模拟条件下的检测精度，并在具有已知测试错误行为的真实数据集上演示了该方法。仿真结果表明，除了组大小非常大或部分共享密钥超过50%的项目外，MCA方法在实际条件下具有相当高的功率，并保持名义上的假阳性水平。真实的数据分析说明了视觉检测程序和对可能在关键中共享的项目反应的推断，该关键可能在91名考生中共享，其中许多人被非统计调查证实参与了考试不当行为。

{"title":"Detecting Group Collaboration Using Multiple Correspondence Analysis","authors":"Joseph H. Grochowalski, Amy Hendrickson","doi":"10.1111/jedm.12363","DOIUrl":"10.1111/jedm.12363","url":null,"abstract":"Test takers wishing to gain an unfair advantage often share answers with other test takers, either sharing all answers (a full key) or some (a partial key). Detecting key sharing during a tight testing window requires an efficient, easily interpretable, and rich form of analysis that is descriptive and inferential. We introduce a detection method based on multiple correspondence analysis (MCA) that identifies test takers with unusual response similarities. The method simultaneously detects multiple shared keys (partial or full), plots results, and is computationally efficient as it requires only matrix operations. We describe the method, evaluate its detection accuracy under various simulation conditions, and demonstrate the procedure on a real data set with known test-taking misbehavior. The simulation results showed that the MCA method had reasonably high power under realistic conditions and maintained the nominal false-positive level, except when the group size was very large or partial shared keys had more than 50% of the items. The real data analysis illustrated visual detection procedures and inference about the item responses possibly shared in the key, which was likely shared among 91 test takers, many of whom were confirmed by nonstatistical investigation to have engaged in test-taking misconduct.","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"402-427"},"PeriodicalIF":1.3,"publicationDate":"2023-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44728675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Pretest Item Calibration in Computerized Multistage Adaptive Testing 计算机多阶段自适应测试中的测试前项目校准

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2023-03-10 DOI: 10.1111/jedm.12361

Rabia Karatoprak Ersen, Won-Chan Lee

The purpose of this study was to compare calibration and linking methods for placing pretest item parameter estimates on the item pool scale in a 1-3 computerized multistage adaptive testing design in terms of item parameter recovery. Two models were used: embedded-section, in which pretest items were administered within a separate module, and embedded-items, in which pretest items were distributed across operational modules. The calibration methods were separate calibration with linking (SC) and fixed calibration (FC) with three parallel approaches under each (FC-1 and SC-1; FC-2 and SC-2; and FC-3 and SC-3). The FC-1 and SC-1 used only operational items in the routing module to link pretest items. The FC-2 and SC-2 also used only operational items in the routing module for linking, but in addition, the operational items in second stage modules were freely estimated. The FC-3 and SC-3 used operational items in all modules to link pretest items. The third calibration approach (i.e., FC-3 and SC-3) yielded the best results. For all three approaches, SC outperformed FC in all study conditions which were module length, sample size and examinee distributions.

本研究的目的是比较在1-3个计算机化多阶段自适应测试设计中，将预测项目参数估计放置在项目池量表上的校准和连接方法在项目参数恢复方面的差异。使用了两种模型:嵌入部分模型，其中预试项目在一个单独的模块中进行管理;嵌入项目模型，其中预试项目分布在操作模块中。标定方法分为连接分离标定(SC)和固定标定(FC)两种，分别采用FC-1和SC-1两种平行标定方法;FC-2和SC-2;FC-3和SC-3)。FC-1和SC-1只使用路由模块中的操作项来连接预测项。FC-2和SC-2同样只使用路由模块的操作项进行连接，但二级模块的操作项是自由估计的。FC-3和SC-3使用所有模块中的操作项目来连接预测项目。第三种校准方法(即FC-3和SC-3)产生的结果最好。对于所有三种方法，SC在模块长度，样本量和考生分布的所有研究条件下都优于FC。

{"title":"Pretest Item Calibration in Computerized Multistage Adaptive Testing","authors":"Rabia Karatoprak Ersen, Won-Chan Lee","doi":"10.1111/jedm.12361","DOIUrl":"10.1111/jedm.12361","url":null,"abstract":"The purpose of this study was to compare calibration and linking methods for placing pretest item parameter estimates on the item pool scale in a 1-3 computerized multistage adaptive testing design in terms of item parameter recovery. Two models were used: embedded-section, in which pretest items were administered within a separate module, and embedded-items, in which pretest items were distributed across operational modules. The calibration methods were separate calibration with linking (SC) and fixed calibration (FC) with three parallel approaches under each (FC-1 and SC-1; FC-2 and SC-2; and FC-3 and SC-3). The FC-1 and SC-1 used only operational items in the routing module to link pretest items. The FC-2 and SC-2 also used only operational items in the routing module for linking, but in addition, the operational items in second stage modules were freely estimated. The FC-3 and SC-3 used operational items in all modules to link pretest items. The third calibration approach (i.e., FC-3 and SC-3) yielded the best results. For all three approaches, SC outperformed FC in all study conditions which were module length, sample size and examinee distributions.","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"379-401"},"PeriodicalIF":1.3,"publicationDate":"2023-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12361","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48133014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Classical Item Analysis from a Signal Detection Perspective 信号检测视角下的经典项目分析

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2023-02-27 DOI: 10.1111/jedm.12358

Lawrence T. DeCarlo

A conceptualization of multiple-choice exams in terms of signal detection theory (SDT) leads to simple measures of item difficulty and item discrimination that are closely related to, but also distinct from, those used in classical item analysis (CIA). The theory defines a “true split,” depending on whether or not examinees know an item, and so it provides a basis for using total scores to split item tables, as done in CIA, while also clarifying benefits and limitations of the approach. The SDT item difficulty and discrimination measures differ from those used in CIA in that they explicitly consider the role of distractors and avoid limitations due to range restrictions. A new screening measure is also introduced. The measures are theoretically well-grounded and are simple to compute by hand calculations or with standard software for choice models; simulations show that they offer advantages over traditional measures.

从信号检测理论(SDT)的角度将多项选择题考试概念化，导致了项目难度和项目辨别力的简单测量，这些测量与经典项目分析(CIA)中使用的方法密切相关，但又截然不同。该理论定义了“真正的分割”，取决于考生是否知道一个项目，因此它为使用总分来分割项目表提供了基础，就像CIA所做的那样，同时也阐明了这种方法的优点和局限性。SDT的项目难度和区分度测量与CIA的不同之处在于，它们明确考虑了干扰因素的作用，避免了由于范围限制而产生的限制。并介绍了一种新的筛选措施。这些措施在理论上是有充分根据的，而且通过手工计算或使用标准的选择模型软件计算起来很简单;模拟表明，它们比传统的测量方法更有优势。

引用次数: 0

Corrigendum: A Residual-Based Differential Item Functioning Detection Framework in Item Response Theory 更正:项目反应理论中基于残差的差异项目功能检测框架

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2023-02-26 DOI: 10.1111/jedm.12362

Hwanggyu Lim, Edison M. Choe, Kyung T. Han

In the original article, it was written that “Then the MLE scoring and DIF analysis with RDIF statistics were performed using the est_score and rdif functions, respectively, in the R (R Core Team, 2019) package irtplay (p.90).” However, the irtplay package has been removed from the CRAN repository due to intellectual property (IP) violation issues. Instead, a new R package called irtQ (Lim & Wells, 2023) has been released as a successor to irtplay. All IP issues have been resolved in irtQ, ensuring that the package is compliant with industry standards. https://doi.org/10.1111/jedm.12313

We would like to inform that the same functions of est_score and rdif used in the original study are also included in irtQ. Thus, it can be used as a replacement for irtplay. We apologize for any confusion caused by the previous version of the article.

在最初的文章中，它是这样写的:“然后在R (R Core Team, 2019)包irtplay中分别使用est_score和RDIF函数执行MLE评分和使用RDIF统计的DIF分析(第90页)。”但是，由于知识产权(IP)侵犯问题，该irtplay包已从CRAN存储库中删除。取而代之的是一个名为irtQ (Lim &《威尔斯，2023年》(Wells, 2023)作为irtplay的继承者发布。irtQ解决了所有IP问题，确保封装符合行业标准。https://doi.org/10.1111/jedm.12313We想告知，irtQ中也包含了原始研究中使用的est_score和rdif的相同函数。因此，它可以作为一种替代节目。对于之前的文章版本所造成的混乱，我们深表歉意。

引用次数: 0

Using Linkage Sets to Improve Connectedness in Rater Response Model Estimation 利用链接集提高评价响应模型估计中的连通性

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2023-02-19 DOI: 10.1111/jedm.12360

Jodi M. Casabianca, John R. Donoghue, Hyo Jeong Shin, Szu-Fu Chao, Ikkyu Choi

Using item-response theory to model rater effects provides an alternative solution for rater monitoring and diagnosis, compared to using standard performance metrics. In order to fit such models, the ratings data must be sufficiently connected in order to estimate rater effects. Due to popular rating designs used in large-scale testing scenarios, there tends to be a large proportion of missing data, yielding sparse matrices and estimation issues. In this article, we explore the impact of different types of connectedness, or linkage, brought about by using a linkage set—a collection of responses scored by most or all raters. We also explore the impact of the properties and composition of the linkage set, the different connectedness yielded from different rating designs, and the role of scores from automated scoring engines. In designing monitoring systems using the rater response version of the generalized partial credit model, the study results suggest use of a linkage set, especially a large one that is comprised of responses representing the full score scale. Results also show that a double-human-scoring design provides more connectedness than a design with one human and an automated scoring engine. Furthermore, scores from automated scoring engines do not provide adequate connectedness. We discuss considerations for operational implementation and further study.

与使用标准绩效指标相比，使用项目反应理论来模拟评分者效应为评分者监测和诊断提供了另一种解决方案。为了拟合这样的模型，评级数据必须充分连接，以便估计评级效应。由于在大规模测试场景中使用的流行评级设计，往往存在很大比例的缺失数据，从而产生稀疏矩阵和估计问题。在本文中，我们探讨了不同类型的连通性或联系的影响，通过使用联系集(由大多数或所有评分者评分的回答集合)带来的影响。我们还探讨了链接集的属性和组成的影响，不同评级设计产生的不同连通性，以及自动评分引擎得分的作用。在使用广义部分信用模型的评分反应版本设计监测系统时，研究结果建议使用链接集，特别是由代表满分量表的反应组成的大型链接集。结果还表明，与一个人和自动评分引擎的设计相比，双人评分设计提供了更多的连通性。此外，来自自动评分引擎的分数不能提供足够的连接性。我们讨论了操作实施和进一步研究的考虑。

{"title":"Using Linkage Sets to Improve Connectedness in Rater Response Model Estimation","authors":"Jodi M. Casabianca, John R. Donoghue, Hyo Jeong Shin, Szu-Fu Chao, Ikkyu Choi","doi":"10.1111/jedm.12360","DOIUrl":"10.1111/jedm.12360","url":null,"abstract":"Using item-response theory to model rater effects provides an alternative solution for rater monitoring and diagnosis, compared to using standard performance metrics. In order to fit such models, the ratings data must be sufficiently connected in order to estimate rater effects. Due to popular rating designs used in large-scale testing scenarios, there tends to be a large proportion of missing data, yielding sparse matrices and estimation issues. In this article, we explore the impact of different types of connectedness, or linkage, brought about by using a linkage set—a collection of responses scored by most or all raters. We also explore the impact of the properties and composition of the linkage set, the different connectedness yielded from different rating designs, and the role of scores from automated scoring engines. In designing monitoring systems using the rater response version of the generalized partial credit model, the study results suggest use of a linkage set, especially a large one that is comprised of responses representing the full score scale. Results also show that a double-human-scoring design provides more connectedness than a design with one human and an automated scoring engine. Furthermore, scores from automated scoring engines do not provide adequate connectedness. We discuss considerations for operational implementation and further study.","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"428-454"},"PeriodicalIF":1.3,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47979944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems 用模拟复验估计诊断评估系统的可靠性

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2023-02-19 DOI: 10.1111/jedm.12359

W. Jake Thompson, Brooke Nash, Amy K. Clark, Jeffrey C. Hoover

As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment systems. In this article, we describe and evaluate a method for simulating retests to summarize reliability evidence at multiple reporting levels. We evaluate how the performance of reliability estimates from simulated retests compares to other measures of classification consistency and accuracy for diagnostic assessments that have previously been described in the literature, but which limit the level at which reliability can be reported. Overall, the findings show that reliability estimates from simulated retests are an accurate measure of reliability and are consistent with other measures of reliability for diagnostic assessments. We then apply this method to real data from the Examination for the Certificate of Proficiency in English to demonstrate the method in practice and compare reliability estimates from observed data. Finally, we discuss implications for the field and possible next directions.

随着诊断分类模型在大规模作战评估中的应用越来越广泛，我们必须考虑可靠性的估计和报告方法。研究人员必须探索与诊断评估系统的设计、评分和报告水平一致的传统可靠性方法的替代方案。在本文中，我们描述和评估了一种模拟复测的方法，以总结多个报告水平的可靠性证据。我们评估从模拟复测的可靠性估计的性能如何与以前在文献中描述的诊断评估的分类一致性和准确性的其他措施进行比较，但这限制了可靠性可以报告的水平。总的来说，研究结果表明，模拟复测的可靠性估计是可靠度的准确度量，并且与诊断评估的其他可靠性度量一致。然后，我们将该方法应用于英语水平证书考试的真实数据，以在实践中证明该方法，并比较观察数据的可靠性估计。最后，我们讨论了该领域的意义和可能的下一步方向。

{"title":"Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems","authors":"W. Jake Thompson, Brooke Nash, Amy K. Clark, Jeffrey C. Hoover","doi":"10.1111/jedm.12359","DOIUrl":"10.1111/jedm.12359","url":null,"abstract":"As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment systems. In this article, we describe and evaluate a method for simulating retests to summarize reliability evidence at multiple reporting levels. We evaluate how the performance of reliability estimates from simulated retests compares to other measures of classification consistency and accuracy for diagnostic assessments that have previously been described in the literature, but which limit the level at which reliability can be reported. Overall, the findings show that reliability estimates from simulated retests are an accurate measure of reliability and are consistent with other measures of reliability for diagnostic assessments. We then apply this method to real data from the Examination for the Certificate of Proficiency in English to demonstrate the method in practice and compare reliability estimates from observed data. Finally, we discuss implications for the field and possible next directions.","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"455-475"},"PeriodicalIF":1.3,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47801652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Exploration of an Improved Aggregate Student Growth Measure Using Data from Two States 利用两个州的数据探索一种改进的学生综合成长测量方法

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2023-01-31 DOI: 10.1111/jedm.12354

Katherine E. Castellano, Daniel F. McCaffrey, J. R. Lockwood

The simple average of student growth scores is often used in accountability systems, but it can be problematic for decision making. When computed using a small/moderate number of students, it can be sensitive to the sample, resulting in inaccurate representations of growth of the students, low year-to-year stability, and inequities for low-incidence groups. An alternative designed to address these issues is to use an Empirical Best Linear Prediction (EBLP), which is a weighted average of growth score data from other years and/or subjects. We apply both approaches to two statewide datasets to answer empirical questions about their performance. The EBLP outperforms the simple average in accuracy and cross-year stability with the exception that accuracy was not necessarily improved for very large districts in one of the states. In such exceptions, we show a beneficial alternative may be to use a hybrid approach in which very large districts receive the simple average and all others receive the EBLP. We find that adding more growth score data to the computation of the EBLP can improve accuracy, but not necessarily for larger schools/districts. We review key decision points in aggregate growth reporting and in specifying an EBLP weighted average in practice.

学生成长分数的简单平均值通常用于问责制，但它可能会对决策产生问题。当使用少量/中等数量的学生进行计算时，它可能对样本很敏感，导致学生增长的不准确表示，较低的年度稳定性和低发病率组的不公平。解决这些问题的另一种方法是使用经验最佳线性预测(EBLP)，它是其他年份和/或主题的增长得分数据的加权平均值。我们将这两种方法应用于两个全州范围的数据集，以回答有关其性能的实证问题。EBLP在准确性和跨年稳定性方面优于简单平均值，但在一个州的非常大的地区，准确性不一定得到改善。在这种例外情况下，我们提出了一种有益的替代方案，可能是使用混合方法，其中非常大的地区接受简单平均，而所有其他地区接受EBLP。我们发现，在EBLP的计算中加入更多的成长分数数据可以提高准确性，但对于较大的学校/学区来说不一定。我们回顾了实践中总增长报告和指定EBLP加权平均值的关键决策点。

{"title":"An Exploration of an Improved Aggregate Student Growth Measure Using Data from Two States","authors":"Katherine E. Castellano, Daniel F. McCaffrey, J. R. Lockwood","doi":"10.1111/jedm.12354","DOIUrl":"10.1111/jedm.12354","url":null,"abstract":"The simple average of student growth scores is often used in accountability systems, but it can be problematic for decision making. When computed using a small/moderate number of students, it can be sensitive to the sample, resulting in inaccurate representations of growth of the students, low year-to-year stability, and inequities for low-incidence groups. An alternative designed to address these issues is to use an Empirical Best Linear Prediction (EBLP), which is a weighted average of growth score data from other years and/or subjects. We apply both approaches to two statewide datasets to answer empirical questions about their performance. The EBLP outperforms the simple average in accuracy and cross-year stability with the exception that accuracy was not necessarily improved for very large districts in one of the states. In such exceptions, we show a beneficial alternative may be to use a hybrid approach in which very large districts receive the simple average and all others receive the EBLP. We find that adding more growth score data to the computation of the EBLP can improve accuracy, but not necessarily for larger schools/districts. We review key decision points in aggregate growth reporting and in specifying an EBLP weighted average in practice.","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 2","pages":"173-201"},"PeriodicalIF":1.3,"publicationDate":"2023-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41556588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Classification Accuracy and Consistency of Compensatory Composite Test Scores 补偿性综合测试成绩的分类准确性和一致性

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2023-01-28 DOI: 10.1111/jedm.12357

J. Carl Setzer, Ying Cheng, Cheng Liu

Test scores are often used to make decisions about examinees, such as in licensure and certification testing, as well as in many educational contexts. In some cases, these decisions are based upon compensatory scores, such as those from multiple sections or components of an exam. Classification accuracy and classification consistency are two psychometric characteristics of test scores that are often reported when decisions are based on those scores, and several techniques currently exist for estimating both accuracy and consistency. However, research on classification accuracy and consistency on compensatory test scores is scarce. This study demonstrates two techniques that can be used to estimate classification accuracy and consistency when test scores are used in a compensatory manner. First, a simulation study demonstrates that both methods provide very similar results under the studied conditions. Second, we demonstrate how the two methods could be used with a high-stakes licensure exam.

考试成绩经常被用来决定考生，比如在执照和认证考试中，以及在许多教育环境中。在某些情况下，这些决定是基于补偿性分数，例如来自考试的多个部分或组成部分的分数。分类准确性和分类一致性是测试分数的两个心理测量特征，当决策基于这些分数时，通常会报告这些特征，目前存在几种评估准确性和一致性的技术。然而，对代偿测试分数的分类准确性和一致性的研究却很少。本研究展示了两种技术，可以用来估计分类的准确性和一致性时，测试成绩使用补偿的方式。首先，仿真研究表明，在研究条件下，两种方法的结果非常相似。其次，我们将演示如何在高风险的执照考试中使用这两种方法。

引用次数: 0

Editorial for JEM issue 59-4 JEM第59-4期社论

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2023-01-06 DOI: 10.1111/jedm.12356

Sandip Sinharay

引用次数: 0

Specifying the Three Ws in Educational Measurement: Who Uses Which Scores for What Purpose? 指定教育测量中的三个w:谁使用哪个分数用于什么目的?

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-12-25 DOI: 10.1111/jedm.12355

Andrew Ho

I argue that understanding and improving educational measurement requires specificity about actors, scores, and purpose: Who uses which scores for what purpose? I show how this specificity complements Briggs’ frameworks for educational measurement that he presented in his 2022 address as president of the National Council on Measurement in Education.

我认为理解和改进教育测量需要明确参与者、分数和目的:谁将哪个分数用于什么目的?我展示了这种特殊性是如何补充布里格斯在2022年作为国家教育测量委员会主席的演讲中提出的教育测量框架的。

引用次数: 1

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Journal of Educational Measurement

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀