Educational and Psychological Measurement: Latest Articles

A Comparison of Person-Fit Indices to Detect Social Desirability Bias.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-10-01 Epub Date: 2022-10-18 DOI: 10.1177/00131644221129577

Sanaz Nazari, Walter L Leite, A Corinne Huggins-Manley

Social desirability bias (SDB) has been a major concern in educational and psychological assessments when measuring latent variables because it has the potential to introduce measurement error and bias in assessments. Person-fit indices can detect bias in the form of misfitted response vectors. The objective of this study was to compare the performance of 14 person-fit indices to identify SDB in simulated responses. The area under the curve (AUC) of receiver operating characteristic (ROC) curve analysis was computed to evaluate the predictive power of these statistics. The findings showed that the agreement statistic $(A)$ outperformed all other person-fit indices, while the disagreement statistic $(D)$, dependability statistic $(E)$, and the number of Guttman errors $(G)$ also demonstrated high AUCs to detect SDB. Recommendations for practitioners to use these fit indices are provided.
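As a rough illustration of two quantities named in this abstract, the sketch below counts Guttman errors for a single dichotomous response vector and computes an AUC from person-fit scores by pairwise comparison. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the 0/1 scoring, and the toy data are all hypothetical, and the study itself may have used other item formats or definitions.

```python
def guttman_errors(responses, easiness):
    """Count Guttman errors in one 0/1 response vector.

    A Guttman error is a pair of items where the easier item is scored 0
    and the harder item is scored 1. `easiness` holds one value per item
    (for example the proportion of respondents endorsing it); higher = easier.
    """
    # Order the responses from the easiest item to the hardest.
    order = sorted(range(len(responses)), key=lambda i: -easiness[i])
    errors = 0
    zeros_seen = 0
    for i in order:
        if responses[i] == 0:
            zeros_seen += 1
        else:
            # Every easier item already scored 0 forms an error pair with this 1.
            errors += zeros_seen
    return errors


def auc(scores, labels):
    """AUC as the probability that a flagged case (label 1) outscores an
    unflagged case (label 0), counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: five items, easiest first.
easiness = [0.9, 0.8, 0.6, 0.4, 0.2]
consistent = guttman_errors([1, 1, 1, 0, 0], easiness)  # perfect Guttman pattern
reversed_ = guttman_errors([0, 0, 1, 1, 1], easiness)   # fully reversed pattern
```

In a simulation where the biased respondents are known, a fit index such as this count can be passed to `auc` as the score, with the known bias indicator as the label.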

Educational and Psychological Measurement, 83(5), 907-928. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470160/pdf/

Citations: 0

Detecting Rating Scale Malfunctioning With the Partial Credit Model and Generalized Partial Credit Model.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-10-01 Epub Date: 2022-08-12 DOI: 10.1177/00131644221116292

Stefanie A Wind

Rating scale analysis techniques provide researchers with practical tools for examining the degree to which ordinal rating scales (e.g., Likert-type scales or performance assessment rating scales) function in psychometrically useful ways. When rating scales function as expected, researchers can interpret ratings in the intended direction (i.e., lower ratings mean "less" of a construct than higher ratings), distinguish between categories in the scale (i.e., each category reflects a unique level of the construct), and compare ratings across elements of the measurement instrument, such as individual items. Although researchers have used these techniques in a variety of contexts, few studies have systematically explored their sensitivity to problematic rating scale characteristics (i.e., "rating scale malfunctioning"). I used a real data analysis and a simulation study to systematically explore the sensitivity of rating scale analysis techniques based on two popular polytomous item response theory (IRT) models: the partial credit model (PCM) and the generalized partial credit model (GPCM). Overall, results indicated that both models provide valuable information about rating scale threshold ordering and precision that can help researchers understand how their rating scales are functioning and identify areas for further investigation or revision. However, there were some differences between models in their sensitivity to rating scale malfunctioning in certain conditions. Implications for research and practice are discussed.
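For readers unfamiliar with the two models, the sketch below computes category probabilities under the standard GPCM formulation, which reduces to the PCM when the discrimination is fixed at 1, and checks whether an item's thresholds are in increasing order. This is a minimal sketch assuming the textbook parameterization; the function names and example values are hypothetical, and the article's own analyses are not reproduced here.

```python
import math


def gpcm_probs(theta, thresholds, a=1.0):
    """Category probabilities for one item under the GPCM.

    theta      : person location on the latent scale
    thresholds : step parameters delta_1..delta_M (M + 1 categories)
    a          : item discrimination; a = 1.0 gives the PCM
    """
    # Cumulative sums of a * (theta - delta_j); category 0 has an empty sum.
    cum = [0.0]
    for delta in thresholds:
        cum.append(cum[-1] + a * (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cum)
    expo = [math.exp(c - top) for c in cum]
    total = sum(expo)
    return [e / total for e in expo]


def thresholds_ordered(thresholds):
    """True when the step parameters increase strictly across categories."""
    return all(lo < hi for lo, hi in zip(thresholds, thresholds[1:]))


# Hypothetical three-category item evaluated at theta = 0.
probs = gpcm_probs(0.0, [-1.0, 1.0])
```

Disordered thresholds (for example `[1.0, -1.0]`) are one commonly examined symptom of a rating scale category that is never the most probable response.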

Educational and Psychological Measurement, 83(5), 953-983. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470161/pdf/

Citations: 0

Equidistant Response Options on Likert-Type Instruments: Testing the Interval Scaling Assumption Using Mplus.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-10-01 Epub Date: 2022-10-27 DOI: 10.1177/00131644221130482

Georgios Sideridis, Ioannis Tsaousis, Hanan Ghamdi

The purpose of the present study was to provide the means to evaluate the "interval-scaling" assumption that governs the use of parametric statistics and continuous data estimators in self-report instruments that utilize Likert-type scaling. Using simulated and real data, the methodology to test for this important assumption is evaluated using the popular software Mplus 8.8. Evidence on meeting the assumption is provided using the Wald test and the equidistant index. It is suggested that routine evaluations of self-report instruments engage the present methodology so that the most appropriate estimator will be implemented when testing the construct validity of self-report instruments.

Citations: 0

Position of Correct Option and Distractors Impacts Responses to Multiple-Choice Items: Evidence From a National Test.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-10-01 Epub Date: 2022-11-12 DOI: 10.1177/00131644221132335

Séverin Lions, Pablo Dartnell, Gabriela Toledo, María Inés Godoy, Nora Córdova, Daniela Jiménez, Julie Lemarié

Even though the impact of the position of response options on answers to multiple-choice items has been investigated for decades, it remains debated. Research on this topic is inconclusive, perhaps because too few studies have obtained experimental data from large samples in a real-world context and have manipulated the position of both the correct response and the distractors. Since multiple-choice tests' outcomes can be strikingly consequential and option position effects constitute a potential source of measurement error, these effects should be clarified. In this study, two experiments in which the position of the correct response and the distractors was carefully manipulated were performed within a Chilean national high-stakes standardized test taken by 195,715 examinees. Results show small but clear and systematic effects of option position on examinees' responses in both experiments. They consistently indicate that a five-option item is slightly easier when the correct response is in position A rather than E and when the most attractive distractor comes after, and far away from, the correct response. They clarify and extend previous findings, showing that the appeal of all options is influenced by position. The existence and nature of a potential interference phenomenon between the options' processing are discussed, and implications for test development are considered.

Educational and Psychological Measurement, 83(5), 861-884. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470158/pdf/

Citations: 0

The Impact and Detection of Uniform Differential Item Functioning for Continuous Item Response Models.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-10-01 Epub Date: 2022-07-21 DOI: 10.1177/00131644221111993

W Holmes Finch

Psychometricians have devoted much research and attention to categorical item responses, leading to the development and widespread use of item response theory for the estimation of model parameters and identification of items that do not perform in the same way for examinees from different population subgroups (e.g., differential item functioning [DIF]). With the increasing use of computer-based measurement, use of items with a continuous response modality is becoming more common. Models for use with these items have been developed and refined in recent years, but less attention has been devoted to investigating DIF for these continuous response models (CRMs). Therefore, the purpose of this simulation study was to compare the performance of three potential methods for assessing DIF for CRMs, including regression, the MIMIC model, and factor invariance testing. Study results revealed that the MIMIC model provided a combination of Type I error control and relatively high power for detecting DIF. Implications of these findings are discussed.

Citations: 0

Detecting Preknowledge Cheating via Innovative Measures: A Mixture Hierarchical Model for Jointly Modeling Item Responses, Response Times, and Visual Fixation Counts.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-10-01 Epub Date: 2022-11-16 DOI: 10.1177/00131644221136142

Kaiwen Man, Jeffrey R Harring

Preknowledge cheating jeopardizes the validity of inferences based on test results. Many methods have been developed to detect preknowledge cheating by jointly analyzing item responses and response times. Gaze fixations, an essential eye-tracker measure, can be utilized to help detect aberrant testing behavior with improved accuracy beyond using product and process data types in isolation. As such, this study proposes a mixture hierarchical model that integrates item responses, response times, and visual fixation counts collected from an eye-tracker (a) to detect aberrant test takers who have different levels of preknowledge and (b) to account for nuances in behavioral patterns between normally-behaved and aberrant examinees. A Bayesian approach to estimating model parameters is carried out via an MCMC algorithm. Finally, the proposed model is applied to experimental data to illustrate how the model can be used to identify test takers having preknowledge on the test items.

Citations: 0

The NEAT Equating Via Chaining Random Forests in the Context of Small Sample Sizes: A Machine-Learning Method.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-10-01 Epub Date: 2022-09-04 DOI: 10.1177/00131644221120899

Zhehan Jiang, Yuting Han, Lingling Xu, Dexin Shi, Ren Liu, Jinying Ouyang, Fen Cai

The portion of responses that is absent in the nonequivalent groups with anchor test (NEAT) design can be treated as a planned missing-data scenario. In the context of small sample sizes, we present a machine learning (ML)-based imputation technique called chaining random forests (CRF) to perform equating tasks within the NEAT design. Specifically, seven CRF-based imputation equating methods are proposed based on different data augmentation methods. The equating performance of the proposed methods is examined through a simulation study. Five factors are considered: (a) test length (20, 30, 40, 50), (b) sample size per test form (50 versus 100), (c) ratio of common/anchor items (0.2 versus 0.3), (d) equivalent versus nonequivalent groups taking the two forms (no mean difference versus a mean difference of 0.5), and (e) three different types of anchors (random, easy, and hard), resulting in 96 conditions. In addition, five traditional equating methods were considered: (1) the Tucker method, (2) the Levine observed score method, (3) the equipercentile equating method, (4) the circle-arc method, and (5) concurrent calibration based on the Rasch model. Together with the seven CRF-based imputation equating methods, this gives a total of 12 methods in this study. The findings suggest that, benefiting from the advantages of ML techniques, CRF-based methods that incorporate the equating result of the Tucker method, such as the IMP_total_Tucker, IMP_pair_Tucker, and IMP_Tucker_cirlce methods, can yield more robust and trustworthy estimates for the "missingness" in an equating task and therefore produce more accurate equated scores than their counterparts in short tests with small samples.
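The figure of 96 conditions follows from fully crossing the five factors listed in the abstract (4 × 2 × 2 × 2 × 3 = 96). The sketch below enumerates that grid to make the arithmetic explicit. The variable names and dictionary keys are illustrative assumptions; only the factor levels are taken from the abstract.

```python
from itertools import product

# Factor levels as stated in the abstract; names are illustrative only.
test_lengths = [20, 30, 40, 50]
sample_sizes = [50, 100]             # per test form
anchor_ratios = [0.2, 0.3]           # share of common/anchor items
group_mean_diffs = [0.0, 0.5]        # equivalent vs. nonequivalent groups
anchor_types = ["random", "easy", "hard"]

# Fully crossed simulation design: one dict per condition.
conditions = [
    {
        "test_length": length,
        "sample_size": n,
        "anchor_ratio": ratio,
        "mean_diff": diff,
        "anchor_type": anchor,
    }
    for length, n, ratio, diff, anchor in product(
        test_lengths, sample_sizes, anchor_ratios, group_mean_diffs, anchor_types
    )
]

n_conditions = len(conditions)   # 4 * 2 * 2 * 2 * 3
n_methods = 5 + 7                # traditional + CRF-based methods
```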

Educational and Psychological Measurement, 83(5), 984-1006. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470159/pdf/

Citations: 0

Generalized Mantel-Haenszel Estimators for Simultaneous Differential Item Functioning Tests.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-10-01 Epub Date: 2022-10-15 DOI: 10.1177/00131644221128341

Ivy Liu, Thomas Suesse, Samuel Harvey, Peter Yongqi Gu, Daniel Fernández, John Randal

The Mantel-Haenszel estimator is one of the most popular techniques for measuring differential item functioning (DIF). A generalization of this estimator is applied to the context of DIF to compare items by taking the covariance of odds ratio estimators between dependent items into account. Unlike item response theory, the method does not rely on the local item independence assumption, which is likely to be violated when one item provides clues about the answer to another item. Furthermore, we use these (co)variance estimators to construct a hypothesis test to assess DIF for multiple items simultaneously. A simulation study is presented to assess the performance of several tests. Finally, the use of these DIF tests is illustrated via application to two real data sets.
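As background for this abstract, the sketch below computes the classical Mantel-Haenszel common odds ratio for a single item across score strata. It covers only the standard single-item estimator, not the generalized estimator or the covariance terms proposed in the article. The function name, the table layout, and the example counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(tables):
    """Classical Mantel-Haenszel common odds ratio for one item.

    `tables` holds one 2x2 table per matching stratum (e.g. total-score
    level), each given as (a, b, c, d):
        a = reference group, correct     b = reference group, incorrect
        c = focal group, correct         d = focal group, incorrect
    A value near 1 suggests no uniform DIF on this item.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty strata
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts for two score strata.
no_dif = mantel_haenszel_odds_ratio([(10, 10, 10, 10), (30, 10, 15, 5)])
with_dif = mantel_haenszel_odds_ratio([(20, 10, 10, 20)])
```

With these toy counts, `no_dif` equals 1 because the odds of a correct response match across groups in every stratum, while `with_dif` is about 4.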

Citations: 0

Detecting Cheating in Large-Scale Assessment: The Transfer of Detectors to New Tests.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-10-01 Epub Date: 2022-11-04 DOI: 10.1177/00131644221132723

Jochen Ranger, Nico Schmidt, Anett Wolgast

Recent approaches to the detection of cheaters in tests employ detectors from the field of machine learning. Detectors based on supervised learning algorithms achieve high accuracy but require labeled data sets with identified cheaters for training. Labeled data sets are usually not available at an early stage of the assessment period. In this article, we discuss the approach of adapting a detector that was trained previously with a labeled training data set to a new unlabeled data set. The training and the new data set may contain data from different tests. The adaptation of detectors to new data or tasks is known as transfer learning in the field of machine learning. We first discuss the conditions under which a detector of cheating can be transferred. We then investigate whether the conditions are met in a real data set. We finally evaluate the benefits of transferring a detector of cheating. We find that a transferred detector has higher accuracy than an unsupervised detector of cheating. A naive transfer that consists of a simple reuse of the detector increases the accuracy considerably. A transfer via a self-labeling (SETRED) algorithm increases the accuracy slightly more than the naive transfer. The findings suggest that the detection of cheating might be improved by using existing detectors of cheating at an early stage of an assessment period.

Educational and Psychological Measurement, 83(5), 1033-1058. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470164/pdf/

Citations: 0

Multimodal Data Fusion to Detect Preknowledge Test-Taking Behavior Using Machine Learning

Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2023-09-19 DOI: 10.1177/00131644231193625

Kaiwen Man

In various fields, including college admission, medical board certifications, and military recruitment, high-stakes decisions are frequently made based on scores obtained from large-scale assessments. These decisions necessitate precise and reliable scores that enable valid inferences to be drawn about test-takers. However, the ability of such tests to provide reliable, accurate inference on a test-taker’s performance could be jeopardized by aberrant test-taking practices, for instance, practicing real items prior to the test. As a result, it is crucial for administrators of such assessments to develop strategies that detect potential aberrant test-takers after data collection. The aim of this study is to explore the implementation of machine learning methods in combination with multimodal data fusion strategies that integrate bio-information technology, such as eye-tracking, and psychometric measures, including response times and item responses, to detect aberrant test-taking behaviors in technology-assisted remote testing settings.

Citations: 0
