
Journal of Educational Measurement: Latest Publications

Anchoring Validity Evidence for Automated Essay Scoring
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-05-15 | DOI: 10.1111/jedm.12336
Mark D. Shermis

One of the challenges of discussing validity arguments for machine scoring of essays centers on the absence of a commonly held definition and theory of good writing. At best, the algorithms attempt to measure select attributes of writing and calibrate them against human ratings with the goal of accurate prediction of scores for new essays. Sometimes these attributes are based on the fundamentals of writing (e.g., fluency), but quite often they are based on locally developed rubrics that may be confounded with specific content coverage expectations. This lack of transparency makes it difficult to provide systematic evidence that machine scoring is assessing writing rather than slices or correlates of writing performance.
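As a rough illustration of the calibration step the abstract refers to, the sketch below fits a few hypothetical surface features (word count, mean sentence length, type-token ratio) to human ratings and then scores a new essay. It is a minimal stand-in for the general workflow, not Shermis's system or any specific AES engine; all features, weights, and data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix for 200 calibration essays (all values simulated):
# word count, mean sentence length, type-token ratio.
n = 200
X = np.column_stack([
    rng.normal(300, 60, n),      # word count
    rng.normal(17, 3, n),        # mean sentence length
    rng.uniform(0.35, 0.65, n),  # type-token ratio
])

# Simulated human ratings on a 1-6 rubric, loosely driven by the features.
true_w = np.array([0.004, 0.05, 2.0])
human = np.clip(np.round(X @ true_w + rng.normal(0, 0.5, n) + 0.5), 1, 6)

# "Calibration": ordinary least squares of human ratings on the features.
Xd = np.column_stack([np.ones(n), X])
w, *_ = np.linalg.lstsq(Xd, human, rcond=None)

# Score a new (hypothetical) essay with the calibrated weights.
new_essay = np.array([1.0, 320.0, 18.5, 0.55])   # intercept + features
machine_score = float(np.clip(np.round(new_essay @ w), 1, 6))
print("predicted score:", machine_score)
```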

Citations: 2
Historical Perspectives on Score Comparability Issues Raised by Innovations in Testing
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-05-11 | DOI: 10.1111/jedm.12318
Peter Baldwin, Brian E. Clauser

While score comparability across test forms typically relies on common (or randomly equivalent) examinees or items, innovations in item formats, test delivery, and efforts to extend the range of score interpretation may require a special data collection before examinees or items can be used in this way—or may be incompatible with common examinee or item designs altogether. When comparisons are necessary under these nonroutine conditions, forms still must be connected by something, and this article focuses on these form-invariant connective somethings. A conceptual framework for thinking about the problem of score comparability in this way is given, followed by a description of three classes of connectives. Examples from the history of innovations in testing are given for each class.

Citations: 2
Recent Challenges to Maintaining Score Comparability: A Commentary
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-05-10 | DOI: 10.1111/jedm.12319
Neil J. Dorans, Shelby J. Haberman
{"title":"Recent Challenges to Maintaining Score Comparability: A Commentary","authors":"Neil J. Dorans,&nbsp;Shelby J. Haberman","doi":"10.1111/jedm.12319","DOIUrl":"10.1111/jedm.12319","url":null,"abstract":"","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45896082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Validating Performance Standards via Latent Class Analysis
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-05-05 | DOI: 10.1111/jedm.12325
Salih Binici, Ismail Cuhadar

Validity of performance standards is a key element for the defensibility of standard setting results, and validating performance standards requires collecting multiple pieces of evidence at every step during the standard setting process. This study employs a statistical procedure, latent class analysis, to set performance standards and compares latent class analysis results with previously established performance standards via the modified-Angoff method for cross-validation. The context of the study is an operational large-scale science assessment administered in one of the southern states in the United States. Results show that the number of classes that emerged in the latent class analysis concurs with the number of existing performance levels. In addition, there is a substantial level of agreement between latent class analysis results and modified-Angoff method in terms of classifying students into the same performance levels. Overall, the findings establish evidence for the validity of the performance standards identified via the modified-Angoff method. Practical implications of the study findings are discussed.
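The cross-validation step the abstract describes comes down to checking how often the latent-class solution and the modified-Angoff cut scores place students in the same performance level. The sketch below computes exact agreement and Cohen's kappa on simulated classifications; the number of levels, the simulated data, and the assumed 15% disagreement rate are illustrative only and are not values from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
levels, n = 4, 1000              # hypothetical number of performance levels / students

# Hypothetical classifications of the same students from the two procedures.
angoff = rng.integers(0, levels, n)
shifted = rng.random(n) < 0.15   # assume 15% of students classified differently by LCA
lca = np.where(shifted, rng.integers(0, levels, n), angoff)

exact = np.mean(lca == angoff)   # exact agreement rate

# Cohen's kappa from the confusion matrix.
cm = np.zeros((levels, levels))
for a, b in zip(angoff, lca):
    cm[a, b] += 1
po = np.trace(cm) / n                            # observed agreement
pe = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2    # chance agreement
kappa = (po - pe) / (1 - pe)
print(f"exact agreement {exact:.2f}, kappa {kappa:.2f}")
```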

Citations: 1
Score Comparability Issues with At-Home Testing and How to Address Them
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-05-04 | DOI: 10.1111/jedm.12324
Gautam Puhan, Sooyeon Kim

As a result of the COVID-19 pandemic, at-home testing has become a popular delivery mode in many testing programs. When programs offer at-home testing to expand their service, the score comparability between test takers testing remotely and those testing in a test center is critical. This article summarizes statistical procedures that could be used to evaluate potential mode effects at both the item level and the total score level. Using operational data from a licensure test, we also compared linking relationships between the test center and at-home testing groups to determine the reporting score conversion from a subpopulation invariance perspective.
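The abstract does not list the specific statistics used, but a basic mode-effect screen of the kind it describes can compare item proportions-correct and total-score summaries across delivery modes. The sketch below does this on simulated responses; the 0.05 flagging threshold, the simulated mode shift, and the sample sizes are assumptions for illustration, not values from the licensure data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, n_tc, n_ah = 40, 2000, 1500

p_true = rng.uniform(0.4, 0.9, n_items)             # hypothetical item difficulties
tc = rng.random((n_tc, n_items)) < p_true           # test-center responses
ah = rng.random((n_ah, n_items)) < (p_true - 0.02)  # small hypothetical mode shift

# Item level: difference in proportion correct between modes, flag large gaps.
p_tc, p_ah = tc.mean(axis=0), ah.mean(axis=0)
flagged = np.where(np.abs(p_tc - p_ah) > 0.05)[0]

# Total-score level: standardized mean difference between modes.
tot_tc, tot_ah = tc.sum(axis=1), ah.sum(axis=1)
pooled_sd = np.sqrt((tot_tc.var(ddof=1) + tot_ah.var(ddof=1)) / 2)
smd = (tot_tc.mean() - tot_ah.mean()) / pooled_sd
print("flagged items:", flagged, " total-score SMD:", round(smd, 3))
```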

Citations: 3
The Impact of Cheating on Score Comparability via Pool-Based IRT Pre-equating
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-05-01 | DOI: 10.1111/jedm.12321
Jinghua Liu, Kirk Becker

For any testing program that administers multiple forms across multiple years, maintaining score comparability via equating is essential. With continuous testing and high-stakes results, especially with less secure online administrations, testing programs must consider the potential for cheating on their exams. This study used empirical and simulated data to examine the impact of item exposure and prior knowledge on the estimation of item difficulty and test takers' ability via pool-based IRT pre-equating. Raw-to-theta transformations were derived from two groups of test takers with and without possible prior knowledge of exposed items, and these were compared to a criterion raw-to-theta transformation. Results indicated that item exposure has a large impact on item difficulty, not only altering the difficulty of exposed items, but also altering the difficulty of unexposed items. Item exposure makes test takers with prior knowledge appear more able. Further, theta estimation bias for test takers without prior knowledge increases when more test takers with possible prior knowledge are in the calibration population. Score inflation occurs for test takers with and without prior knowledge, especially for those with lower abilities.
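As a hedged illustration of the mechanism the abstract reports, not the study's pool-based pre-equating design, the sketch below simulates Rasch responses, lets 20% of the calibration sample answer ten exposed items correctly regardless of ability, and compares a crude difficulty index for exposed versus unexposed items. Sample sizes, the exposure rate, and the difficulty index are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_persons, n_items = 5000, 30
theta = rng.normal(0, 1, n_persons)
b = rng.normal(0, 1, n_items)                    # true item difficulties
exposed = np.zeros(n_items, dtype=bool)
exposed[:10] = True                              # first 10 items assumed compromised

p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))     # Rasch probabilities
resp = (rng.random((n_persons, n_items)) < p).astype(int)

# A fraction of test takers answers every exposed item correctly.
cheaters = rng.random(n_persons) < 0.20
resp[np.ix_(cheaters, exposed)] = 1

# Crude difficulty index: negative logit of the classical p-value per item.
pval = resp.mean(axis=0)
b_hat = -np.log(pval / (1 - pval))
print("mean shift, exposed items:  ", round((b_hat - b)[exposed].mean(), 2))
print("mean shift, unexposed items:", round((b_hat - b)[~exposed].mean(), 2))
```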

Citations: 3
Score Comparability between Online Proctored and In-Person Credentialing Exams
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-04-27 | DOI: 10.1111/jedm.12320
Paul Jones, Ye Tong, Jinghua Liu, Joshua Borglum, Vince Primoli

This article studied two methods to detect mode effects in two credentialing exams. In Study 1, we used a "modal scale comparison approach," where the same pool of items was calibrated separately, without transformation, within two test-center (TC) cohorts (TC1 and TC2) and one online-proctored (OP) cohort (OP1) matched on their pool-based scale score distributions. The calibrations from all three groups were used to score the TC2 cohort, designated the validation sample. The TC1 item parameters and TC1-based thetas and pass rates were more like the native TC2 values than the OP1-based values, indicating mode effects, but the score and pass/fail decision differences were small. In Study 2, we used a "cross-modal repeater approach" in which test takers who failed their first attempt in one modality took the test again in either the same or a different modality. The two pairs of repeater groups (TC → TC vs. TC → OP, and OP → OP vs. OP → TC) were matched exactly on their first-attempt scores. Results showed increased pass rates and greater score variability in all conditions involving OP, with mode effects noticeable in the TC → OP condition and, less strongly, in the OP → TC condition. Limitations of the study and implications for exam developers are discussed.
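A minimal sketch of the cross-modal repeater logic, on invented data: failing examinees retake the test in either mode, and second-attempt pass rates are compared within bins of the first-attempt score. The score scale, cut score, and simulated online gain are assumptions, not values from the credentialing programs studied.

```python
import numpy as np

rng = np.random.default_rng(4)
n, cut = 3000, 70

first = rng.normal(64, 4, n)                 # first-attempt scores (all failures)
mode2 = rng.random(n) < 0.5                  # True = online-proctored retake
second = first + rng.normal(4, 3, n) + 1.5 * mode2   # hypothetical extra gain online

# Within bins of rounded first-attempt score, compare second-attempt pass
# rates for the two retake modes, then average the gaps across bins.
bins = np.round(first)
gaps, weights = [], []
for s in np.unique(bins):
    tc = (bins == s) & ~mode2
    op = (bins == s) & mode2
    if tc.any() and op.any():
        gaps.append(np.mean(second[op] >= cut) - np.mean(second[tc] >= cut))
        weights.append(min(tc.sum(), op.sum()))
print("matched pass-rate gap (OP - TC):", round(np.average(gaps, weights=weights), 3))
```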

Citations: 4
Random Responders in the TIMSS 2015 Student Questionnaire: A Threat to Validity?
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-04-26 | DOI: 10.1111/jedm.12317
Saskia van Laar, Johan Braeken

The low-stakes character of international large-scale educational assessments implies that a participating student might at times provide unrelated answers, as if they were not even reading the items and instead chose a response option at random throughout. Depending on the severity of this invalid response behavior, interpretations of the assessment results are at risk of being invalidated. Not much is known about the prevalence or impact of such random responders in the context of international large-scale educational assessments. Following a mixture item response theory (IRT) approach, an initial investigation of both issues is conducted for the Confidence in and Value of Mathematics/Science (VoM/VoS) scales in the Trends in International Mathematics and Science Study (TIMSS) 2015 student questionnaire. We end with a call to facilitate further mapping of invalid response behavior in this context by including instructed response items and survey completion speed indicators in the assessments and by making sensitivity checks a habit in all secondary data studies.
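To make the threat concrete, the sketch below simulates two short Likert scales, replaces a growing share of respondents with uniform random answering, and tracks the resulting attenuation of the scale correlation. The scale lengths, response format, and contamination rates are illustrative assumptions, not TIMSS values, and the study's mixture-IRT approach is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 2000, 8                                # students, items per scale

trait = rng.normal(0, 1, n)

def scale(t):
    # 4-point Likert items loosely driven by a latent trait.
    raw = t[:, None] + rng.normal(0, 1, (n, k))
    return np.clip(np.digitize(raw, [-1, 0, 1]) + 1, 1, 4)

conf = scale(trait)                            # e.g., "confidence" scale
val = scale(0.6 * trait + 0.8 * rng.normal(0, 1, n))   # correlated "value" scale

for prop in (0.0, 0.1, 0.2):                  # share of random responders
    m = rng.random(n) < prop
    c, v = conf.copy(), val.copy()
    c[m] = rng.integers(1, 5, (m.sum(), k))   # uniform random option choice
    v[m] = rng.integers(1, 5, (m.sum(), k))
    r = np.corrcoef(c.sum(axis=1), v.sum(axis=1))[0, 1]
    print(f"{prop:.0%} random responders -> scale correlation {r:.2f}")
```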

Citations: 0
Detecting Differential Item Functioning Using Posterior Predictive Model Checking: A Comparison of Discrepancy Statistics
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-04-25 | DOI: 10.1111/jedm.12316
Seang-Hwane Joo, Philseok Lee

This study proposes a new Bayesian differential item functioning (DIF) detection method using posterior predictive model checking (PPMC). Item fit measures, including infit, outfit, the observed score distribution (OSD), and Q1, were used as discrepancy statistics for the PPMC DIF methods. The performance of the PPMC DIF method was evaluated via a Monte Carlo simulation manipulating sample size, DIF size, DIF type, DIF percentage, and subpopulation trait distribution. Parametric DIF methods, such as Lord's chi-square and Raju's area approaches, were also included in the simulation design to compare the performance of the proposed PPMC DIF methods with these existing approaches. Based on Type I error and power analysis, we found that PPMC DIF methods showed better-controlled Type I error rates than the existing methods and comparable power to detect uniform DIF. The implications and recommendations for applied researchers are discussed.
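The PPMC step itself is simple once posterior draws are available: recompute a discrepancy statistic on data replicated from each draw and compare it with the observed value. The sketch below does this for a single studied item using a simple group-difference statistic rather than the infit/outfit/OSD/Q1 measures examined in the article; the posterior draws are simulated stand-ins, not output from an actual MCMC fit.

```python
import numpy as np

rng = np.random.default_rng(6)
n_draws, n_persons = 500, 1000
group = rng.integers(0, 2, n_persons)      # 0 = reference, 1 = focal group
resp = rng.integers(0, 2, n_persons)       # observed responses to the studied item

# Hypothetical posterior draws (in practice taken from an MCMC fit of the IRT model).
theta = rng.normal(0, 1, (n_draws, n_persons))   # person parameters per draw
b = rng.normal(0, 0.1, n_draws)                  # studied item's difficulty per draw

def discrepancy(y, g):
    # Group difference in proportion correct: a simple DIF-flavored test statistic.
    return y[g == 1].mean() - y[g == 0].mean()

obs = discrepancy(resp, group)
rep = np.empty(n_draws)
for d in range(n_draws):
    p = 1.0 / (1.0 + np.exp(-(theta[d] - b[d])))     # Rasch probabilities
    y_rep = (rng.random(n_persons) < p).astype(int)  # replicated data for this draw
    rep[d] = discrepancy(y_rep, group)

ppp = np.mean(rep >= obs)                  # posterior predictive p-value
print("PPP-value:", round(ppp, 3))
```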

Citations: 1
Two IRT Characteristic Curve Linking Methods Weighted by Information
IF 1.3 | CAS Tier 4 (Psychology) | Q1 Psychology | Pub Date: 2022-04-17 | DOI: 10.1111/jedm.12315
Shaojie Wang, Minqiang Zhang, Won-Chan Lee, Feifei Huang, Zonglong Li, Yixing Li, Sufang Yu

Traditional IRT characteristic curve linking methods ignore parameter estimation errors, which may undermine the accuracy of estimated linking constants. Two new linking methods are proposed that take parameter estimation errors into account. The item-information-weighted (IWCC) and test-information-weighted characteristic curve (TWCC) methods weight the components of the traditional loss function by the corresponding item and test information, respectively. Monte Carlo simulation was conducted to evaluate the performances of the new linking methods and compare them with traditional ones. Ability difference between linking groups, sample size, and test length were manipulated under the common-item nonequivalent groups design. Results showed that the two information-weighted characteristic curve methods generally outperformed traditional methods. TWCC was found to be more accurate and stable than IWCC. A pseudo-form, pseudo-group analysis was also performed, and similar results were observed. Finally, guidelines for practice and future directions are discussed.
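A hedged sketch of what an information-weighted characteristic curve criterion can look like, written as a modification of the Stocking-Lord loss; the exact weighting and parameterization in the article may differ. Here T_X is the base-form test characteristic curve, T_Y* the transformed new-form curve under linking constants A and B, I(theta_q) the weighting information at quadrature point theta_q, and P_i a 3PL item response function.

```latex
% TWCC-style criterion (sketch): weight the squared TCC difference by test
% information at each quadrature point; the IWCC variant applies item-level
% information inside an item-wise sum instead.
F(A, B) = \sum_{q=1}^{Q} I(\theta_q)\,
          \Bigl[\, T_X(\theta_q) - T_Y^{*}(\theta_q; A, B) \,\Bigr]^{2},
\qquad
T_Y^{*}(\theta; A, B) = \sum_{i \in Y} P_i\!\bigl(\theta;\; a_i / A,\; A b_i + B,\; c_i\bigr).
```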

Citations: 0