
Journal of Educational Measurement: Latest Publications

Theory-Driven IRT Modeling of Vocabulary Development: Matthew Effects and the Case for Unipolar IRT
IF 1.4 CAS Tier 4 (Psychology) Q3 PSYCHOLOGY, APPLIED Pub Date: 2025-04-11 DOI: 10.1111/jedm.12433
Qi (Helen) Huang, Daniel M. Bolt, Xiangyi Liao

Item response theory (IRT) encompasses a broader class of measurement models than is commonly appreciated by practitioners in educational measurement. For measures of vocabulary and its development, we show how psychological theory might in certain instances support unipolar IRT modeling as a superior alternative to the more traditional bipolar IRT models fit in practice. Although corresponding model choices make unipolar IRT statistically equivalent with bipolar IRT, adopting the unipolar approach substantially alters the resulting metric for proficiency. This shift can have substantial implications for educational research and practices that depend heavily on interval-level score interpretations. As an example, we illustrate through simulation how the perspective of unipolar IRT may account for inconsistencies seen across empirical studies in the observation (or lack thereof) of Matthew effects in reading/vocabulary development (i.e., growth being positively correlated with baseline proficiency), despite theoretical expectations for their presence. Additionally, a unipolar measurement perspective can reflect the anticipated diversification of vocabulary as proficiency level increases. Implications of unipolar IRT representations for constructing tests of vocabulary proficiency and evaluating measurement error are discussed.
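The statistical equivalence the abstract mentions can be made concrete with a minimal sketch. Assuming the log-logistic unipolar form discussed in the unipolar IRT literature (parameter values here are hypothetical), setting θ = ln η and λ = exp(−ab) makes the unipolar item response function identical to the bipolar 2PL, even though growth behaves very differently on the two proficiency metrics:

```python
import math

def p_2pl(theta, a, b):
    """Bipolar 2PL item response function on the usual theta metric."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_unipolar(eta, a, lam):
    """Log-logistic unipolar IRF: proficiency eta > 0 has a fixed zero point."""
    return lam * eta**a / (1.0 + lam * eta**a)

# Reparameterization: theta = ln(eta) and lam = exp(-a*b) give identical fit.
a, b = 1.2, 0.5
lam = math.exp(-a * b)
for eta in (0.25, 1.0, 4.0):
    assert abs(p_2pl(math.log(eta), a, b) - p_unipolar(eta, a, lam)) < 1e-9

# The metrics diverge for growth: a fixed gain d on theta multiplies eta
# by exp(d), so high-eta examinees show larger eta-gains -- a Matthew
# effect that equal theta-gains would hide.
```

Because η = exp(θ), equal-interval gains on one metric are unequal on the other, which is why interval-level score interpretations depend on the choice of metric.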

Citations: 0
Another Look at Yen's Q3: Is .2 an Appropriate Cut-Off?
IF 1.4 CAS Tier 4 (Psychology) Q3 PSYCHOLOGY, APPLIED Pub Date: 2025-04-04 DOI: 10.1111/jedm.12432
Kelsey Nason, Christine DeMars

This study examined the widely used threshold of .2 for Yen's Q3, an index for violations of local independence. Specifically, a simulation was conducted to investigate whether Q3 values were related to the magnitude of bias in estimates of reliability, item parameters, and examinee ability. Results showed that Q3 values below the typical cut-off yielded meaningful bias in estimates. Practical implications and limitations are discussed.
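For readers unfamiliar with the index: Q3 is simply the correlation between two items' residuals after conditioning on ability. A minimal sketch under an assumed 2PL model (item parameters and ability values here are placeholders):

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def q3(resp_i, resp_j, thetas, item_i, item_j):
    """Yen's Q3 for one item pair: the correlation between the two items'
    residuals u - P(theta), where item_i/item_j are (a, b) tuples."""
    ei = [u - p_2pl(t, *item_i) for u, t in zip(resp_i, thetas)]
    ej = [u - p_2pl(t, *item_j) for u, t in zip(resp_j, thetas)]
    n = len(ei)
    mi, mj = sum(ei) / n, sum(ej) / n
    cov = sum((x - mi) * (y - mj) for x, y in zip(ei, ej)) / n
    si = math.sqrt(sum((x - mi) ** 2 for x in ei) / n)
    sj = math.sqrt(sum((y - mj) ** 2 for y in ej) / n)
    return cov / (si * sj)
```

Under local independence the residual correlations hover near zero; the study's point is that values well below .2 can still accompany meaningful bias.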

Citations: 0
Influence of Intersectional Routing Modules between Dimensions on Measurement Precision in Multidimensional Multistage Testing
IF 1.6 CAS Tier 4 (Psychology) Q3 PSYCHOLOGY, APPLIED Pub Date: 2025-04-03 DOI: 10.1111/jedm.12430
Yi-Ling Wu, Yao-Hsuan Huang, Chia-Wen Chen, Po-Hsi Chen

Multistage testing (MST), a variant of computerized adaptive testing (CAT), differs from conventional CAT in that it is adapted at the module level rather than at the individual item level. Typically, all examinees begin the MST with a linear test form in the first stage, commonly known as the routing stage. In 2020, Han introduced an innovative concept known as Intersectional Routing (ISR), which allows module selection in the first stage of the MST based on the examinees' estimated scores. These scores were predicted using a variety of information, including background data and other correlated latent traits.

In this study, we extend Han's ISR framework to a multidimensional test comprising multiple unidimensional subtests. In a multidimensional test, the correlation coefficients between the latent traits can be estimated by fitting a multidimensional item response theory model. Our extension allows module selection in the first stage of each subtest to draw on information from all the other subtests via the known correlation matrix. Simulation results showed that our extension improved measurement compared with typical MST designs in conditions with moderate intercorrelations, across module designs. Practical insights are provided in an empirical analysis.
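The routing idea can be illustrated with a deliberately simplified sketch (not the authors' implementation): predict the target subtest's trait from the other subtests' estimates via the known correlations, then route to the first-stage module whose difficulty band contains the prediction. The prediction rule below assumes standardized, mutually uncorrelated predictor traits; the general case needs the inverse of the full correlation matrix:

```python
def predict_target_trait(other_thetas, rhos):
    """Predict the target dimension from other subtests' standardized trait
    estimates. With mutually uncorrelated predictors, the conditional mean
    under multivariate normality is sum(rho_k * theta_k)."""
    return sum(r * t for r, t in zip(rhos, other_thetas))

def route_module(pred_theta, cutpoints):
    """Return the first-stage module index: the number of routing
    cutpoints that the predicted trait exceeds."""
    return sum(pred_theta > c for c in sorted(cutpoints))
```

For example, with a single correlated subtest (ρ = .5) whose estimate is 1.0, the predicted trait is 0.5, which with cutpoints at ±0.5 routes to the middle module.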

Citations: 0
Using Keystroke Dynamics to Detect Nonoriginal Text
IF 1.6 CAS Tier 4 (Psychology) Q3 PSYCHOLOGY, APPLIED Pub Date: 2025-04-03 DOI: 10.1111/jedm.12431
Paul Deane, Mo Zhang, Jiangang Hao, Chen Li

Keystroke analysis has often been used for security purposes, most often to authenticate users and identify impostors. This paper examines the use of keystroke analysis to distinguish between the behavior of writers who are composing an original text and writers who are copying or otherwise reproducing a non-original text. Recent advances in text generation using large language models make the use of behavioral cues to identify plagiarism more pressing, since users seeking an advantage on a writing assessment may be able to submit unique AI-generated texts. We examine the use of keystroke log analysis to detect non-original text under three conditions: a laboratory study, where participants were either copying a known text or drafting an original essay, and two studies from operational assessments, where it was possible to identify essays that were non-original by reference to their content. Our results indicate that it is possible to achieve accuracies in excess of 94% under ideal conditions, where the nature of each writing session is known in advance, and greater than 89% in operational conditions, where proxies for non-original status, such as similarity to other submitted essays, must be used.
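A toy illustration of the kind of behavioral cue involved (the features, the 2-second pause definition, and the decision threshold here are hypothetical, not the paper's model): transcription tends to produce steady typing with few long planning pauses, while original composition shows bursts separated by pauses.

```python
def iki_features(timestamps):
    """Illustrative features from key-press timestamps (seconds):
    median inter-key interval and the share of long 'planning' pauses."""
    ikis = sorted(b - a for a, b in zip(timestamps, timestamps[1:]))
    n = len(ikis)
    median = ikis[n // 2] if n % 2 else (ikis[n // 2 - 1] + ikis[n // 2]) / 2
    long_pause_share = sum(i > 2.0 for i in ikis) / n
    return median, long_pause_share

def looks_copied(timestamps, pause_share_cut=0.05):
    """Toy decision rule (illustrative only): transcription shows few
    long planning pauses relative to original composition."""
    _, share = iki_features(timestamps)
    return share < pause_share_cut
```

An operational detector would feed many such features into a trained classifier rather than a single threshold.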

Citations: 0
A Comparison of Anchor Selection Strategies for DIF Analysis
IF 1.4 CAS Tier 4 (Psychology) Q3 PSYCHOLOGY, APPLIED Pub Date: 2025-03-20 DOI: 10.1111/jedm.12429
Haeju Lee, Kyung Yong Kim
When no prior information about differential item functioning (DIF) exists for the items in a test, either a rank-based or an iterative purification procedure might be preferred. Rank-based purification selects anchor items based on a preliminary DIF test. For the preliminary DIF test, likelihood ratio test (LRT) based approaches (e.g., all-others-as-anchors, AOAA, and one-item-anchor, OIA) and an improved version of Lord's Wald test (anchor-all-test-all, AATA) have been used in research studies. However, both LRT- and Wald-based procedures often select DIF items as anchor items and, as a result, inflate Type I error rates. To overcome this issue, selecting items with minimum test statistics (Min G²/χ²) or items with nonsignificant test statistics and large discrimination parameter estimates (NonsigMaxA) has been suggested in the literature. Nevertheless, little research has compared combinations of the three anchor selection procedures paired with the two anchor selection criteria. Thus, this study compared the performance of the six rank-based strategies in terms of accuracy, power, and Type I error rates. Among the rank-based strategies, the AOAA-based strategies demonstrated greater robustness across various conditions than the AATA- and OIA-based strategies. Additionally, the Min G²/χ² criterion outperformed the NonsigMaxA criterion across a variety of conditions.
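Once per-item DIF statistics are in hand, the minimum-statistic ranking criterion reduces to a simple selection rule. A sketch (the G² values would come from per-item likelihood ratio tests; the numbers below are placeholders):

```python
def select_anchors(g2_stats, n_anchors=4):
    """Rank-based anchor selection sketch: choose the items with the
    smallest DIF test statistics (the minimum-G2 criterion) as anchors.
    g2_stats[i] is the per-item G2 statistic from a preliminary DIF test."""
    ranked = sorted(range(len(g2_stats)), key=lambda i: g2_stats[i])
    return ranked[:n_anchors]
```

The alternative criterion discussed in the abstract instead restricts the pool to items with nonsignificant statistics and then prefers those with large discrimination estimates.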
Citations: 0
Modeling Missing Response Data in Item Response Theory: Addressing Missing Not at Random Mechanism with Monotone Missing Characteristics
IF 1.6 CAS Tier 4 (Psychology) Q3 PSYCHOLOGY, APPLIED Pub Date: 2025-02-24 DOI: 10.1111/jedm.12428
Jiwei Zhang, Jing Lu, Zhaoyuan Zhang

Item nonresponse frequently occurs in educational and psychological assessments and, if not appropriately handled, can undermine the reliability of the results. This study introduces a missing data model based on the missing not at random (MNAR) mechanism, incorporating a monotonic missingness assumption to capture individual-level missingness patterns and behavioral dynamics. Specifically, the cumulative number of missing indicators makes it possible to model the tendency of the current item to be missing given the previous missingness, which reduces the number of nuisance parameters needed for modeling the missing data mechanism. Two Bayesian model evaluation criteria were developed to distinguish between missing at random (MAR) and MNAR mechanisms by imposing specific parameter constraints. Additionally, the study introduces a highly efficient Bayesian slice sampling algorithm to estimate the model parameters. Four simulation studies show the performance of the proposed model, and an analysis of the PISA 2015 science data further illustrates the application of the proposed approach.
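The cumulative-indicator idea can be sketched as a logistic missingness model in which the propensity for the current item to be missing grows with the count of earlier missing responses (a simplified stand-in for the paper's full model; the parameter names are hypothetical):

```python
import math

def missingness_probs(miss_indicators, gamma0, gamma1):
    """For one examinee, compute P(item j is missing) as a logistic
    function of the cumulative count of earlier missing responses:
    logit P(m_j = 1) = gamma0 + gamma1 * sum(m_1, ..., m_{j-1})."""
    probs, cum = [], 0
    for m in miss_indicators:
        p = 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * cum)))
        probs.append(p)
        cum += m  # update the running count of observed missingness
    return probs
```

With gamma1 > 0 each additional earlier omission raises the propensity for the next one, which is the monotone pattern the model exploits to keep the number of nuisance parameters small.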

Citations: 0
The Vulnerability of AI-Based Scoring Systems to Gaming Strategies: A Case Study
IF 1.4 CAS Tier 4 (Psychology) Q3 PSYCHOLOGY, APPLIED Pub Date: 2025-02-20 DOI: 10.1111/jedm.12427
Peter Baldwin, Victoria Yaneva, Kai North, Le An Ha, Yiyun Zhou, Alex J. Mechaber, Brian E. Clauser

Recent developments in the use of large-language models have led to substantial improvements in the accuracy of content-based automated scoring of free-text responses. The reported accuracy levels suggest that automated systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of their performance warrant examination. In this study, we explore the potential for examinees to inflate their scores by gaming the ACTA automated scoring system. We explore a range of strategies including responding with words selected from the item stem and responding with multiple answers. These responses would be easily identified as incorrect by a human rater but may result in false-positive classifications from an automated system. Our results show that the rate at which these strategies produce responses that are scored as correct varied across items and across strategies but that several vulnerabilities exist.

Citations: 0
Using Item Parameter Predictions for Reducing Calibration Sample Requirements—A Case Study Based on a High-Stakes Admission Test
IF 1.6 Zone 4 Psychology Q3 PSYCHOLOGY, APPLIED Pub Date : 2025-02-16 DOI: 10.1111/jedm.12426
Esther Ulitzsch, Dmitry Belov, Oliver Lüdtke, Alexander Robitzsch

In item difficulty modeling (IDM), item parameters are predicted from the items' linguistic features, aiming to ultimately render item calibration redundant. Current IDM applications, however, commonly do not yield the required prediction accuracy. To immediately exploit even somewhat inaccurate IDM predictions, we blend IDM with established Bayesian estimation techniques. We propose a two-step approach where (1) IDM predictions are obtained and (2) employed to construct informative prior distributions. We evaluate the approach in a case study on small-sample calibration of the 3PL in a high-stakes test. First, concerning implementation, we find computationally efficient penalized maximum likelihood estimation to be comparable to the best-performing MCMC-based approach. Second, we investigate sample size reductions achievable with state-of-the-art IDM predictions, finding negligible gains compared to merely considering the historical distribution of parameters. Third, we evaluate the prediction accuracy required for a targeted sample size reduction by gradually increasing simulated IDM prediction accuracies. We find that required accuracies can counterbalance each other, allowing calibration sample size to be reduced when either high-quality item difficulty predictions or good predictions of item discriminations and pseudo-guessing parameters are available. We argue that these evaluations provide new, portable IDM benchmarks quantifying performance in terms of achievable sample size reductions.
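Step (2) of the two-step approach — turning an IDM prediction into an informative prior for penalized maximum likelihood — can be illustrated with a deliberately simplified sketch. It substitutes the Rasch model and a grid search for the paper's 3PL and estimation machinery, and all numbers (abilities, responses, the IDM-predicted difficulty `mu_idm`) are invented:

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def penalized_nll(b, thetas, responses, mu, sigma):
    """Negative log-likelihood plus the penalty implied by an
    informative prior b ~ N(mu, sigma^2) built from an IDM prediction."""
    ll = sum(
        math.log(rasch_p(t, b)) if x == 1 else math.log(1.0 - rasch_p(t, b))
        for t, x in zip(thetas, responses)
    )
    ll -= 0.5 * ((b - mu) / sigma) ** 2  # log-prior up to a constant
    return -ll

def map_estimate(thetas, responses, mu, sigma):
    """Crude grid search for the penalized-ML (MAP) difficulty estimate."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return min(grid, key=lambda b: penalized_nll(b, thetas, responses, mu, sigma))

# Tiny calibration sample: six examinees, mostly incorrect, so maximum
# likelihood alone pushes the difficulty estimate toward 1.
thetas = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
responses = [0, 0, 0, 1, 0, 1]
mu_idm = 0.5  # hypothetical IDM-predicted difficulty

b_tight = map_estimate(thetas, responses, mu_idm, sigma=0.1)   # trust the prediction
b_loose = map_estimate(thetas, responses, mu_idm, sigma=10.0)  # nearly pure ML
```

The contrast between `b_tight` and `b_loose` shows the mechanism the paper exploits: the more accurate (hence tighter) the IDM-based prior, the less response data is needed to stabilize the estimate, which is what permits smaller calibration samples.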

Citations: 0
Using Multilabel Neural Network to Score High-Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment
IF 1.4 Zone 4 Psychology Q3 PSYCHOLOGY, APPLIED Pub Date : 2025-01-14 DOI: 10.1111/jedm.12424
Shun-Fu Hu, Amery D. Wu, Jake Stone

Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or precision, to suit users' varying needs. These two objectives are illustrated with an example of scoring the short version of the College Majors Preference assessment (Short CMPA) to match the results of whether the 50 college majors would be in one's top three, as determined by the Long CMPA. The results reveal that MNN significantly outperforms the simple-sum ranking method (i.e., ranking the 50 majors' subscale scores) in targeting recall (.95 vs. .68) and precision (.53 vs. .38), while gaining an additional 3% in accuracy (.94 vs. .91). These findings suggest that, when executed properly, MNN can be a flexible and practical tool for scoring numerous traits and addressing various use foci.
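The idea of scoring the same responses to maximize different performance metrics can be sketched as threshold tuning on a multilabel network's predicted probabilities: a low cutoff favors recall, a high cutoff favors precision. The probabilities and labels below are invented and do not come from the CMPA, and the tuning loop is a generic sketch rather than the paper's procedure:

```python
def precision_recall(preds, labels):
    """Precision and recall for binary multilabel predictions."""
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def tune_threshold(probs, labels, metric):
    """Pick the cutoff on predicted label probabilities that maximizes
    the requested metric ('precision' or 'recall')."""
    best_t, best_v = 0.5, -1.0
    for t in [i / 20.0 for i in range(1, 20)]:
        preds = [p >= t for p in probs]
        prec, rec = precision_recall(preds, labels)
        value = prec if metric == "precision" else rec
        if value > best_v:
            best_t, best_v = t, value
    return best_t

# Hypothetical per-major probabilities from a multilabel network for one
# examinee; labels mark the majors actually in that examinee's top three.
probs = [0.91, 0.72, 0.55, 0.40, 0.22, 0.08]
labels = [True, True, True, False, False, False]

t_recall = tune_threshold(probs, labels, "recall")        # low cutoff
t_precision = tune_threshold(probs, labels, "precision")  # high cutoff
```

Nothing about the network changes between the two use foci — only the decision rule applied to its outputs, which is what lets one trained model serve users with different tolerance for misses versus false alarms.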

Citations: 0
Journal of Educational Measurement