
Applied Measurement in Education: Latest Articles

Maintaining Score Scales Over Time: A Comparison of Five Scoring Methods
IF 1.5 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2023-01-02 DOI: 10.1080/08957347.2023.2172015
S. Y. Kim, Won‐Chan Lee
ABSTRACT This study evaluates various scoring methods, including number-correct scoring, IRT theta scoring, and hybrid scoring, in terms of scale-score stability over time. A simulation study was conducted to examine the relative performance of five scoring methods in preserving the first two moments (mean and standard deviation) of a population's scale scores across a chain of linkings with multiple test forms. Simulation factors included 1) the number of forms linked back to the initial form, 2) the pattern of mean shift, and 3) the proportion of common items. Results showed that scoring methods that operate on number-correct scores generally outperform those based on IRT proficiency estimators in reproducing the mean and standard deviation of scale scores. The relative performance of the scoring methods also depended on the pattern of change in group proficiency.
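As background for "preserving the first two moments of scale scores", a linear raw-to-scale conversion can be set so that a base population's scale scores have a chosen mean and SD. The sketch below is only that baseline idea, not one of the five scoring methods compared in the study; the target values (100 and 15), the sample raw scores, and the function name are hypothetical.

```python
import statistics

def linear_scale_conversion(raw_scores, target_mean, target_sd):
    """Return a function mapping raw scores to scale scores so that, for
    this (hypothetical) base population, the scale scores have exactly
    target_mean and target_sd (population SD)."""
    mu = statistics.fmean(raw_scores)
    sigma = statistics.pstdev(raw_scores)
    if sigma == 0:
        raise ValueError("raw scores have zero variance")
    slope = target_sd / sigma
    intercept = target_mean - slope * mu
    return lambda x: slope * x + intercept

# Hypothetical number-correct scores for a base-form population
raw = [12, 15, 18, 20, 25]
to_scale = linear_scale_conversion(raw, target_mean=100.0, target_sd=15.0)
scaled = [to_scale(x) for x in raw]
```

Scale drift over a linking chain shows up when later forms, after linking, no longer reproduce these moments for an equivalent population, which is the kind of stability the study evaluates.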
Applied Measurement in Education, vol. 36, no. 1, pp. 60-79.
Citations: 0
Accuracy and Sensitivity of Coefficient Alpha and Its Alternatives with Unidimensional and Contaminated Scales
IF 1.5 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2023-01-02 DOI: 10.1080/08957347.2023.2172016
Leifeng Xiao, K. Hau
ABSTRACT We compared coefficient alpha with five alternatives (omega total, omega RT, omega h, GLB, and coefficient H) in two simulation studies. For unidimensional scales, results showed that (a) all indices except omega h performed similarly well in most conditions; (b) alpha still performed well; (c) GLB and coefficient H overestimated reliability with small samples and short scales; and (d) sensitivity to scale quality decreased with longer scales. For contaminated scales, (a) all indices except omega h were reasonably unbiased when contamination was not severe; (b) alpha, omega total, and GLB were more sensitive in detecting contamination in shorter scales, whereas omega RT and omega h were not; and (c) coefficient H failed to detect contaminated items among high-quality items. For applied researchers, (a) supplementary information on scale characteristics helps in choosing the appropriate index; (b) comparing different scales against a single gold standard is inappropriate; and (c) omega h should not be used alone.
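For reference, coefficient alpha (the baseline index in this comparison) follows the standard formula alpha = k/(k-1) * (1 - sum of item variances / total-score variance). A minimal sketch using sample variances is shown below; the function name and the small data set are illustrative, and the omega, GLB, and H alternatives require a fitted factor model or optimization and are not shown.

```python
import statistics

def coefficient_alpha(item_scores):
    """Coefficient (Cronbach's) alpha.

    item_scores: list of examinee rows, each a list of k item scores.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across examinees
    item_vars = [statistics.variance([row[j] for row in item_scores]) for j in range(k)]
    # Sample variance of examinees' total scores
    total_var = statistics.variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 4 examinees x 3 items
data = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4]]
alpha = coefficient_alpha(data)
```

Because alpha depends only on item and total variances, it equals reliability only under (essentially) tau-equivalent items, which is one reason the study compares it against model-based alternatives.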
Applied Measurement in Education, vol. 36, no. 1, pp. 31-44.
Citations: 1
Using Bayesian Networks for Cognitive Assessment of Student Understanding of Buoyancy: A Granular Hierarchy Model
IF 1.5 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2023-01-02 DOI: 10.1080/08957347.2023.2172014
L. Wang, Sun Xiao Jian, Yan Lou Liu, Tao Xin
ABSTRACT This paper develops a cognitive diagnostic assessment based on Bayesian networks (BN) to evaluate student understanding of the physical concept of buoyancy. We propose a three-order granular-hierarchy BN model that accounts for both fine-grained attributes and high-level proficiencies. Conditional independence in the BN structure is tested and used to validate the proposed model. The proficiency relationships are verified, and the initial Q-matrix is refined. An optimized granular-hierarchy model is then constructed from the updated Q-matrix. All variants of the constructed models are evaluated on prediction accuracy and goodness-of-fit tests. The experimental results show that the optimized granular-hierarchy model has the best prediction and model-fit performance. In general, the BN method not only provides a more flexible modeling approach but also helps validate or refine the proficiency model and the Q-matrix, giving it a unique advantage in cognitive diagnosis.
Applied Measurement in Education, vol. 36, no. 1, pp. 45-59.
Citations: 0
Are Large Admissions Test Coaching Effects Widespread? A Longitudinal Analysis of Admissions Test Scores
IF 1.5 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2023-01-02 DOI: 10.1080/08957347.2023.2172018
Jeffrey A. Dahlke, P. Sackett, N. Kuncel
ABSTRACT We examine longitudinal data from 120,384 students who took a version of the PSAT/SAT in the 9th, 10th, 11th, and 12th grades. We investigate score changes over time and show that socioeconomic status (SES) is related to the degree of score improvement. We note that the 9th and 10th grade PSAT are low-stakes tests, while the operational SAT is a high-stakes test. We posit that investments in coaching would be uncommon for early PSAT administrations, and would be concentrated on efforts to prepare for the operational SAT. We compare score improvements between 9th and 10th grade with improvements between 10th and 12th grade, examining results separately by level of SES. We find similar levels of score improvement in low-stakes and high-stakes settings, with 3.4% of high-SES and 1.1% of low-SES students showing larger-than-expected score improvements, which is inconsistent with claims that high-SES students have routine access to highly effective coaching.
Applied Measurement in Education, vol. 36, no. 1, pp. 1-13.
Citations: 0
Dissecting knowledge, guessing, and blunder in multiple choice assessments
IF 1.1 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2023-01-01 Epub Date: 2023-02-21 DOI: 10.1080/08957347.2023.2172017
Rashid M Abu-Ghazalah, David N Dubins, Gregory M K Poon

Multiple-choice (MC) results are inherently probabilistic outcomes: correct responses reflect a combination of knowledge and guessing, while incorrect responses additionally reflect blunder, a confidently committed mistake. To objectively resolve knowledge from responses in an MC test structure, we evaluated probabilistic models that explicitly account for guessing, knowledge, and blunder using eight assessments (>9,000 responses) from an undergraduate biotechnology curriculum. A Bayesian implementation of the models, aimed at assessing their robustness to prior beliefs about examinee knowledge, showed that explicit estimators of knowledge are markedly sensitive to prior beliefs when scores are the sole input. To overcome this limitation, we examined self-ranked confidence as a proxy indicator of knowledge. For our test set, three levels of confidence resolved test performance. Responses rated as least confident were correct more often than expected from random selection, reflecting partial knowledge, but this was balanced by blunders among the most confident responses. By translating evidence-based guessing and blunder rates into pass marks that statistically qualify a desired level of examinee knowledge, our approach has practical utility in test analysis and design.
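To make the knowledge/guessing/blunder decomposition concrete, here is a deliberately simple illustrative model, an assumption and not the authors' Bayesian models: an examinee knows a fraction `knowledge` of the items, answers known items correctly unless they blunder (rate `blunder_rate`), and guesses uniformly among `n_options` on unknown items. Under that assumption, a pass mark tied to a target knowledge level is just the expected proportion correct, and an observed proportion can be inverted back to an implied knowledge level. All names and numbers are hypothetical.

```python
def p_correct(knowledge, blunder_rate, n_options):
    """Expected proportion correct under the assumed model:
    known items are correct unless blundered; unknown items are
    guessed uniformly among n_options."""
    guess = 1.0 / n_options
    return knowledge * (1.0 - blunder_rate) + (1.0 - knowledge) * guess

def knowledge_from_score(prop_correct, blunder_rate, n_options):
    """Invert p_correct to recover the implied knowledge fraction.
    Undefined when (1 - blunder_rate) equals the guessing rate."""
    guess = 1.0 / n_options
    denom = (1.0 - blunder_rate) - guess
    if denom == 0:
        raise ValueError("knowledge is not identifiable for these rates")
    return (prop_correct - guess) / denom

# Hypothetical pass mark: expected proportion correct for an examinee who
# knows 60% of the material on a 4-option test with a 5% blunder rate.
pass_mark = p_correct(0.60, 0.05, 4)
```

Here the pass mark works out to 0.57 + 0.10 = 0.67. Raising the blunder rate lowers the pass mark needed to certify the same knowledge level, which is the kind of trade-off the abstract describes; a real analysis would also account for sampling variability in the observed score.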

Applied Measurement in Education, vol. 36, no. 1, pp. 80-98. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10201919/pdf/
Citations: 0
Personality Aspects and the Underprediction of Women’s Academic Performance
IF 1.5 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2022-10-02 DOI: 10.1080/08957347.2022.2155652
You Zhou, P. Sackett, Thomas Brothen
ABSTRACT Using a sample of 5,400 college students, we sought to replicate prior findings that admissions tests’ underprediction of female college performance was driven in part by the omission of Big 5 personality factors from the predictive model. We investigated gender differences in an elaborated model that subdivides the Big 5 into ten aspects. We found differences at the aspect level that were not found at the factor level, and some aspects had unique relationships with academic outcomes. The findings demonstrate the effect of omitted variables on predictive bias.
Applied Measurement in Education, vol. 35, pp. 287-299.
Citations: 0
An Examination of Individual Ability Estimation and Classification Accuracy Under Rapid Guessing Misidentifications
IF 1.5 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2022-10-02 DOI: 10.1080/08957347.2022.2155653
Joseph A. Rios
ABSTRACT To mitigate the deleterious effects of rapid guessing (RG) on ability estimates, several rescoring procedures have been proposed. Many of these procedures assume that RG is accurately identified. At present, few studies have examined the utility of rescoring approaches when RG is misclassified and individual scores are reported. To address this gap, the present simulation study investigates the effect of RG misclassifications on the bias of individual examinees' ability estimates and on classification accuracy when effort-moderated (EM) scoring is used. This was accomplished by manipulating simulee ability level, RG rate, and misclassification type and percentage. Results showed that EM scoring significantly improved ability inferences for examinees engaging in RG; however, the effectiveness of this approach depended largely on misclassification type. Specifically, across ability levels, bias tended on average to be lower when effortful responses were falsely classified as RG. Although EM scoring reduced bias, it was susceptible to elevated false-positive ability classifications under high RG.
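In effort-moderated scoring, responses flagged as rapid guesses are modeled with a constant (ability-independent) success probability, so they drop out of the ability likelihood; in practice this is equivalent to scoring only the non-flagged responses. The sketch below shows that idea with a maximum-likelihood estimate under a Rasch model with known item difficulties. The Rasch choice, the Newton-Raphson solver, and the function name are assumptions for brevity; the abstract does not specify the IRT model used.

```python
import math

def em_rasch_theta(responses, difficulties, rg_flags, max_iter=50, tol=1e-8):
    """Effort-moderated ML ability estimate under a Rasch model (sketch).

    Responses flagged as rapid guesses (rg_flags[i] is True) are excluded
    from the likelihood, i.e. treated as if the item were not administered.
    """
    kept = [(u, b) for u, b, rg in zip(responses, difficulties, rg_flags) if not rg]
    n_correct = sum(u for u, _ in kept)
    if not kept or n_correct in (0, len(kept)):
        raise ValueError("ML estimate undefined for empty, all-correct, or all-incorrect patterns")
    theta = 0.0
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for _, b in kept]
        gradient = sum(u - p for (u, _), p in zip(kept, probs))  # d logL / d theta
        information = sum(p * (1.0 - p) for p in probs)          # -d2 logL / d theta2
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta
```

For example, with three items of difficulty 0 and responses 1, 0, 0, standard ML gives ln(1/2), about -0.69; if the last (incorrect) response is flagged as a rapid guess, the EM estimate moves to 0. Flagging an effortful response by mistake simply discards information, whereas missing a true rapid guess keeps a noise-driven response in the likelihood, which is consistent with the asymmetry the study reports.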
Applied Measurement in Education, vol. 35, pp. 300-312.
Citations: 0
Comparison of Methods for Identifying Differential Step Functioning with Polytomous Item Response Data
IF 1.5 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2022-10-02 DOI: 10.1080/08957347.2022.2155650
Holmes W. Finch
ABSTRACT Much research has been devoted to identifying differential item functioning (DIF), which occurs when item responses of individuals from two groups differ even after conditioning on the latent trait measured by the scale. Less work has examined differential step functioning (DSF), which is present in polytomous items when the conditional likelihood of responding in specific categories differs between groups. DSF affects estimation of the measured trait and reduces the effectiveness of standard DIF detection methods. The purpose of this simulation study was to extend earlier work by comparing several methods for detecting DSF in polytomous items, including an approach based on lasso estimation of the generalized partial credit model (GPCM). Results show that the lasso GPCM technique controlled the Type I error rate while yielding power somewhat lower than that of logistic regression and the MIMIC model, both of which failed to control the Type I error rate in some conditions.
An empirical example is also presented, and implications of this study for practice are discussed.
Applied Measurement in Education, vol. 35, pp. 255-271.
Citations: 0
Performance Decline as an Indicator of Generalized Test-Taking Disengagement
IF 1.5 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2022-10-02 DOI: 10.1080/08957347.2022.2155651
S. Wise, G. Kingsbury
ABSTRACT In achievement testing, we assume that students will demonstrate their maximum performance as they encounter test items. Sometimes, however, student performance declines during a test event, which implies that the test score does not represent maximum performance. This study describes a method for identifying significant performance decline and investigates its utility as an indicator of generalized test-taking disengagement. Analysis of data from a computerized adaptive interim achievement test showed that performance-decline classifications exhibited characteristics similar to those of disengagement classifications based on rapid guessing. More importantly, performance decline identified disengagement in many students who would not have been identified as disengaged based on rapid-guessing behavior.
Applied Measurement in Education, vol. 35, pp. 272-286.
Citations: 3
When Should Individual Ability Estimates Be Reported if Rapid Guessing Is Present?
IF 1.5 Zone 4 (Education) Q3 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2022-07-26 DOI: 10.1080/08957347.2022.2103138
Joseph A. Rios

ABSTRACT

Testing programs must decide whether to report individual scores for examinees who have engaged in rapid guessing (RG). As noted by the Standards for Educational and Psychological Testing, this decision should be based on a documented criterion for score exclusion. To this end, a number of heuristic criteria (e.g., exclude all examinees with RG rates of 10%) have been adopted in the literature. Because these criteria lack strong methodological support, the objective of this simulation study was to evaluate their appropriateness in terms of individual ability estimation and classification accuracy while manipulating both assessment and RG characteristics. The findings provide evidence that applying a common criterion to all examinees may be an ineffective strategy, because a given RG percentage can bias scores to differing degrees depending on test difficulty, examinee ability, and RG pattern. These results suggest that practitioners may benefit from establishing context-specific exclusion criteria that consider test purpose, score use, and targeted examinee trait levels.
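The single fixed heuristic the study evaluates can be sketched as a two-step rule: flag responses faster than a response-time threshold as rapid guesses, then withhold the score if the examinee's RG rate reaches a cutoff. This is illustrative only and represents the kind of one-size-fits-all criterion the study cautions against. The fixed 3-second threshold, reading the 10% cutoff as "at or above", and the function names are assumptions, not the study's procedure.

```python
def rg_rate(response_times, threshold_seconds):
    """Proportion of responses faster than a (hypothetical) fixed
    response-time threshold, i.e. flagged as rapid guesses."""
    if not response_times:
        raise ValueError("no responses")
    flags = [t < threshold_seconds for t in response_times]
    return sum(flags) / len(flags)

def should_report(response_times, threshold_seconds=3.0, max_rg_rate=0.10):
    """Hypothetical exclusion rule: report the individual score only if
    the examinee's RG rate is below max_rg_rate."""
    return rg_rate(response_times, threshold_seconds) < max_rg_rate

# Hypothetical response times (seconds) for a 10-item test
times = [1.2, 10.0, 12.0, 8.0, 9.0, 15.0, 11.0, 7.0, 20.0, 6.0]
```

One fast response out of ten gives an RG rate of 0.10, so this examinee's score would be withheld. The study's findings imply that the same 10% rate can matter more or less depending on test difficulty, examinee ability, and where the rapid guesses occur, which a fixed rule like this cannot capture.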

Citations: 1
Journal: Applied Measurement in Education
Copyright © 2023 Book学术 All rights reserved.