
Journal of Educational and Behavioral Statistics: Latest Publications

Forced-Choice Ranking Models for Raters’ Ranking Data
IF 2.4 | CAS Tier 3 (Psychology) | Q2 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2022-07-07 | DOI: 10.3102/10769986221104207
Su-Pin Hung, Hung-Yu Huang
To address response style or bias in rating scales, forced-choice items are often used to request that respondents rank their attitudes or preferences among a limited set of options. The rating scales used by raters to render judgments on ratees’ performance also contribute to rater bias or errors; consequently, forced-choice items have recently been employed for raters to rate how a ratee performs in certain defined traits. This study develops forced-choice ranking models (FCRMs) for data analysis when performance is evaluated by external raters or experts in a forced-choice ranking format. The proposed FCRMs consider different degrees of raters’ leniency/severity when modeling the selection probability in the generalized unfolding item response theory framework. They include an additional topic facet when multiple tasks are evaluated and incorporate variations in leniency parameters to capture the interactions between ratees and raters. The simulation results indicate that the parameters of the new models can be satisfactorily recovered and that better parameter recovery is associated with more item blocks, larger sample sizes, and a complete ranking design. A technological creativity assessment is presented as an empirical example with which to demonstrate the applicability and implications of the new models.
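The kind of selection probability described here can be sketched as a toy Luce-style sequential ranking model, in which preference for a statement falls off with the unfolding distance between the ratee's latent trait and the statement's location, and a scale parameter stands in for rater leniency. The quadratic kernel and all names are illustrative assumptions, not the paper's exact FCRM specification:

```python
import numpy as np

def rank_probability(theta, deltas, ranking, tau=1.0):
    """Probability of one complete ranking of the statements in a block.

    Sequential (Luce-style) choice: at each stage the most preferred
    remaining statement is selected with softmax probability, where
    preference falls off with the squared unfolding distance between
    the latent trait `theta` and each statement location in `deltas`;
    `tau` stands in for a rater leniency/scale parameter.  The
    quadratic kernel and all names are illustrative, not the exact FCRM.
    """
    utilities = -((theta - np.asarray(deltas, float)) ** 2) / tau
    remaining = list(range(len(deltas)))
    prob = 1.0
    for chosen in ranking:
        weights = np.exp([utilities[j] for j in remaining])
        prob *= weights[remaining.index(chosen)] / weights.sum()
        remaining.remove(chosen)
    return prob
```

Because each stage is a proper softmax over the remaining statements, the probabilities of all possible rankings of a block sum to one.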
Journal of Educational and Behavioral Statistics, 47(1), 603–634.
Citations: 1
Assessing Inter-rater Reliability With Heterogeneous Variance Components Models: Flexible Approach Accounting for Contextual Variables
IF 2.4 | CAS Tier 3 (Psychology) | Q2 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2022-07-05 | DOI: 10.3102/10769986221150517
Patrícia Martinková, František Bartoš, M. Brabec
Inter-rater reliability (IRR), which is a prerequisite of high-quality ratings and assessments, may be affected by contextual variables, such as the rater’s or ratee’s gender, major, or experience. Identification of such heterogeneity sources in IRR is important for the implementation of policies with the potential to decrease measurement error and to increase IRR by focusing on the most relevant subgroups. In this study, we propose a flexible approach for assessing IRR in cases of heterogeneity due to covariates by directly modeling differences in variance components. We use Bayes factors (BFs) to select the best performing model, and we suggest using Bayesian model averaging as an alternative approach for obtaining IRR and variance component estimates, allowing us to account for model uncertainty. We use inclusion BFs considering the whole model space to provide evidence for or against differences in variance components due to covariates. The proposed method is compared with other Bayesian and frequentist approaches in a simulation study, and we demonstrate its superiority in some situations. Finally, we provide real data examples from grant proposal peer review, demonstrating the usefulness of this method and its flexibility in the generalization of more complex designs.
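The model-averaging step can be illustrated with a minimal sketch: compute an ICC-style IRR from variance components, then average the IRR estimates of candidate models weighted by their posterior model probabilities. Both functions below are simplified assumptions for illustration, not the authors' estimators:

```python
import numpy as np

def irr_from_components(sigma2_ratee, sigma2_rater, sigma2_error):
    """IRR as the share of total rating variance due to true ratee
    differences, computed from variance components (an ICC-style
    quantity; illustrative, not the paper's exact estimand)."""
    return sigma2_ratee / (sigma2_ratee + sigma2_rater + sigma2_error)

def model_averaged_irr(irr_by_model, posterior_model_probs):
    """Bayesian-model-averaged IRR: weight each candidate model's IRR
    by its posterior model probability, so the final estimate reflects
    uncertainty about which variance components differ across
    subgroups."""
    return float(np.dot(irr_by_model, posterior_model_probs))
```

In the paper's setup the candidate models differ in which covariates (e.g., rater gender or experience) are allowed to shift the variance components; the averaged estimate then accounts for model uncertainty.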
Journal of Educational and Behavioral Statistics, 48(1), 349–383.
Citations: 3
Pooling Interactions Into Error Terms in Multisite Experiments
IF 2.4 | CAS Tier 3 (Psychology) | Q2 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2022-07-04 | DOI: 10.3102/10769986221104800
Wendy Chan, L. Hedges
Multisite field experiments using the (generalized) randomized block design that assign treatments to individuals within sites are common in education and the social sciences. Under this design, there are two possible estimands of interest and they differ based on whether sites or blocks have fixed or random effects. When the average treatment effect is assumed to be identical across sites, it is common to omit site by treatment interactions and “pool” them into the error term in classical experimental design. However, prior work has not addressed the consequences of pooling when site by treatment interactions are not zero. This study assesses the impact of pooling on inference in the presence of nonzero site by treatment interactions. We derive the small sample distributions of the test statistics for treatment effects under pooling and illustrate the impacts on rejection rates when interactions are not zero. We use the results to offer recommendations to researchers conducting studies based on the multisite design.
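A small Monte Carlo sketch of the phenomenon under study: with the true average treatment effect set to zero, pooling the site-by-treatment interaction sum of squares into the error term keeps the type I error near the nominal level when interactions are absent, but inflates it when they are not. This toy randomized block simulation (with the normal critical value 1.96 in place of the exact t quantile) is an illustration of the issue, not the paper's small-sample derivation:

```python
import numpy as np

def rejection_rate(n_sites=20, n_per_cell=5, interaction_sd=0.0,
                   n_reps=500, seed=1):
    """Monte Carlo type I error of the treatment-effect test when
    site-by-treatment interactions are pooled into the error term.
    The true average treatment effect is zero; `interaction_sd`
    controls how strongly site-specific effects deviate from a common
    effect.  Toy sketch, not the paper's derivation."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        site_eff = rng.normal(0.0, 1.0, n_sites)
        inter = rng.normal(0.0, interaction_sd, n_sites)  # site-by-treatment effects
        yc = site_eff[:, None] + rng.normal(0.0, 1.0, (n_sites, n_per_cell))
        yt = (site_eff + inter)[:, None] + rng.normal(0.0, 1.0, (n_sites, n_per_cell))
        est = yt.mean() - yc.mean()                       # estimated average effect
        # Pooled error term: within-cell SS plus the interaction SS
        within_ss = ((yc - yc.mean(axis=1, keepdims=True)) ** 2).sum() + \
                    ((yt - yt.mean(axis=1, keepdims=True)) ** 2).sum()
        cell = np.stack([yc.mean(axis=1), yt.mean(axis=1)], axis=1)
        inter_dev = (cell - cell.mean(axis=1, keepdims=True)
                          - cell.mean(axis=0, keepdims=True) + cell.mean())
        inter_ss = n_per_cell * (inter_dev ** 2).sum()
        df = 2 * n_sites * (n_per_cell - 1) + (n_sites - 1)
        s2 = (within_ss + inter_ss) / df
        se = np.sqrt(2.0 * s2 / (n_sites * n_per_cell))
        rejections += abs(est / se) > 1.96                # normal approx. to t
    return rejections / n_reps
```

With `interaction_sd=0.0` the rate stays near 5%; with `interaction_sd=1.0` it is markedly inflated, because the pooled error term understates the variance of the estimated average effect.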
Journal of Educational and Behavioral Statistics, 47(1), 639–665.
Citations: 1
Improving Accuracy and Stability of Aggregate Student Growth Measures Using Empirical Best Linear Prediction
IF 2.4 | CAS Tier 3 (Psychology) | Q2 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2022-06-27 | DOI: 10.3102/10769986221101624
J. R. Lockwood, K. Castellano, D. McCaffrey
Many states and school districts in the United States use standardized test scores to compute annual measures of student achievement progress and then use school-level averages of these growth measures for various reporting and diagnostic purposes. These aggregate growth measures can vary consequentially from year to year for the same school, complicating their use and interpretation. We develop a method, based on the theory of empirical best linear prediction, to improve the accuracy and stability of aggregate growth measures by pooling information across grades, years, and tested subjects for individual schools. We demonstrate the performance of the method using both simulation and application to 6 years of annual growth measures from a large, urban school district. We provide code for implementing the method in the package schoolgrowth for the R environment.
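The pooling idea behind empirical best linear prediction can be sketched in one dimension: shrink each noisy school average toward a precision-weighted grand mean, with a reliability weight that grows as the school's sampling variance shrinks. This is a simplified assumption-laden illustration; the schoolgrowth package pools jointly across grades, years, and subjects rather than toward a single grand mean:

```python
import numpy as np

def eblp_shrink(school_means, sampling_vars, tau2):
    """Shrink noisy school-level growth averages toward a
    precision-weighted grand mean, using the reliability weight
    lam = tau2 / (tau2 + v), where tau2 is the between-school variance
    and v each school's sampling variance.  One-dimensional
    empirical-Bayes-style sketch of the pooling idea only."""
    m = np.asarray(school_means, float)
    v = np.asarray(sampling_vars, float)
    grand = np.average(m, weights=1.0 / (v + tau2))  # precision-weighted grand mean
    lam = tau2 / (tau2 + v)                          # reliability weight per school
    return lam * m + (1.0 - lam) * grand
```

Schools measured with more noise (larger `sampling_vars`) are pulled harder toward the grand mean, which is what stabilizes the year-to-year aggregate measures.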
Journal of Educational and Behavioral Statistics, 47(1), 544–575.
Citations: 1
Speed–Accuracy Trade-Off? Not So Fast: Marginal Changes in Speed Have Inconsistent Relationships With Accuracy in Real-World Settings
IF 2.4 | CAS Tier 3 (Psychology) | Q2 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2022-06-08 | DOI: 10.3102/10769986221099906
B. Domingue, K. Kanopka, B. Stenhaug, M. Sulik, Tanesia Beverly, Matthieu J. S. Brinkhuis, Ruhan Circi, Jessica Faul, Dandan Liao, Bruce McCandliss, Jelena Obradović, Chris Piech, Tenelle Porter, Project iLEAD Consortium, J. Soland, Jon Weeks, S. Wise, Jason D Yeatman
The speed–accuracy trade-off (SAT) suggests that time constraints reduce response accuracy. Its relevance in observational settings—where response time (RT) may not be constrained but respondent speed may still vary—is unclear. Using 29 data sets containing data from cognitive tasks, we use a flexible method for identification of the SAT (which we test in extensive simulation studies) to probe whether the SAT holds. We find inconsistent relationships between time and accuracy; marginal increases in time use for an individual do not necessarily predict increases in accuracy. Additionally, the speed–accuracy relationship may depend on the underlying difficulty of the interaction. We also consider the analysis of items and individuals; of particular interest is the observation that respondents who exhibit more within-person variation in response speed are typically of lower ability. We further find that RT is typically a weak predictor of response accuracy. Our findings document a range of empirical phenomena that should inform future modeling of RTs collected in observational settings.
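The question of whether marginal extra time predicts higher accuracy can be probed with per-person slopes: regress accuracy on within-person-centered log response time, one respondent at a time, and inspect the sign and spread of the slopes. This linear-probability sketch is an assumption for illustration, not the flexible identification method the authors use:

```python
import numpy as np

def within_person_slopes(rt, acc):
    """Per-respondent least-squares slope of accuracy (0/1) on
    within-person-centered log response time, for n_persons x n_items
    arrays of RTs and accuracies.  A positive slope means marginal
    extra time goes with higher accuracy for that person.
    Linear-probability sketch, not the authors' model."""
    x = np.log(np.asarray(rt, float))
    x = x - x.mean(axis=1, keepdims=True)   # center RT within person
    y = np.asarray(acc, float)
    y = y - y.mean(axis=1, keepdims=True)
    return (x * y).sum(axis=1) / (x ** 2).sum(axis=1)
```

Under the paper's finding, such slopes would be inconsistent in sign across people and tasks rather than uniformly negative as a strict SAT would suggest.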
Journal of Educational and Behavioral Statistics, 47(1), 576–602.
Citations: 2
What Is Actually Equated in “Test Equating”? A Didactic Note
IF 2.4 | CAS Tier 3 (Psychology) | Q2 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2022-06-01 | DOI: 10.3102/10769986211072308
Wim J. van der Linden
The current literature on test equating generally defines it as the process necessary to obtain score comparability between different test forms. The definition is in contrast with Lord’s foundational paper which viewed equating as the process required to obtain comparability of measurement scale between forms. The distinction between the notions of scale and score is not trivial. The difference is explained by connecting these notions with standard statistical concepts as probability experiment, sample space, and random variable. The probability experiment underlying equating test forms with random scores immediately gives us the equating transformation as a function mapping the scale of one form into the other and thus supports the point of view taken by Lord. However, both Lord’s view and the current literature appear to rely on the idea of an experiment with random examinees which implies a different notion of test scores. It is shown how an explicit choice between the two experiments is not just important for our theoretical understanding of key notions in test equating but also has important practical consequences.
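The view of equating as a function mapping one form's scale into the other can be made concrete with the standard equipercentile transformation, phi(x) = F_Y^{-1}(F_X(x)), estimated here from two raw score samples. This plain empirical sketch omits the smoothing and continuization used in practice:

```python
import numpy as np

def equipercentile_equate(scores_x, scores_y, x):
    """The equating transformation as a scale mapping:
    phi(x) = F_Y^{-1}(F_X(x)), sending a form-X score to the form-Y
    score with the same percentile rank.  Empirical sketch without
    smoothing/continuization."""
    p = (np.asarray(scores_x, float) <= x).mean()       # F_X(x)
    return float(np.quantile(np.asarray(scores_y, float), p))
```

Note that the transformation is defined between scales, in line with Lord's view; which probability experiment (random scores vs. random examinees) justifies estimating F_X and F_Y from samples is exactly the distinction the note examines.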
Journal of Educational and Behavioral Statistics, 47(1), 353–362.
Citations: 1
Two Statistical Tests for the Detection of Item Compromise
IF 2.4 | CAS Tier 3 (Psychology) | Q2 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2022-05-11 | DOI: 10.3102/10769986221094789
W. van der Linden
Two independent statistical tests of item compromise are presented, one based on the test takers’ responses and the other on their response times (RTs) on the same items. The tests can be used to monitor an item in real time during online continuous testing but are also applicable as part of post hoc forensic analysis. The two test statistics are simple intuitive quantities as the sum of the responses and RTs observed for the test takers on the item. Common features of the tests are ease of interpretation and computational simplicity. Both tests are uniformly most powerful under the assumption of known ability and speed parameters for the test takers. Examples of power functions for items with realistic parameter values suggest maximum power for 20–30 test takers with item preknowledge for the response-based test and 10–20 test takers for the RT-based test.
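The logic of the response-based statistic can be sketched as follows: with test takers' abilities treated as known, their 0/1 responses to the item are independent Bernoulli trials, so the sum of responses has a known null distribution and unusually many correct answers flags possible compromise. The normal standardization below is a simplifying assumption for illustration; the paper works with the raw sum and derives exact, uniformly most powerful tests:

```python
import numpy as np

def response_compromise_z(responses, p_correct):
    """Response-based compromise check: under known abilities the 0/1
    responses u_i are independent Bernoulli(p_i), so
        z = (sum u_i - sum p_i) / sqrt(sum p_i * (1 - p_i))
    is approximately standard normal; a large positive z flags
    unexpectedly many correct answers on the item.  Sketch of the idea
    only, using a normal approximation."""
    u = np.asarray(responses, float)
    p = np.asarray(p_correct, float)
    return float((u.sum() - p.sum()) / np.sqrt((p * (1.0 - p)).sum()))
```

An analogous statistic on the sum of (log) response times would flag suspiciously fast responding, giving the second, independent check.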
Journal of Educational and Behavioral Statistics, 47(1), 485–504.
Citations: 1
A Critical View on the NEAT Equating Design: Statistical Modeling and Identifiability Problems
IF 2.4 | CAS Tier 3 (Psychology) | Q2 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2022-04-29 | DOI: 10.3102/10769986221090609
Ernesto San Martín, Jorge González
The nonequivalent groups with anchor test (NEAT) design is widely used in test equating. Under this design, two groups of examinees are administered different test forms with each test form containing a subset of common items. Because test takers from different groups are assigned only one test form, missing score data emerge by design rendering some of the score distributions unavailable. The partially observed score data formally lead to an identifiability problem, which has not been recognized as such in the equating literature and has been considered from different perspectives, all of them making different assumptions in order to estimate the unidentified score distributions. In this article, we formally specify the statistical model underlying the NEAT design and unveil the lack of identifiability of the parameters of interest that compose the equating transformation. We use the theory of partial identification to show alternatives to traditional practices that have been proposed to identify the score distributions when conducting equating under the NEAT design.
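The partial-identification logic invoked here can be illustrated with the generic Fréchet–Hoeffding bounds: when data identify only the marginal distributions of a test score X and an anchor score A, the joint distribution (and any quantity built from it) is bounded rather than point-identified. These are textbook bounds used as an analogy, not the paper's specific bounds for the NEAT design:

```python
def frechet_bounds(Fx, Fa):
    """Frechet-Hoeffding bounds on a joint CDF value F_{X,A}(x, a)
    given only the marginal CDF values Fx = F_X(x) and Fa = F_A(a).
    Without further assumptions, the joint is only known to lie in
    [max(0, Fx + Fa - 1), min(Fx, Fa)] -- the partial-identification
    situation, in contrast to the point identification that traditional
    NEAT assumptions impose."""
    lower = max(0.0, Fx + Fa - 1.0)
    upper = min(Fx, Fa)
    return lower, upper
```

Traditional NEAT practices collapse such intervals to a point by assumption (e.g., invariance of conditional score distributions across groups); the paper's contribution is to make those identifying assumptions explicit and to study the bounds that remain without them.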
Journal of Educational and Behavioral Statistics, 47(1), 406–437.
Citations: 1
Statistical Inference for G-indices of Agreement
IF 2.4 | CAS Tier 3 (Psychology) | Q2 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2022-04-29 | DOI: 10.3102/10769986221088561
D. Bonett
The limitations of Cohen’s κ are reviewed and an alternative G-index is recommended for assessing nominal-scale agreement. Maximum likelihood estimates, standard errors, and confidence intervals for a two-rater G-index are derived for one-group and two-group designs. A new G-index of agreement for multirater designs is proposed. Statistical inference methods for some important special cases of the multirater design also are derived. G-index meta-analysis methods are proposed and can be used to combine and compare agreement across two or more populations. Closed-form sample-size formulas to achieve desired confidence interval precision are proposed for two-rater and multirater designs. R functions are given for all results.
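A minimal sketch of the two-rater G-index with a simple interval: G = (k·p_a − 1)/(k − 1), where p_a is the proportion of items the raters agree on and k the number of nominal categories, so chance agreement is fixed at 1/k rather than estimated from the rater margins as in Cohen's κ. The Wald-type interval below is a large-sample simplification, not the ML-based intervals the paper derives:

```python
from math import sqrt

def g_index_ci(n_agree, n_items, k=2, z=1.96):
    """Two-rater G-index with a Wald-type confidence interval.
    G = (k * p_a - 1) / (k - 1) with p_a the observed agreement
    proportion and k the number of response categories; chance
    agreement is fixed at 1/k.  Large-sample sketch only (the paper
    derives ML standard errors and closed-form sample-size formulas)."""
    pa = n_agree / n_items
    g = (k * pa - 1.0) / (k - 1.0)
    se = (k / (k - 1.0)) * sqrt(pa * (1.0 - pa) / n_items)   # delta-method SE
    return g, (g - z * se, g + z * se)
```

With k = 2 this reduces to G = 2·p_a − 1, ranging from −1 to 1 with 0 at chance agreement.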
Journal of Educational and Behavioral Statistics, 47(1), 438–458.
引用次数: 0
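To make the two-rater G-index concrete, here is a minimal Python sketch. The paper itself supplies R functions; this version, including the simple delta-method Wald interval, is an illustrative approximation written for this listing, not the authors' maximum-likelihood derivation.

```python
import math

def g_index(ratings_a, ratings_b, k):
    """Holley-Guilford G-index for two raters on a k-category nominal scale.
    G = (p_o - 1/k) / (1 - 1/k): observed agreement p_o, chance-corrected
    against a fixed 1/k baseline (unlike kappa's margin-dependent baseline)."""
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    return (p_o - 1.0 / k) / (1.0 - 1.0 / k)

def g_confidence_interval(ratings_a, ratings_b, k, z=1.96):
    """Wald-type interval: treats agreement as a binomial proportion and
    rescales by dG/dp_o = 1/(1 - 1/k); a rough stand-in for the ML interval."""
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    se_p = math.sqrt(p_o * (1.0 - p_o) / n)
    scale = 1.0 / (1.0 - 1.0 / k)
    g = (p_o - 1.0 / k) * scale
    half = z * se_p * scale
    return g, (g - half, g + half)
```

For two categories the formula reduces to G = 2·p_o − 1, so four agreements out of five ratings give G = 0.6.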
Latent Trait Item Response Models for Continuous Responses
IF 2.4 Tier 3, Psychology Q2 EDUCATION & EDUCATIONAL RESEARCH Pub Date : 2022-04-08 DOI: 10.3102/10769986231184147
G. Tutz, Pascal Jordan
A general framework of latent trait item response models for continuous responses is given. In contrast to classical test theory (CTT) models, which traditionally distinguish between true scores and error scores, the responses are linked directly to latent traits. It is shown that CTT models can be derived as special cases, but the model class is much wider. In particular, it provides appropriate modeling of responses that are restricted in some way, for example, responses that are positive or confined to an interval. Restrictions of this sort are easily incorporated into the modeling framework, whereas common models typically ignore interval restrictions and therefore become inappropriate, for example, when modeling Likert-type data. The model also extends common response time models, which can be treated as special cases. The properties of the model class are derived, and the role of the total score is investigated, leading to a modified total score. Several applications illustrate the use of the model, including an example in which covariates that may modify the response are taken into account.
Citations: 0
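The abstract's key point, that responses confined to an interval need a model that respects the bound, can be illustrated with a small simulation. This is a minimal sketch, not the authors' framework: the logit-normal link, the linear predictor a_j(θ_i − b_j), and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_bounded_responses(theta, a, b, sigma=0.5):
    """Simulate continuous responses restricted to (0, 1), e.g. rescaled
    Likert or percentage scores: the linear predictor a_j*(theta_i - b_j)
    plus normal noise on the logit scale is pushed through the logistic
    function, so every simulated response lies strictly inside the interval."""
    eta = a * (theta[:, None] - b) + rng.normal(0.0, sigma, (len(theta), len(b)))
    return 1.0 / (1.0 + np.exp(-eta))

theta = rng.normal(size=200)           # latent traits for 200 persons
a = np.array([1.0, 1.5, 0.8])          # item discriminations (assumed values)
b = np.array([-0.5, 0.0, 0.5])         # item difficulties (assumed values)
X = simulate_bounded_responses(theta, a, b)
```

A naive linear model on X can predict values outside (0, 1); the logistic link makes that impossible by construction, which is the kind of restriction the abstract says common models ignore.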