
Journal of Educational and Behavioral Statistics: Latest Publications

Cognitive Diagnosis Modeling Incorporating Response Times and Fixation Counts: Providing Comprehensive Feedback and Accurate Diagnosis
IF 2.4 | CAS Tier 3 (Psychology) | JCR Q2 (Education & Educational Research) | Pub Date: 2022-07-28 | DOI: 10.3102/10769986221111085
P. Zhan, K. Man, Stefanie A. Wind, Jonathan Malone
Respondents’ problem-solving behaviors comprise behaviors that represent complicated cognitive processes that are frequently systematically tied to one another. Biometric data, such as visual fixation counts (FCs), which are an important eye-tracking indicator, can be combined with other types of variables that reflect different aspects of problem-solving behavior to quantify variability in problem-solving behavior. To provide comprehensive feedback and accurate diagnosis when using such multimodal data, the present study proposes a multimodal joint cognitive diagnosis model that accounts for latent attributes, latent ability, processing speed, and visual engagement by simultaneously modeling response accuracy (RA), response times, and FCs. We used two simulation studies to test the feasibility of the proposed model. Findings mainly suggest that the parameters of the proposed model can be well recovered and that modeling FCs, in addition to RA and response times, could increase the comprehensiveness of feedback on problem-solving-related cognitive characteristics as well as the accuracy of knowledge structure diagnosis. An empirical example is used to demonstrate the applicability and benefits of the proposed model. We discuss the implications of our findings as they relate to research and practice.
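For orientation, the three modalities can be linked through correlated person parameters. The following sketch is generic (the symbols are ours, not the authors' exact specification): a cognitive diagnosis model for accuracy, a lognormal model for response times, and a count model for fixations.

```latex
% Generic joint-model sketch; Y_ij = accuracy, T_ij = response time,
% F_ij = fixation count for person i on item j (notation assumed).
\begin{aligned}
P(Y_{ij} = 1 \mid \boldsymbol{\alpha}_i) &\ \text{given by a cognitive diagnosis model (e.g., DINA)},\\
\log T_{ij} &\sim N\!\left(\beta_j - \tau_i,\ \sigma_j^2\right) \quad \text{(lognormal RT model)},\\
F_{ij} &\sim \mathrm{Poisson}\!\left(\exp(\delta_j + \eta_i)\right) \quad \text{(count model for fixations)}.
\end{aligned}
```

Estimating a joint distribution for the person parameters (latent attributes $\boldsymbol{\alpha}_i$, speed $\tau_i$, visual engagement $\eta_i$) is what lets response times and fixation counts sharpen the diagnosis of the latent attributes.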
Journal of Educational and Behavioral Statistics, Vol. 47, pp. 736–776.
Citations: 8
Testing Differential Item Functioning Without Predefined Anchor Items Using Robust Regression
IF 2.4 | CAS Tier 3 (Psychology) | JCR Q2 (Education & Educational Research) | Pub Date: 2022-07-18 | DOI: 10.3102/10769986221109208
Weimeng Wang, Yang Liu, Hongyun Liu
Differential item functioning (DIF) occurs when the probability of endorsing an item differs across groups for individuals with the same latent trait level. The presence of DIF items may jeopardize the validity of an instrument; therefore, it is crucial to identify DIF items in routine operations of educational assessment. While DIF detection procedures based on item response theory (IRT) have been widely used, a majority of IRT-based DIF tests assume predefined anchor (i.e., DIF-free) items. Not only is this assumption strong, but violations of it may also lead to erroneous inferences, for example, an inflated Type I error rate. We propose a general framework to define the effect sizes of DIF without a priori knowledge of anchor items. In particular, we quantify DIF by item-specific residuals from a regression model fitted to the true item parameters in the respective groups. Moreover, the null distribution of the proposed test statistic using a robust estimator can be derived analytically or approximated numerically even when there is a mix of DIF and non-DIF items, which yields asymptotically justified statistical inference. The Type I error rate and the power performance of the proposed procedure are evaluated and compared with the conventional likelihood-ratio DIF tests in a Monte Carlo experiment. Our simulation study has shown promising results in controlling the Type I error rate and the power of detecting DIF items.
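The core intuition can be sketched in a few lines. This is illustrative only: it flags DIF items via a robust (Huber) estimate of the common shift in item difficulties between two groups, whereas the paper's actual procedure fits a robust regression to group-specific item parameters and derives the null distribution analytically. All names, numbers, and the flagging threshold here are invented for the sketch.

```python
# Sketch: robust estimation of the common between-group difficulty shift,
# so that items deviating strongly from it stand out as DIF candidates.

def huber_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted means."""
    mu = sorted(x)[len(x) // 2]  # start from (an upper) median
    for _ in range(max_iter):
        # robust scale: median absolute deviation, rescaled for normality
        mad = sorted(abs(v - mu) for v in x)[len(x) // 2] / 0.6745 or 1.0
        # Huber weights: 1 inside the cutoff, downweighted outside
        w = [1.0 if abs(v - mu) / mad <= c else c * mad / abs(v - mu) for v in x]
        new_mu = sum(wi * vi for wi, vi in zip(w, x)) / sum(w)
        if abs(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu

# Difficulty shifts (focal minus reference) for 8 items; item 7 (index 6)
# carries a large extra shift, i.e., DIF.
shifts = [0.11, 0.08, 0.12, 0.09, 0.10, 0.13, 1.20, 0.07]
mu = huber_location(shifts)
flagged = [i for i, d in enumerate(shifts) if abs(d - mu) > 0.5]
print(flagged)  # only the DIF item lies far from the common shift
```

Because the outlying item gets a small Huber weight, it barely influences the estimated common shift, which is exactly why no predefined anchor set is needed.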
Journal of Educational and Behavioral Statistics, Vol. 47, pp. 666–692.
Citations: 8
Zero and One Inflated Item Response Theory Models for Bounded Continuous Data
IF 2.4 | CAS Tier 3 (Psychology) | JCR Q2 (Education & Educational Research) | Pub Date: 2022-07-15 | DOI: 10.3102/10769986221108455
D. Molenaar, M. Curi, Jorge L. Bazán
Bounded continuous data are encountered in many applications of item response theory, including the measurement of mood, personality, and response times and in the analyses of summed item scores. Although different item response theory models exist to analyze such bounded continuous data, most models assume the data to be in an open interval and cannot accommodate data in a closed interval. As a result, ad hoc transformations are needed to prevent scores on the bounds of the observed variables. To motivate the present study, we demonstrate in real and simulated data that this practice of fitting open interval models to closed interval data can substantially affect parameter estimates, even in cases with only 5% of the responses on one of the bounds of the observed variables. To address this problem, we propose a zero and one inflated item response theory modeling framework for bounded continuous responses in the closed interval. We illustrate how four existing models for bounded responses from the literature can be accommodated in the framework. The resulting zero and one inflated item response theory models are studied in a simulation study and a real data application to investigate parameter recovery, model fit, and the consequences of fitting the incorrect distribution to the data. We find that neglecting the bounded nature of the data biases parameters and that misspecification of the exact distribution may affect the results depending on the data generating model.
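The zero-and-one inflation idea can be made concrete with a mixed density: point masses at the bounds plus a continuous density on the interior. The beta version below is one common instance of such closed-interval models, not necessarily the authors' exact parameterization.

```python
# Minimal zero-and-one inflated beta (ZOIB) density: point masses p0 at 0
# and p1 at 1, with a Beta(a, b) density on the open interval (0, 1).
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) on the open interval (0, 1)."""
    log_const = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_const + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def zoib_density(x, p0, p1, a, b):
    """Mixed density: P(X=0) = p0, P(X=1) = p1, scaled beta in between."""
    if x == 0.0:
        return p0
    if x == 1.0:
        return p1
    return (1.0 - p0 - p1) * beta_pdf(x, a, b)

# Interior density at 0.5 with 10% zeros and 5% ones:
print(zoib_density(0.5, 0.1, 0.05, 2, 2))  # 0.85 * Beta(2,2) pdf at 0.5
```

Because the bounds get explicit probability mass, responses of exactly 0 or 1 no longer require an ad hoc rescaling before model fitting.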
Journal of Educational and Behavioral Statistics, Vol. 47, pp. 693–735.
Citations: 3
Forced-Choice Ranking Models for Raters’ Ranking Data
IF 2.4 | CAS Tier 3 (Psychology) | JCR Q2 (Education & Educational Research) | Pub Date: 2022-07-07 | DOI: 10.3102/10769986221104207
Su-Pin Hung, Hung-Yu Huang
To address response style or bias in rating scales, forced-choice items are often used to request that respondents rank their attitudes or preferences among a limited set of options. The rating scales used by raters to render judgments on ratees’ performance also contribute to rater bias or errors; consequently, forced-choice items have recently been employed for raters to rate how a ratee performs in certain defined traits. This study develops forced-choice ranking models (FCRMs) for data analysis when performance is evaluated by external raters or experts in a forced-choice ranking format. The proposed FCRMs consider different degrees of raters’ leniency/severity when modeling the selection probability in the generalized unfolding item response theory framework. They include an additional topic facet when multiple tasks are evaluated and incorporate variations in leniency parameters to capture the interactions between ratees and raters. The simulation results indicate that the parameters of the new models can be satisfactorily recovered and that better parameter recovery is associated with more item blocks, larger sample sizes, and a complete ranking design. A technological creativity assessment is presented as an empirical example with which to demonstrate the applicability and implications of the new models.
Journal of Educational and Behavioral Statistics, Vol. 47, pp. 603–634.
Citations: 1
Assessing Inter-rater Reliability With Heterogeneous Variance Components Models: Flexible Approach Accounting for Contextual Variables
IF 2.4 | CAS Tier 3 (Psychology) | JCR Q2 (Education & Educational Research) | Pub Date: 2022-07-05 | DOI: 10.3102/10769986221150517
Patrícia Martinková, František Bartoš, M. Brabec
Inter-rater reliability (IRR), which is a prerequisite of high-quality ratings and assessments, may be affected by contextual variables, such as the rater’s or ratee’s gender, major, or experience. Identification of such heterogeneity sources in IRR is important for the implementation of policies with the potential to decrease measurement error and to increase IRR by focusing on the most relevant subgroups. In this study, we propose a flexible approach for assessing IRR in cases of heterogeneity due to covariates by directly modeling differences in variance components. We use Bayes factors (BFs) to select the best performing model, and we suggest using Bayesian model averaging as an alternative approach for obtaining IRR and variance component estimates, allowing us to account for model uncertainty. We use inclusion BFs considering the whole model space to provide evidence for or against differences in variance components due to covariates. The proposed method is compared with other Bayesian and frequentist approaches in a simulation study, and we demonstrate its superiority in some situations. Finally, we provide real data examples from grant proposal peer review, demonstrating the usefulness of this method and its flexibility in the generalization of more complex designs.
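The model-averaging step reduces to a weighted combination of model-specific estimates. The toy sketch below uses invented numbers and model names; the paper computes Bayes factors for real heterogeneous variance-component models. With equal prior model probabilities, posterior weights are proportional to each model's Bayes factor against a common baseline.

```python
# Toy Bayesian model averaging of IRR estimates (all values invented):
# posterior model weights from Bayes factors vs. a shared baseline model.

bf_vs_baseline = {"homogeneous": 1.0, "by_gender": 6.0, "by_experience": 3.0}
irr_estimate = {"homogeneous": 0.62, "by_gender": 0.55, "by_experience": 0.58}

total = sum(bf_vs_baseline.values())
weights = {m: bf / total for m, bf in bf_vs_baseline.items()}

# Model-averaged IRR accounts for uncertainty about which model is right.
irr_bma = sum(weights[m] * irr_estimate[m] for m in weights)
print(round(irr_bma, 4))
```

The averaged estimate leans toward the best-supported model (here the gender-heterogeneous one) without discarding the others, which is the model-uncertainty accounting the abstract describes.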
Journal of Educational and Behavioral Statistics, Vol. 48, pp. 349–383.
Citations: 3
Pooling Interactions Into Error Terms in Multisite Experiments
IF 2.4 | CAS Tier 3 (Psychology) | JCR Q2 (Education & Educational Research) | Pub Date: 2022-07-04 | DOI: 10.3102/10769986221104800
Wendy Chan, L. Hedges
Multisite field experiments using the (generalized) randomized block design that assign treatments to individuals within sites are common in education and the social sciences. Under this design, there are two possible estimands of interest and they differ based on whether sites or blocks have fixed or random effects. When the average treatment effect is assumed to be identical across sites, it is common to omit site by treatment interactions and “pool” them into the error term in classical experimental design. However, prior work has not addressed the consequences of pooling when site by treatment interactions are not zero. This study assesses the impact of pooling on inference in the presence of nonzero site by treatment interactions. We derive the small sample distributions of the test statistics for treatment effects under pooling and illustrate the impacts on rejection rates when interactions are not zero. We use the results to offer recommendations to researchers conducting studies based on the multisite design.
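In generic notation (ours, not the article's), the contrast is between a model that retains the site-by-treatment interaction and one that pools it into the error term:

```latex
% Y_ik: outcome of individual i in site k; T_ik: treatment indicator
\begin{aligned}
\text{full model:}\quad Y_{ik} &= \mu + \alpha_k + \tau T_{ik} + \gamma_k T_{ik} + \varepsilon_{ik},\\
\text{pooled model:}\quad Y_{ik} &= \mu + \alpha_k + \tau T_{ik} + \varepsilon^{*}_{ik},
\qquad \varepsilon^{*}_{ik} = \gamma_k T_{ik} + \varepsilon_{ik}.
\end{aligned}
```

When $\operatorname{Var}(\gamma_k) > 0$, the pooled error $\varepsilon^{*}_{ik}$ is correlated with treatment status within sites, which is what distorts the distribution of the usual test statistics and the resulting rejection rates.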
Journal of Educational and Behavioral Statistics, Vol. 47, pp. 639–665.
Citations: 1
Improving Accuracy and Stability of Aggregate Student Growth Measures Using Empirical Best Linear Prediction
IF 2.4 | CAS Tier 3 (Psychology) | JCR Q2 (Education & Educational Research) | Pub Date: 2022-06-27 | DOI: 10.3102/10769986221101624
J. R. Lockwood, K. Castellano, D. McCaffrey
Many states and school districts in the United States use standardized test scores to compute annual measures of student achievement progress and then use school-level averages of these growth measures for various reporting and diagnostic purposes. These aggregate growth measures can vary consequentially from year to year for the same school, complicating their use and interpretation. We develop a method, based on the theory of empirical best linear prediction, to improve the accuracy and stability of aggregate growth measures by pooling information across grades, years, and tested subjects for individual schools. We demonstrate the performance of the method using both simulation and application to 6 years of annual growth measures from a large, urban school district. We provide code for implementing the method in the package schoolgrowth for the R environment.
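The idea behind empirical best linear prediction is easiest to see in its basic univariate shrinkage form. The paper's method pools across grades, years, and subjects (and ships as the R package schoolgrowth); the one-dimensional Python sketch below, with invented numbers, only conveys the core mechanism.

```python
# Minimal univariate sketch of empirical best linear prediction (EBLP):
# shrink a noisy school mean toward the grand mean by its reliability.

def eblp(school_mean, grand_mean, tau2, sigma2, n):
    """EBLP of a school's true mean growth.

    tau2:   between-school variance of true growth
    sigma2: within-school (student-level) variance
    n:      number of students contributing to the school mean
    """
    lam = tau2 / (tau2 + sigma2 / n)  # reliability of the observed mean
    return lam * school_mean + (1.0 - lam) * grand_mean

# A small school (n = 9) with a high observed mean is pulled halfway back
# toward the district average when the reliability weight is 0.5.
print(eblp(0.8, 0.5, 0.04, 0.36, 9))
```

Small schools (low reliability) get pulled strongly toward the grand mean, which is precisely what stabilizes their year-to-year aggregate growth measures.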
Journal of Educational and Behavioral Statistics, Vol. 47, pp. 544–575.
Citations: 1
Speed–Accuracy Trade-Off? Not So Fast: Marginal Changes in Speed Have Inconsistent Relationships With Accuracy in Real-World Settings
IF 2.4 | CAS Tier 3 (Psychology) | JCR Q2 (Education & Educational Research) | Pub Date: 2022-06-08 | DOI: 10.3102/10769986221099906
B. Domingue, K. Kanopka, B. Stenhaug, M. Sulik, Tanesia Beverly, Matthieu J. S. Brinkhuis, Ruhan Circi, Jessica Faul, Dandan Liao, Bruce McCandliss, Jelena Obradović, Chris Piech, Tenelle Porter, Project iLEAD Consortium, J. Soland, Jon Weeks, S. Wise, Jason D Yeatman
The speed–accuracy trade-off (SAT) suggests that time constraints reduce response accuracy. Its relevance in observational settings—where response time (RT) may not be constrained but respondent speed may still vary—is unclear. Using 29 data sets containing data from cognitive tasks, we use a flexible method for identification of the SAT (which we test in extensive simulation studies) to probe whether the SAT holds. We find inconsistent relationships between time and accuracy; marginal increases in time use for an individual do not necessarily predict increases in accuracy. Additionally, the speed–accuracy relationship may depend on the underlying difficulty of the interaction. We also consider the analysis of items and individuals; of particular interest is the observation that respondents who exhibit more within-person variation in response speed are typically of lower ability. We further find that RT is typically a weak predictor of response accuracy. Our findings document a range of empirical phenomena that should inform future modeling of RTs collected in observational settings.
Journal of Educational and Behavioral Statistics, Vol. 47, pp. 576–602.
Citations: 2
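The finding above, that marginal increases in an individual's time use need not raise accuracy, can be illustrated with a small simulation. This is a hypothetical sketch with invented parameter values, not the authors' identification method: two simulated respondents have the same marginal speed distribution, but only one has a within-person link from time use to accuracy.

```python
import math
import random

random.seed(0)

def simulate_person(n_items, slope):
    """Simulate (RT, correct) pairs; `slope` controls whether taking
    longer on an item actually raises the chance of answering correctly."""
    data = []
    for _ in range(n_items):
        rt = random.lognormvariate(0.0, 0.4)  # response time in seconds
        p = 1.0 / (1.0 + math.exp(-(0.2 + slope * math.log(rt))))
        data.append((rt, 1 if random.random() < p else 0))
    return data

def rt_accuracy_corr(data):
    """Point-biserial correlation between log RT and correctness."""
    xs = [math.log(rt) for rt, _ in data]
    ys = [float(c) for _, c in data]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Same marginal speed distribution, very different time-accuracy links:
flat = rt_accuracy_corr(simulate_person(2000, slope=0.0))
steep = rt_accuracy_corr(simulate_person(2000, slope=1.5))
print(f"slope 0.0 -> corr {flat:+.3f}; slope 1.5 -> corr {steep:+.3f}")
```

Observed RT distributions alone cannot distinguish the two cases, which is one way marginal time use can be uninformative about accuracy.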
What Is Actually Equated in “Test Equating”? A Didactic Note
IF 2.4 Zone 3 Psychology Q2 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2022-06-01 DOI: 10.3102/10769986211072308
Wim J. van der Linden
The current literature on test equating generally defines it as the process necessary to obtain score comparability between different test forms. The definition contrasts with Lord’s foundational paper, which viewed equating as the process required to obtain comparability of measurement scale between forms. The distinction between the notions of scale and score is not trivial. The difference is explained by connecting these notions with such standard statistical concepts as probability experiment, sample space, and random variable. The probability experiment underlying equating test forms with random scores immediately gives us the equating transformation as a function mapping the scale of one form into the other, and thus supports the point of view taken by Lord. However, both Lord’s view and the current literature appear to rely on the idea of an experiment with random examinees, which implies a different notion of test scores. It is shown that an explicit choice between the two experiments is not just important for our theoretical understanding of key notions in test equating but also has important practical consequences.
Journal of Educational and Behavioral Statistics, 47(1), 353–362.
Citations: 1
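Whichever view one takes, scale or score, the machinery of equating is a transformation mapping one form onto the other. As background to the note above, here is a minimal sketch of linear (mean-sigma) observed-score equating on hypothetical data; it illustrates the transformation itself, not the paper's argument about what it equates.

```python
import statistics

def linear_equating(x_scores, y_scores):
    """Return a function mapping scores on form X onto the scale of
    form Y by matching means and standard deviations (mean-sigma method)."""
    mu_x, mu_y = statistics.mean(x_scores), statistics.mean(y_scores)
    sd_x, sd_y = statistics.pstdev(x_scores), statistics.pstdev(y_scores)

    def to_y_scale(x):
        return mu_y + (sd_y / sd_x) * (x - mu_x)

    return to_y_scale

# Hypothetical score samples from two randomly equivalent groups:
form_x = [10, 14, 18, 22, 26]
form_y = [25, 33, 41, 49, 57]
equate = linear_equating(form_x, form_y)
print(equate(18.0))  # the mean of form X maps onto the mean of form Y
```

Note that the transformation acts on the scale: every possible form-X value gets a form-Y counterpart, whether or not any examinee obtained it.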
Two Statistical Tests for the Detection of Item Compromise
IF 2.4 Zone 3 Psychology Q2 EDUCATION & EDUCATIONAL RESEARCH Pub Date: 2022-05-11 DOI: 10.3102/10769986221094789
W. van der Linden
Two independent statistical tests of item compromise are presented, one based on the test takers’ responses and the other on their response times (RTs) on the same items. The tests can be used to monitor an item in real time during online continuous testing but are also applicable as part of a post hoc forensic analysis. The two test statistics are simple, intuitive quantities: the sums of the responses and RTs observed for the test takers on the item. Common features of the tests are ease of interpretation and computational simplicity. Both tests are uniformly most powerful under the assumption of known ability and speed parameters for the test takers. Examples of power functions for items with realistic parameter values suggest maximum power for 20–30 test takers with item preknowledge for the response-based test and 10–20 test takers for the RT-based test.
Journal of Educational and Behavioral Statistics, 47(1), 485–504.
Citations: 1
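The flavor of the response-based statistic above can be sketched with a normal approximation: under the null hypothesis of no compromise, each response to the item is Bernoulli with a probability implied by the test taker's known ability, so the sum of responses has a known mean and variance. This is an illustrative stand-in for the paper's exact test; the function name and data are invented.

```python
import math

def response_sum_test(responses, null_probs):
    """One-sided z-test for an inflated number of correct responses.

    `responses` are 0/1 scores on the suspect item; `null_probs` are the
    correct-response probabilities implied by each test taker's (known)
    ability under the null hypothesis of no compromise.
    """
    s = sum(responses)
    mu = sum(null_probs)
    var = sum(p * (1.0 - p) for p in null_probs)
    z = (s - mu) / math.sqrt(var)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # normal upper tail
    return z, p_value

# 20 test takers, each expected to answer correctly half the time,
# yet all 20 answer correctly: strong evidence of compromise.
z, p = response_sum_test([1] * 20, [0.5] * 20)
print(f"z = {z:.2f}, p = {p:.2e}")
```

An analogous statistic for the RT-based test would sum the (log) response times and flag values that are too small given the test takers' known speed parameters.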