Applied Psychological Measurement: Latest Articles

Controlling the Minimum Item Exposure Rate in Computerized Adaptive Testing: A Two-Stage Sympson–Hetter Procedure
Zone 4, Psychology | Q2 Social Sciences | Pub Date: 2023-10-20 | DOI: 10.1177/01466216231209756
Hsiu-Yi Chao, Jyun-Hong Chen
Computerized adaptive testing (CAT) can improve test efficiency, but it also causes the problem of unbalanced item usage within a pool. The effect of uneven item exposure rates can not only induce a test security problem due to overexposed items but also raise economic concerns about item pool development due to underexposed items. Therefore, this study proposes a two-stage Sympson–Hetter (TSH) method to enhance balanced item pool utilization by simultaneously controlling the minimum and maximum item exposure rates. The TSH method divides CAT into two stages. While the item exposure rates are controlled above a prespecified level (e.g., r_min) in the first stage to increase the exposure rates of the underexposed items, they are controlled below another prespecified level (e.g., r_max) in the second stage to prevent items from overexposure. To reduce the effect on trait estimation, TSH only administers a minimum sufficient number of underexposed items that are generally less discriminating in the first stage of CAT. The simulation study results indicate that the TSH method can effectively improve item pool usage without clearly compromising trait estimation precision in most conditions while maintaining the required level of test security.
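
For orientation, the classic single-stage Sympson–Hetter filter that the two-stage method builds on can be sketched roughly as follows. This is only an illustration of the standard probabilistic filter, not the authors' TSH procedure; the function names `sh_select` and `exposure_rates`, the fixed item ranking, and the exposure-control parameters `k` are all hypothetical.

```python
import random


def sh_select(ranked_items, k, rng=random):
    """Pick one item with a Sympson-Hetter style probabilistic filter.

    ranked_items: item ids ordered from most to least informative at the
                  current ability estimate (assumed to be given).
    k:            dict mapping item id -> exposure-control parameter in
                  [0, 1] (hypothetical values; in practice these are tuned
                  by simulation so exposure stays below the maximum rate).
    """
    for item in ranked_items:
        # Administer the candidate with probability k[item];
        # otherwise skip it and try the next-best item.
        if rng.random() < k[item]:
            return item
    # Fallback if every candidate was rejected: use the last one.
    return ranked_items[-1]


def exposure_rates(ranked_items, k, n_examinees=10000, seed=0):
    """Simulate how often each item is chosen under a fixed ranking."""
    rng = random.Random(seed)
    counts = {item: 0 for item in ranked_items}
    for _ in range(n_examinees):
        counts[sh_select(ranked_items, k, rng)] += 1
    return {item: c / n_examinees for item, c in counts.items()}
```

With `k = {"a": 0.2, "b": 1.0, "c": 1.0}` and ranking `["a", "b", "c"]`, the most informative item "a" is administered to roughly 20% of simulated examinees and the rest fall through to "b". The filter shown here only caps exposure from above; controlling a minimum rate, as the abstract describes, needs additional machinery that this sketch does not attempt.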
Citations: 0
Two Statistics for Measuring the Score Comparability of Computerized Adaptive Tests
Zone 4, Psychology | Q2 Social Sciences | Pub Date: 2023-10-19 | DOI: 10.1177/01466216231209749
Adam E. Wyse
This study introduces two new statistics for measuring the score comparability of computerized adaptive tests (CATs) based on comparing conditional standard errors of measurement (CSEMs) for examinees that achieved the same scale scores. One statistic is designed to evaluate score comparability of alternate CAT forms for individual scale scores, while the other statistic is designed to evaluate the overall score comparability of alternate CAT forms. The effectiveness of the new statistics is illustrated using data from grade 3 through 8 reading and math CATs. Results suggest that both CATs demonstrated reasonably high levels of score comparability, that score comparability was less at very high or low scores where few students score, and that using random samples with fewer students per grade did not have a big impact on score comparability. Results also suggested that score comparability was sometimes higher when the bottom 20% of scorers were used to calculate overall score comparability compared to all students. Additional discussion related to applying the statistics in different contexts is provided.
Citations: 0
Efficiency Analysis of Item Response Theory Kernel Equating for Mixed-Format Tests
Zone 4, Psychology | Q2 Social Sciences | Pub Date: 2023-10-19 | DOI: 10.1177/01466216231209757
Joakim Wallmark, Maria Josefsson, Marie Wiberg
This study aims to evaluate the performance of Item Response Theory (IRT) kernel equating in the context of mixed-format tests by comparing it to IRT observed score equating and kernel equating with log-linear presmoothing. Comparisons were made through both simulations and real data applications, under both equivalent groups (EG) and non-equivalent groups with anchor test (NEAT) sampling designs. To prevent bias towards IRT methods, data were simulated with and without the use of IRT models. The results suggest that the difference between IRT kernel equating and IRT observed score equating is minimal, both in terms of the equated scores and their standard errors. The application of IRT models for presmoothing yielded smaller standard error of equating than the log-linear presmoothing approach. When test data were generated using IRT models, IRT-based methods proved less biased than log-linear kernel equating. However, when data were simulated without IRT models, log-linear kernel equating showed less bias. Overall, IRT kernel equating shows great promise when equating mixed-format tests.
Citations: 0
Comparing Person-Fit and Traditional Indices Across Careless Response Patterns in Surveys.
IF 1 | Zone 4, Psychology | Q4 Psychology, Mathematical | Pub Date: 2023-09-01 | Epub Date: 2023-08-03 | DOI: 10.1177/01466216231194358
Eli A Jones, Stefanie A Wind, Chia-Lin Tsai, Yuan Ge

Methods to identify carelessness in survey research can be valuable tools in reducing bias during survey development, validation, and use. Because carelessness may take multiple forms, researchers typically use multiple indices when identifying carelessness. In the current study, we extend the literature on careless response identification by examining the usefulness of three item-response theory-based person-fit indices for both random and overconsistent careless response identification: infit MSE, outfit MSE, and the polytomous l_z statistic. We compared these statistics with traditional careless response indices using both empirical data and simulated data. The empirical data included 2,049 high school student surveys of teaching effectiveness from the Network for Educator Effectiveness. In the simulated data, we manipulated type of carelessness (random response or overconsistency) and percent of carelessness present (0%, 5%, 10%, 20%). Results suggest that infit and outfit MSE and the l_z statistic may provide complementary information to traditional indices such as LongString, Mahalanobis Distance, Validity Items, and Completion Time. Receiver operating characteristic curves suggested that the person-fit indices showed good sensitivity and specificity for classifying both over-consistent and under-consistent careless patterns, thus functioning in a bidirectional manner. Carelessness classifications based on low fit values correlated with carelessness classifications from LongString and completion time, and classifications based on high fit values correlated with classifications from Mahalanobis Distance. We consider implications for research and practice.
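
Two of the indices named above are simple enough to sketch. The block below is a rough illustration under stated assumptions, not the study's code: `longstring` computes the longest run of identical consecutive responses, and `rasch_fit` computes infit and outfit mean squares for one respondent under a dichotomous Rasch model (the study uses polytomous data, so this is a deliberate simplification). Both function names and all inputs are hypothetical.

```python
import math
from itertools import groupby


def longstring(responses):
    """Length of the longest run of identical consecutive responses."""
    return max(len(list(run)) for _, run in groupby(responses))


def rasch_fit(responses, theta, difficulties):
    """Infit and outfit mean squares for one person (dichotomous Rasch).

    responses:    list of 0/1 item scores.
    theta:        the person's ability estimate (assumed known here).
    difficulties: item difficulty estimates, one per response.
    """
    # Model-expected probability of a 1 on each item.
    p = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
    sq_resid = [(x - pi) ** 2 for x, pi in zip(responses, p)]
    var = [pi * (1.0 - pi) for pi in p]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, var)) / len(p)
    # Infit: information-weighted version (sum of squared residuals
    # divided by the sum of model variances).
    infit = sum(sq_resid) / sum(var)
    return infit, outfit
```

For example, `longstring([3, 3, 3, 1, 2, 2])` is 3. Mean squares near 1 indicate responses about as noisy as the model expects; values well above 1 point to erratic responding and values well below 1 to overly predictable responding, which is consistent with the bidirectional use of fit values described in the abstract.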

Citations: 0
Does Sparseness Matter? Examining the Use of Generalizability Theory and Many-Facet Rasch Measurement in Sparse Rating Designs.
IF 1 | Zone 4, Psychology | Q4 Psychology, Mathematical | Pub Date: 2023-09-01 | Epub Date: 2023-06-07 | DOI: 10.1177/01466216231182148
Stefanie A Wind, Eli Jones, Sara Grajeda

Sparse rating designs, where each examinee's performance is scored by a small proportion of raters, are prevalent in practical performance assessments. However, relatively little research has focused on the degree to which different analytic techniques alert researchers to rater effects in such designs. We used a simulation study to compare the information provided by two popular approaches: Generalizability theory (G theory) and Many-Facet Rasch (MFR) measurement. In previous comparisons, researchers used complete data that were not simulated, thus limiting their ability to manipulate characteristics such as rater effects, and to understand the impact of incomplete data on the results. Both approaches provided information about rating quality in sparse designs, but the MFR approach highlighted rater effects related to centrality and bias more readily than G theory.

Citations: 0
The Effects of Aberrant Responding on Model-Fit Assuming Different Underlying Response Processes.
IF 1 | Zone 4, Psychology | Q4 Psychology, Mathematical | Pub Date: 2023-09-01 | Epub Date: 2023-09-19 | DOI: 10.1177/01466216231201987
Jennifer Reimers, Ronna C Turner, Jorge N Tendeiro, Wen-Juo Lo, Elizabeth Keiffer

Aberrant responding on tests and surveys has been shown to affect the psychometric properties of scales and the statistical analyses from the use of those scales in cumulative model contexts. This study extends prior research by comparing the effects of four types of aberrant responding on model fit in both cumulative and ideal point model contexts using graded partial credit (GPCM) and generalized graded unfolding (GGUM) models. When fitting models to data, model misfit can be both a function of misspecification and aberrant responding. Results demonstrate how varying levels of aberrant data can severely impact model fit for both cumulative and ideal point data. Specifically, longstring responses have a stronger impact on dimensionality for both ideal point and cumulative data, while random responding tends to have the most negative impact on data model fit according to information criteria (AIC, BIC). The results also indicate that ideal point data models such as GGUM may be able to fit cumulative data as well as the cumulative model itself (GPCM), whereas cumulative data models may not provide sufficient model fit for data simulated using an ideal point model.
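
The information criteria mentioned above follow standard definitions, which can be written out as a small helper. The function names and the log-likelihood values in the usage note are hypothetical and are not taken from the study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood
```

For both criteria, lower values indicate a better trade-off between fit and complexity. As a made-up illustration, a model with log-likelihood -100 and 5 parameters has `aic(-100, 5) == 210`; a competing model would need a lower AIC to be preferred on this criterion.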

Citations: 0
Using Item Scores and Distractors to Detect Test Speededness.
IF 1 | Zone 4, Psychology | Q4 Psychology, Mathematical | Pub Date: 2023-09-01 | Epub Date: 2023-06-15 | DOI: 10.1177/01466216231182149
Kylie Gorney, James A Wollack, Daniel M Bolt

Test speededness refers to a situation in which examinee performance is inadvertently affected by the time limit of the test. Because speededness has the potential to severely bias both person and item parameter estimates, it is crucial that speeded examinees are detected. In this article, we develop a change-point analysis (CPA) procedure for detecting test speededness. Our procedure distinguishes itself from existing CPA procedures by using information from both item scores and distractors. Using detailed simulations, we show that under most conditions, the new CPA procedure improves the detection of speeded examinees and produces more accurate change-point estimates. It therefore seems there is a considerable amount of information to be gained from the item distractors, which, quite notably, are available in all multiple-choice data. A real data example is also provided.
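
The general idea of a change-point scan can be illustrated with a deliberately simplified sketch. This is not the authors' procedure: it uses only 0/1 item scores, ignores item difficulty and distractor information, and performs no significance test. The function name `mean_shift_change_point` is hypothetical.

```python
def mean_shift_change_point(scores):
    """Scan all split points and return (k, gap).

    k is the number of items before the split that maximizes the drop in
    mean item score from the first k items to the remaining items; gap is
    the size of that drop.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two item scores")
    best_k, best_gap = None, float("-inf")
    for k in range(1, n):
        before = sum(scores[:k]) / k
        after = sum(scores[k:]) / (n - k)
        gap = before - after  # positive when performance drops late in the test
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k, best_gap
```

For the response pattern `[1, 1, 1, 1, 0, 0, 0, 0]`, the scan returns a change point after item 4 with a gap of 1.0, the kind of late-test collapse that speededness would produce. A realistic procedure would compare the observed statistic against its distribution under a no-change model before flagging an examinee.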

Citations: 0
Sequential Bayesian Ability Estimation Applied to Mixed-Format Item Tests.
IF 1 | Zone 4, Psychology | Q4 Psychology, Mathematical | Pub Date: 2023-09-01 | Epub Date: 2023-09-08 | DOI: 10.1177/01466216231201986
Jiawei Xiong, Allan S Cohen, Xinhui Maggie Xiong

Large-scale tests often contain mixed-format items, such as when multiple-choice (MC) items and constructed-response (CR) items are both contained in the same test. Although previous research has analyzed both types of items simultaneously, this may not always provide the best estimate of ability. In this paper, a two-step sequential Bayesian (SB) analytic method under the concept of empirical Bayes is explored for mixed item response models. This method integrates ability estimates from different item formats. Unlike the empirical Bayes method, the SB method estimates examinees' posterior ability parameters with individual-level sample-dependent prior distributions estimated from the MC items. Simulations were used to evaluate the accuracy of recovery of ability and item parameters over four factors: the type of the ability distribution, sample size, test length (number of items for each item type), and person/item parameter estimation method. The SB method was compared with a traditional concurrent Bayesian (CB) calibration method, EAPsum, that uses scaled scores for summed scores to estimate parameters from the MC and CR items simultaneously in one estimation step. From the simulation results, the SB method showed more accurate and reliable ability estimation than the CB method, especially when the sample size was small (150 and 500). Both methods presented similar recovery results for MC item parameters, but the CB method yielded a bit better recovery of the CR item parameters. The empirical example suggested that posterior ability estimated by the proposed SB method had higher reliability than the CB method.
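
The two-step idea, in which the posterior from one item block serves as the prior for the next, can be sketched on a grid of ability values. This is a minimal sketch under strong assumptions and not the authors' SB method: item parameters are treated as known, both blocks use a dichotomous 2PL model (real constructed-response items would be polytomous), and every parameter value and response below is made up.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def update(prior, grid, responses, items):
    """Multiply a gridded prior by the response likelihood and renormalize."""
    post = []
    for w, theta in zip(prior, grid):
        like = 1.0
        for x, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            like *= p if x == 1 else 1.0 - p
        post.append(w * like)
    total = sum(post)
    return [v / total for v in post]


def eap(weights, grid):
    """Posterior mean of ability on the grid."""
    return sum(w * t for w, t in zip(weights, grid))


# Ability grid from -4 to 4 with a standard normal prior.
grid = [-4.0 + 0.1 * i for i in range(81)]
raw = [math.exp(-0.5 * t * t) for t in grid]
norm = sum(raw)
prior = [w / norm for w in raw]

# Hypothetical (discrimination, difficulty) pairs and responses.
mc_items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]
cr_items = [(1.5, 0.2), (1.0, 1.0)]
mc_resp = [1, 1, 0]
cr_resp = [1, 0]

step1 = update(prior, grid, mc_resp, mc_items)  # posterior after the MC block
step2 = update(step1, grid, cr_resp, cr_items)  # MC posterior used as the CR prior
theta_sb = eap(step2, grid)
```

With fixed, known item parameters, this chained update gives the same posterior as a single update on all five items, so the sketch shows only the mechanics of passing an individual-level prior from one block to the next. It does not model item-parameter estimation and therefore does not reproduce the comparison with the concurrent method reported in the abstract.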

引用次数: 0
Modeling Rating Order Effects Under Item Response Theory Models for Rater-Mediated Assessments. 评价中介评价项目反应理论模型下评价顺序效应的建模。
IF 1.2 Zone 4 Psychology Q2 Social Sciences Pub Date: 2023-06-01 DOI: 10.1177/01466216231174566
Hung-Yu Huang

Rater effects are commonly observed in rater-mediated assessments. By using item response theory (IRT) modeling, raters can be treated as independent factors that function as instruments for measuring ratees. Most rater effects are static and can be addressed appropriately within an IRT framework, and a few models have been developed for dynamic rater effects. Operational rating projects often require human raters to continuously and repeatedly score ratees over a certain period, imposing a burden on the cognitive processing abilities and attention spans of raters that stems from judgment fatigue and thus affects the rating quality observed during the rating period. As a result, ratees' scores may be influenced by the order in which they are graded by raters in a rating sequence, and the rating order effect should be considered in new IRT models. In this study, two types of many-faceted (MF)-IRT models are developed to account for such dynamic rater effects, which assume that rater severity can drift systematically or stochastically. The results obtained from two simulation studies indicate that the parameters of the newly developed models can be estimated satisfactorily using Bayesian estimation and that disregarding the rating order effect produces biased model structure and ratee proficiency parameter estimations. A creativity assessment is outlined to demonstrate the application of the new models and to investigate the consequences of failing to detect the possible rating order effect in a real rater-mediated evaluation.

Citations: 1
A Mixed Sequential IRT Model for Mixed-Format Items.
IF 1.2 Zone 4 Psychology Q2 Social Sciences Pub Date: 2023-06-01 Epub Date: 2023-03-17 DOI: 10.1177/01466216231165302
Junhuan Wei, Yan Cai, Dongbo Tu

To provide more insight into an individual's response process and cognitive process, this study proposed three mixed sequential item response models (MS-IRMs) for mixed-format items consisting of a mixture of a multiple-choice item and an open-ended item that emphasize a sequential response process and are scored sequentially. Relative to existing polytomous models such as the graded response model (GRM), generalized partial credit model (GPCM), or traditional sequential Rasch model (SRM), the proposed models employ an appropriate processing function for each task to improve conventional polytomous models. Simulation studies were carried out to investigate the performance of the proposed models, and the results indicated that all proposed models outperformed the SRM, GRM, and GPCM in terms of parameter recovery and model fit. An application illustration of the MS-IRMs in comparison with traditional models was demonstrated by using real data from TIMSS 2007.

Citations: 0