
Latest publications in the International Journal of Testing

ITC Guidelines for Translating and Adapting Tests (Second Edition)
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-04-03 | DOI: 10.1080/15305058.2017.1398166
The second edition of the International Test Commission Guidelines for Translating and Adapting Tests was prepared between 2005 and 2015 to improve upon the first edition, and to respond to advances in testing technology and practices. The 18 guidelines are organized into six categories to facilitate their use: pre-condition (3), test development (5), confirmation (4), administration (2), scoring and interpretation (2), and documentation (2). For each guideline, an explanation is provided along with suggestions for practice. A checklist is provided to improve the implementation of the guidelines.
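The six category counts listed in the abstract account for all 18 guidelines. A minimal sketch of that bookkeeping as a checklist-style data structure (the category names and counts come from the abstract; the helper function and its use are illustrative assumptions):

```python
# Guideline categories and counts as listed in the abstract of the second edition.
GUIDELINE_CATEGORIES = {
    "pre-condition": 3,
    "test development": 5,
    "confirmation": 4,
    "administration": 2,
    "scoring and interpretation": 2,
    "documentation": 2,
}

# Sanity check: the six categories together cover all 18 guidelines.
assert sum(GUIDELINE_CATEGORIES.values()) == 18

def outstanding_categories(completed):
    """Return the guideline categories a test-adaptation project has not yet documented."""
    return [category for category in GUIDELINE_CATEGORIES if category not in completed]

print(outstanding_categories({"pre-condition", "test development"}))
```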
Citations: 403
Detecting Curvilinear Relationships: A Comparison of Scoring Approaches Based on Different Item Response Models
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-04-03 | DOI: 10.1080/15305058.2017.1345913
Mengyang Cao, Q. Song, L. Tay
There is a growing use of noncognitive assessments around the world, and recent research has posited an ideal point response process underlying such measures. A critical issue is whether the typical use of dominance approaches (e.g., average scores, factor analysis, and Samejima's graded response model) in scoring such measures is adequate. This study examined the performance of an ideal point scoring approach (e.g., the generalized graded unfolding model) as compared to the typical dominance scoring approaches in detecting curvilinear relationships between a scored trait and an external variable. Simulation results showed that when data followed the ideal point model, the ideal point approach generally exhibited more power and provided more accurate estimates of curvilinear effects than the dominance approaches. No substantial difference was found between the ideal point and dominance scoring approaches in terms of Type I error rate and bias across different sample sizes and scale lengths, although skewness in the distribution of the trait and the external variable can potentially reduce statistical power. For dominance data, the ideal point scoring approach exhibited convergence problems in most conditions and failed to perform as well as the dominance scoring approaches. Practical implications for scoring responses to Likert-type surveys to examine curvilinear effects are discussed.
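A minimal sketch of the kind of curvilinearity check the abstract describes: regress an external variable on a scored trait and its square, and compare fit with the linear model. The simulated data and variable names are illustrative assumptions, not the authors' simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a scored trait (e.g., a scale score) and an external variable
# that depends on the trait quadratically.
theta = rng.normal(size=2000)                      # scored trait
y = 0.4 * theta - 0.3 * theta**2 + rng.normal(scale=1.0, size=2000)

def r_squared(X, y):
    """R^2 from an ordinary least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

ones = np.ones_like(theta)
linear    = np.column_stack([ones, theta])
quadratic = np.column_stack([ones, theta, theta**2])

# A noticeably larger R^2 for the quadratic model flags a curvilinear relationship.
print(f"linear R^2    = {r_squared(linear, y):.3f}")
print(f"quadratic R^2 = {r_squared(quadratic, y):.3f}")
```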
Citations: 6
Response Time Based Nonparametric Kullback-Leibler Divergence Measure for Detecting Aberrant Test-Taking Behavior
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-02-28 | DOI: 10.1080/15305058.2018.1429446
K. Man, Jeffery R. Harring, Yunbo Ouyang, Sarah L. Thomas
Many important high-stakes decisions—college admission, academic performance evaluation, and even job promotion—depend on accurate and reliable scores from valid large-scale assessments. However, examinees sometimes cheat by copying answers from other test-takers or practicing with test items ahead of time, which can undermine the effectiveness of such assessments in yielding accurate, precise information of examinees' performances. This study focuses on the utility of a new nonparametric person-fit index using examinees' response times to detect two types of cheating behaviors. The feasibility of this method was investigated vis-à-vis a Monte Carlo simulation as well as through analyzing data from a large-scale assessment. Findings indicate that the proposed index was quite successful in detecting pre-knowledge cheating and extreme one-item cheating.
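A minimal sketch of a histogram-based Kullback-Leibler comparison between one examinee's response times and those of the reference group, in the spirit of (but not identical to) the index studied here; the log-time binning, simulated data, and interpretation threshold are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """Discrete KL divergence D(p || q) for two count or probability vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def response_time_kl(examinee_rt, group_rt, n_bins=20):
    """Compare an examinee's log response-time distribution with the group's."""
    log_all = np.log(np.concatenate([examinee_rt, group_rt]))
    bins = np.linspace(log_all.min(), log_all.max(), n_bins + 1)
    p, _ = np.histogram(np.log(examinee_rt), bins=bins)
    q, _ = np.histogram(np.log(group_rt), bins=bins)
    return kl_divergence(p, q)

rng = np.random.default_rng(1)
group = rng.lognormal(mean=3.0, sigma=0.5, size=(500, 40)).ravel()   # typical examinees
cheater = rng.lognormal(mean=2.0, sigma=0.3, size=40)                # suspiciously fast examinee
print(response_time_kl(cheater, group))   # large values suggest aberrant test-taking behavior
```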
Citations: 19
FIPC Linking Across Multidimensional Test Forms: Effects of Confounding Difficulty within Dimensions
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-02-27 | DOI: 10.1080/15305058.2018.1428980
S. Kim, Ki Cole, M. Mwavita
This study investigated the effects of linking potentially multidimensional test forms using the fixed item parameter calibration. Forms had equal or unequal total test difficulty with and without confounding difficulty. The mean square errors and bias of estimated item and ability parameters were compared across the various confounding tests. The estimated discrimination parameters were influenced by the levels of correlation between dimensions. The mean square errors (MSEs) of the average of the true discrimination parameters with the estimated value were smallest when the correlation equaled 0; however, the MSEs of the multidimensional discrimination parameter were smallest when the correlation was larger than 0. The estimated difficulty parameters were highly affected by different amounts of confounding difficulty within dimensions. Furthermore, the MSEs of the average of the true ability parameters on the first and second dimensions with the estimated ability were smaller than those from the ability parameter on each dimension for all conditions. The pattern varied according to the number of common items, and the measures of MSE and squared bias were relatively consistent across forms at the same level of correlation, except for the condition where the correlation was 0 and the number of common items was 8.
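The evaluation criteria named in the abstract, mean square error and bias of estimated parameters against their generating values, reduce to a few lines of code; the simulated difficulty parameters below are illustrative placeholders, not the study's conditions.

```python
import numpy as np

def mse(estimates, truth):
    """Mean square error of parameter estimates over replications."""
    estimates, truth = np.asarray(estimates), np.asarray(truth)
    return float(np.mean((estimates - truth) ** 2))

def bias(estimates, truth):
    """Mean signed deviation of estimates from the generating values."""
    estimates, truth = np.asarray(estimates), np.asarray(truth)
    return float(np.mean(estimates - truth))

# Illustrative example: recovery of item difficulty parameters across 100 replications.
rng = np.random.default_rng(2)
true_b = rng.normal(size=30)                              # generating difficulties
est_b = true_b + rng.normal(scale=0.15, size=(100, 30))   # estimates per replication
print(f"MSE  = {mse(est_b, true_b):.4f}")
print(f"bias = {bias(est_b, true_b):+.4f}")
```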
Citations: 2
Effects of Situational Judgment Test Format on Reliability and Validity
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-02-23 | DOI: 10.1080/15305058.2018.1428981
Michelle P. Martín-Raugh, Cristina Anguiano-Carrasco, Teresa Jackson, Meghan W. Brenneman, Lauren M. Carney, Patrick V. Barnwell, Jonathan F. Kochert
Single-response situational judgment tests (SRSJTs) differ from multiple-response SJTs (MRSJTs) in that they present test takers with edited critical incidents and simply ask them to read over the action described and evaluate it according to its effectiveness. Research comparing the reliability and validity of SRSJTs and MRSJTs is thus far extremely limited. The study reported here directly compares forms of an SRSJT and an MRSJT and explores the reliability, convergent validity, and predictive validity of each format. Results from this investigation present preliminary evidence to suggest SRSJTs may produce internal consistency reliability, convergent validity, and predictive validity estimates that are comparable to those achieved with many traditional MRSJTs. We conclude by discussing practical implications for personnel selection and assessment, and future research in psychological science more broadly.
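A minimal sketch of the reliability and validity summaries being compared: coefficient alpha for internal consistency and a Pearson correlation for predictive validity. The simulated item scores and criterion are illustrative placeholders, not the study's SJT data.

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (examinees x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(3)
n, k = 400, 12
common = rng.normal(size=(n, 1))                                    # shared judgment factor
sjt = common + rng.normal(scale=1.0, size=(n, k))                   # simulated SJT item scores
criterion = 0.5 * common.ravel() + rng.normal(scale=1.0, size=n)    # job performance proxy

total = sjt.sum(axis=1)
print(f"alpha               = {cronbach_alpha(sjt):.3f}")
print(f"predictive validity = {np.corrcoef(total, criterion)[0, 1]:.3f}")
```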
Citations: 1
Adding Value to Second-Language Listening and Reading Subscores: Using a Score Augmentation Approach
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-01-04 | DOI: 10.1080/15305058.2017.1407766
S. Papageorgiou, Ikkyu Choi
This study examined whether reporting subscores for groups of items within a test section assessing a second-language modality (specifically reading or listening comprehension) added value from a measurement perspective to the information already provided by the section scores. We analyzed the responses of 116,489 test takers to reading and listening items from operational administrations of two large-scale international tests of English as a foreign language. To “strengthen” the reliability of the subscores, and thus improve their added value, we applied a score augmentation method (Haberman, 2008). In doing so, our aim was to examine whether reporting augmented subscores for specific groups of reading and listening items could improve the added value of these subscores and consequently justify providing more fine-grained information about test taker performance. Our analysis indicated that in general, there was lack of support for reporting subscores from a psychometric perspective, and that score augmentation marginally improved the added value of the subscores. We discuss several implications of our findings for test developers wishing to report more fine-grained information about test performance. We conclude by arguing that research on how to best report such refined feedback should remain the focus of future efforts related to second-language proficiency tests.
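A minimal sketch of the reliability-weighted idea behind subscore augmentation: shrink the observed subscore toward the group mean and borrow strength from the total score. This is a simplified Kelley-style approximation for illustration; Haberman's (2008) procedure derives the weights from classical test theory quantities rather than the fixed constant used here, and the simulated scores are placeholders.

```python
import numpy as np

def kelley_shrunken(sub, reliability):
    """Kelley estimate: shrink the observed subscore toward the group mean."""
    sub = np.asarray(sub, float)
    return sub.mean() + reliability * (sub - sub.mean())

def borrow_from_total(sub, total):
    """Predict the subscore from the total score by least squares."""
    sub, total = np.asarray(sub, float), np.asarray(total, float)
    X = np.column_stack([np.ones_like(total), total])
    beta, *_ = np.linalg.lstsq(X, sub, rcond=None)
    return X @ beta

def augmented(sub, total, reliability, w=0.5):
    """Mix the two sources; `w` is an illustrative constant, not Haberman's derived weight."""
    return (1 - w) * kelley_shrunken(sub, reliability) + w * borrow_from_total(sub, total)

rng = np.random.default_rng(4)
reading = rng.binomial(40, 0.55, size=1000).astype(float)     # observed reading subscore
listening = rng.binomial(30, 0.60, size=1000).astype(float)   # observed listening subscore
total = reading + listening
print(augmented(listening, total, reliability=0.75)[:5])
```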
Citations: 4
Exploring a Source of Uneven Score Equity across the Test Score Range
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-01-02 | DOI: 10.1080/15305058.2017.1396463
A. Huggins-Manley, Yuxi Qiu, Randall D. Penfield
Score equity assessment (SEA) refers to an examination of population invariance of equating across two or more subpopulations of test examinees. Previous SEA studies have shown that score equity may be present for examinees scoring at particular test score ranges but absent for examinees scoring at other score ranges. No studies to date have performed research for the purpose of understanding why score equity can be inconsistent across the score range of some tests. The purpose of this study is to explore a source of uneven subpopulation score equity across the score range of a test. It is hypothesized that the difficulty of anchor items displaying differential item functioning (DIF) is directly related to the score location at which issues of score inequity are observed. The simulation study supports the hypothesis that the difficulty of DIF items has a systematic impact on the uneven nature of conditional score equity.
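A minimal sketch of a conditional score-equity check: compare two subpopulations' equating conversions point by point along the raw-score scale and flag locations where they disagree by more than a tolerance. The conversions and the 0.5-point tolerance are illustrative assumptions, not the simulation conditions of the study.

```python
import numpy as np

def score_equity_flags(conversion_a, conversion_b, tolerance=0.5):
    """
    Flag raw-score points where two subgroup equating functions disagree by
    more than `tolerance` scale-score points.
    """
    diff = np.asarray(conversion_a, float) - np.asarray(conversion_b, float)
    return np.flatnonzero(np.abs(diff) > tolerance), diff

# Illustrative equating conversions (raw score 0-40 to scale score) for two subgroups,
# with a localized discrepancy near raw score 30.
raw = np.arange(41)
group_a = 100 + 2.0 * raw
group_b = 100 + 2.0 * raw + 0.9 * np.exp(-0.5 * ((raw - 30) / 4) ** 2)

flagged, diff = score_equity_flags(group_a, group_b)
print("raw scores showing score inequity:", flagged)
```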
Citations: 1
Spurious Latent Class Problem in the Mixed Rasch Model: A Comparison of Three Maximum Likelihood Estimation Methods under Different Ability Distributions
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-01-02 | DOI: 10.1080/15305058.2017.1312408
S. Şen
Recent research has shown that over-extraction of latent classes can be observed in the Bayesian estimation of the mixed Rasch model when the distribution of ability is non-normal. This study examined the effect of non-normal ability distributions on the number of latent classes in the mixed Rasch model when estimated with maximum likelihood estimation methods (conditional, marginal, and joint). Three information criteria fit indices (Akaike information criterion, Bayesian information criterion, and sample size adjusted BIC) were used in a simulation study and an empirical study. Findings of this study showed that the spurious latent class problem was observed with marginal maximum likelihood and joint maximum likelihood estimations. However, conditional maximum likelihood estimation showed no overextraction problem with non-normal ability distributions.
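The three fit indices used to choose the number of latent classes are simple functions of the maximized log-likelihood. A sketch with their standard formulas follows; the log-likelihoods and parameter counts are illustrative numbers, not results from the study.

```python
import numpy as np

def aic(loglik, n_params):
    """Akaike information criterion."""
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n):
    """Bayesian information criterion."""
    return -2 * loglik + n_params * np.log(n)

def sa_bic(loglik, n_params, n):
    """Sample-size adjusted BIC, using the (n + 2) / 24 adjustment."""
    return -2 * loglik + n_params * np.log((n + 2) / 24)

# Illustrative comparison of 1-, 2-, and 3-class mixed Rasch solutions (lower is better).
n = 1000
candidates = {1: (-12950.0, 21), 2: (-12895.0, 43), 3: (-12880.0, 65)}  # classes: (logL, params)
for k, (ll, p) in candidates.items():
    print(f"{k} classes: AIC={aic(ll, p):.1f}  BIC={bic(ll, p, n):.1f}  saBIC={sa_bic(ll, p, n):.1f}")
```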
Citations: 6
The Influence of Rater Effects in Training Sets on the Psychometric Quality of Automated Scoring for Writing Assessments
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-01-02 | DOI: 10.1080/15305058.2017.1361426
Stefanie A. Wind, E. Wolfe, G. Engelhard, P. Foltz, Mark Rosenstein
Automated essay scoring engines (AESEs) are becoming increasingly popular as an efficient method for performance assessments in writing, including many language assessments that are used worldwide. Before they can be used operationally, AESEs must be “trained” using machine-learning techniques that incorporate human ratings. However, the quality of the human ratings used to train the AESEs is rarely examined. As a result, the impact of various rater effects (e.g., severity and centrality) on the quality of AESE-assigned scores is not known. In this study, we use data from a large-scale rater-mediated writing assessment to examine the impact of rater effects on the quality of AESE-assigned scores. Overall, the results suggest that if rater effects are present in the ratings used to train an AESE, the AESE scores may replicate these effects. Implications are discussed in terms of research and practice related to automated scoring.
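A minimal sketch of the mechanism under study: when the human ratings used to train a scoring model are systematically severe, a model trained on them reproduces that severity. The feature construction and linear model are illustrative stand-ins for an operational automated essay scoring engine, not the engine examined in the article.

```python
import numpy as np

rng = np.random.default_rng(5)
n_essays = 2000

true_quality = rng.normal(size=n_essays)                     # latent essay quality
features = np.column_stack([                                  # crude stand-in "engine" features
    true_quality + rng.normal(scale=0.5, size=n_essays),
    true_quality + rng.normal(scale=0.8, size=n_essays),
])

severity = -0.7                                               # severe rater: ratings shifted down
train_ratings = true_quality + severity + rng.normal(scale=0.4, size=n_essays)

# Train a linear "scoring engine" on the severe human ratings.
X = np.column_stack([np.ones(n_essays), features])
beta, *_ = np.linalg.lstsq(X, train_ratings, rcond=None)
engine_scores = X @ beta

# The engine's scores inherit the rater's severity relative to true quality.
print(f"mean(engine score - true quality) = {np.mean(engine_scores - true_quality):+.2f}")
```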
Citations: 11
Differential Distractor Functioning as a Method for Explaining DIF: The Case of a National Admissions Test in Saudi Arabia
IF 1.7 | Q2 SOCIAL SCIENCES, INTERDISCIPLINARY | Pub Date: 2018-01-02 | DOI: 10.1080/15305058.2017.1345914
I. Tsaousis, G. Sideridis, Fahad Al-Saawi
The aim of the present study was to examine Differential Distractor Functioning (DDF) as a means of improving the quality of a measure through understanding biased responses across groups. A DDF analysis could shed light on potential sources of construct-irrelevant variance by examining whether particular incorrect choices (distractors) attract different groups in different ways. To examine possible DDF effects, a method introduced by Penfield (2008, 2010a), based on odds ratio estimators, was utilized. Items from the Chemistry sub-scale of the Standard Achievement Admission Test (SAAT) in Saudi Arabia were used as an example. Statistical evidence for differential item functioning (DIF) existed for five items, at either moderate or strong levels. In particular, three items (i.e., items 45, 54, and 61) reached category B levels (i.e., moderate DIF), and two items (items 51 and 60) reached category C levels (strong DIF) based on Educational Testing Service guidelines. These items were then examined more closely for DDF in an attempt to potentially understand the causes of DIF and group-biased responses. The manuscript concludes with a series of remedial actions, based on distractor-relevant information, with the goal of improving the psychometric properties of an instrument under study.
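A minimal sketch of the odds-ratio logic behind a differential distractor functioning check: among examinees who miss an item, compare the odds of choosing a given distractor in the reference and focal groups. The counts are illustrative, and this is a simplified version rather than Penfield's full estimator.

```python
import numpy as np

def distractor_log_odds_ratio(ref_choice, ref_other, foc_choice, foc_other, eps=0.5):
    """
    Log odds ratio (with a 0.5 continuity correction) for selecting a specific
    distractor versus any other incorrect option, reference vs. focal group.
    """
    or_hat = ((ref_choice + eps) * (foc_other + eps)) / ((ref_other + eps) * (foc_choice + eps))
    return float(np.log(or_hat))

# Illustrative counts for one item: among incorrect responders, distractor C is chosen
# 60 of 140 times in the reference group but 95 of 130 times in the focal group.
log_or = distractor_log_odds_ratio(ref_choice=60, ref_other=80, foc_choice=95, foc_other=35)
print(f"log odds ratio for distractor C = {log_or:+.2f}")   # values far from 0 suggest DDF
```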
Citations: 9