
Latest publications in International Journal of Testing

Stopping Rules for Computer Adaptive Testing When Item Banks Have Nonuniform Information
IF 1.7 Q1 Social Sciences Pub Date : 2020-04-02 DOI: 10.1080/15305058.2019.1635604
S. Morris, Mike Bass, Elizabeth Howard, R. Neapolitan
The standard error (SE) stopping rule, which terminates a computer adaptive test (CAT) when the SE is less than a threshold, is effective when there are informative questions for all trait levels. However, in domains such as patient-reported outcomes, the items in a bank might all target one end of the trait continuum (e.g., negative symptoms), and the bank may lack depth for many individuals. In such cases, the predicted standard error reduction (PSER) stopping rule will stop the CAT even if the SE threshold has not been reached and can avoid administering excessive questions that provide little additional information. By tuning the parameters of the PSER algorithm, a practitioner can specify a desired tradeoff between accuracy and efficiency. Using simulated data for the Patient-Reported Outcomes Measurement Information System Anxiety and Physical Function banks, we demonstrate that these parameters can substantially impact CAT performance. When the parameters were optimally tuned, the PSER stopping rule was found to outperform the SE stopping rule overall, particularly for individuals not targeted by the bank, and presented roughly the same number of items across the trait continuum. Therefore, the PSER stopping rule provides an effective method for balancing the precision and efficiency of a CAT.
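The SE stopping rule described in the abstract can be sketched in a few lines. This is an illustrative toy under a 2PL model, not the authors' implementation; the item parameters and the SE threshold below are invented for the demo.

```python
import math

def item_info(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def should_stop(theta, administered, se_threshold):
    """SE stopping rule: stop once 1/sqrt(total information) < threshold."""
    total = sum(item_info(theta, a, b) for a, b in administered)
    return total > 0.0 and 1.0 / math.sqrt(total) < se_threshold

# Five items targeted near theta = 0 (hypothetical a, b parameters)
items = [(1.5, 0.0), (1.4, -0.2), (1.6, 0.1), (1.3, 0.3), (1.5, -0.1)]

print(should_stop(0.0, items, 0.65))  # True: the bank is informative here
# For an examinee the bank does not target, the SE never clears the
# threshold -- the situation where a rule like PSER stops the test early.
print(should_stop(3.0, items, 0.65))  # False
```

The rule by itself never terminates when the bank lacks depth at the examinee's trait level, which is the gap the PSER rule addresses.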
Citations: 5
True or False? Keying Direction and Acquiescence Influence the Validity of Socio-Emotional Skills Items in Predicting High School Achievement
IF 1.7 Q1 Social Sciences Pub Date : 2020-04-02 DOI: 10.1080/15305058.2019.1673398
Ricardo Primi, Filip De Fruyt, Daniel Santos, Stephen Antonoplis, O. John
What type of items, keyed positively or negatively, makes social-emotional skill or personality scales more valid? The present study examines the criterion validities of true- and false-keyed items, before and after correction for acquiescence. The sample included 12,987 children and adolescents from 425 schools of the State of São Paulo, Brazil (ages 11–18, attending grades 6–12). They answered a computerized 162-item questionnaire measuring 18 facets grouped into five broad domains of social-emotional skills: Open-mindedness (O), Conscientious Self-Management (C), Engaging with others (E), Amity (A), and Negative-Emotion Regulation (N). All facet scales were fully balanced (3 true-keyed and 3 false-keyed items per facet). Criterion validity coefficients of scales composed of only true-keyed items were compared with those of scales composed of only false-keyed items. The criterion measure was a standardized achievement test of language and math ability. We found that the coefficients were almost twice as large for scales of false-keyed items as for scales of true-keyed items. After correcting for acquiescence, the coefficients became more similar. Acquiescence suppresses the criterion validity of unbalanced scales composed of true-keyed items. We conclude that balanced scales with pairs of true- and false-keyed items make a better scale in terms of internal structural and predictive validity.
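One way to see the correction at work: on a fully balanced scale, content cancels out across the true- and false-keyed halves, so a respondent's mean raw response estimates their acquiescence, which can then be removed before scoring. This is a simplified sketch of that logic, not the study's exact procedure; the responses and the 1–5 scale are hypothetical.

```python
def acquiescence_index(raw, keys):
    """Mean raw response across a fully balanced item set.

    With equal numbers of true- and false-keyed items, content cancels
    and the mean raw response estimates acquiescence. `raw` holds
    responses on a 1-5 Likert scale; `keys` holds +1 for true-keyed
    and -1 for false-keyed items.
    """
    assert sum(keys) == 0, "scale must be balanced"
    return sum(raw) / len(raw)

def corrected_scale_score(raw, keys, midpoint=3.0):
    """Score the scale after removing the acquiescence component."""
    acq = acquiescence_index(raw, keys)
    # recenter each response, then reverse the false-keyed items
    centered = [(r - acq) * k for r, k in zip(raw, keys)]
    return sum(centered) / len(centered) + midpoint

# A yea-sayer who agrees with nearly everything, inflating raw
# agreement on both true- and false-keyed items:
raw = [5, 4, 5, 4, 5, 4]          # responses to 6 items, 1-5 scale
keys = [+1, +1, +1, -1, -1, -1]   # 3 true-keyed, 3 false-keyed
print(acquiescence_index(raw, keys))     # 4.5 -> strong acquiescence
print(corrected_scale_score(raw, keys))  # near the scale midpoint
```

On an unbalanced, all-true-keyed scale, no such within-person cancellation is available, which is why acquiescence can suppress its validity.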
Citations: 8
The Recovery of Correlation Between Latent Abilities Using Compensatory and Noncompensatory Multidimensional IRT Models
IF 1.7 Q1 Social Sciences Pub Date : 2020-04-02 DOI: 10.1080/15305058.2019.1692212
Yanyan Fu, Tyler Strachan, E. Ip, John T. Willse, Shyh-Huei Chen, Terry A. Ackerman
This research examined correlation estimates between latent abilities when using the two-dimensional and three-dimensional compensatory and noncompensatory item response theory models. Simulation study results showed that the recovery of the latent correlation was best when the test contained 100% of simple structure items for all models and conditions. When a test measured weakly discriminated dimensions, it became harder to recover the latent correlation. Results also showed that increasing the sample size, test length, or using simpler models (i.e., two-parameter logistic rather than three-parameter logistic, compensatory rather than noncompensatory) could improve the recovery of latent correlation.
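The contrast between the two model families can be made concrete. In a compensatory model the weighted abilities are summed inside a single logistic, so strength on one dimension can offset weakness on another; in a noncompensatory model, per-dimension success probabilities multiply, so a deficit on any dimension caps the overall probability. A minimal sketch with 2PL-style terms and hypothetical parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_compensatory(thetas, a, d):
    """Compensatory M2PL: weighted abilities add, so a high theta on
    one dimension can offset a low theta on another."""
    return sigmoid(sum(ai * ti for ai, ti in zip(a, thetas)) + d)

def p_noncompensatory(thetas, a, b):
    """Noncompensatory model: per-dimension probabilities multiply,
    so weakness on any one dimension caps overall success."""
    p = 1.0
    for ai, ti, bi in zip(a, thetas, b):
        p *= sigmoid(ai * (ti - bi))
    return p

# An examinee high on dimension 1 but low on dimension 2:
thetas = (2.0, -2.0)
print(p_compensatory(thetas, a=(1.0, 1.0), d=0.0))            # 0.5: offsets
print(p_noncompensatory(thetas, a=(1.0, 1.0), b=(0.0, 0.0)))  # low: no offset
```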
Citations: 1
Effect of Quality Characteristics of Peer Raters on Rating Errors in Peer Assessment
IF 1.7 Q1 Social Sciences Pub Date : 2020-02-12 DOI: 10.1080/15305058.2020.1720216
Xiuyan Guo, Pui‐wa Lei
Little research has been done on the effects of peer raters’ quality characteristics on peer rating qualities. This study aims to address this gap and investigate the effects of key variables related to peer raters’ qualities, including content knowledge, previous rating experience, training on rating tasks, and rating motivation. In an experiment where training and motivation interventions were manipulated, 24 classes with 838 high school students were randomly assigned to study conditions. Inter-rater error, intra-rater error and criterion error indices for peer ratings on four selected essays were analyzed using hierarchical linear models. Results indicated that peer raters’ content knowledge, previous rating experience, and rating motivation were associated with rating errors. This study also found some significant interactions between peer raters’ quality characteristics. Implications for in-person and online peer assessments as well as future directions are discussed.
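As a toy illustration of one of the rating-quality indices, a criterion error can be computed as a peer rater's mean absolute deviation from expert (criterion) scores on the same essays. The study itself analyzes such indices with hierarchical linear models; the scores below are invented, and this simple index is only one plausible operationalization.

```python
def criterion_error(peer_scores, expert_scores):
    """Mean absolute deviation of a peer rater's scores from expert
    (criterion) scores on the same essays."""
    assert len(peer_scores) == len(expert_scores)
    pairs = zip(peer_scores, expert_scores)
    return sum(abs(p - e) for p, e in pairs) / len(peer_scores)

peer = [3, 4, 2, 5]     # one peer rater's scores on four essays
expert = [4, 4, 3, 4]   # expert criterion scores on the same essays
print(criterion_error(peer, expert))  # 0.75
```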
Citations: 3
The Relationship between Response-Time Effort and Accuracy in PISA Science Multiple Choice Items
IF 1.7 Q1 Social Sciences Pub Date : 2020-01-10 DOI: 10.1080/15305058.2019.1706529
M. Michaelides, M. Ivanova, C. Nicolaou
The study examined the relationship between examinees’ test-taking effort and their accuracy rate on items from the PISA 2015 assessment. The 10% normative threshold method was applied on Science multiple-choice items in the Cyprus sample to detect rapid guessing behavior. Results showed that the extent of rapid guessing across simple and complex multiple-choice items was on average less than 6% per item. Rapid guessers were identified, and for most items their accuracy was lower than the accuracy for students engaging in solution-based behavior. Examinees with higher overall performance on the test items tended to engage in less rapid guessing than their lower performing peers. Overall, this empirical investigation presents original evidence on test-taking effort as measured by response time in PISA items and tests propositions of Wise’s (2017) Test-Taking Theory.
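The 10% normative threshold method can be sketched directly: each item's rapid-guessing threshold is set at 10% of its average response time (commonly capped, e.g., at 10 seconds), and any faster response is flagged as a rapid guess. The item labels and response times below are hypothetical.

```python
def nt10_thresholds(times_by_item, cap=10.0):
    """10% normative threshold: flag responses faster than 10% of the
    item's mean response time, capped (here at 10 seconds)."""
    return {
        item: min(cap, 0.10 * sum(times) / len(times))
        for item, times in times_by_item.items()
    }

def flag_rapid_guesses(responses, thresholds):
    """Return the (item, time) pairs classified as rapid guesses."""
    return [(item, t) for item, t in responses if t < thresholds[item]]

# Hypothetical response times in seconds for two science items:
times = {"S1": [40, 55, 62, 3, 48], "S2": [90, 120, 4, 110, 95]}
th = nt10_thresholds(times)

one_examinee = [("S1", 3.0), ("S2", 50.0)]
print(flag_rapid_guesses(one_examinee, th))  # only the 3-second response
```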
Citations: 12
Log Data Analysis with ANFIS: A Fuzzy Neural Network Approach
IF 1.7 Q1 Social Sciences Pub Date : 2020-01-02 DOI: 10.1080/15305058.2018.1551225
Ying Cui, Qi Guo, Jacqueline P. Leighton, Man-Wai Chu
This study explores the use of the Adaptive Neuro-Fuzzy Inference System (ANFIS), a neuro-fuzzy approach, to analyze the log data of technology-based assessments, extract relevant features of student problem-solving processes, and develop and refine a set of fuzzy logic rules that could be used to interpret student performance. Log data recording student response processes while solving a science simulation task were analyzed with ANFIS. Results indicate that the ANFIS analysis could generate and refine a set of fuzzy rules that shed light on how students solve the simulation task. We conclude the article by discussing the advantages of combining human judgments with the learning capacity of ANFIS for log data analysis, and by outlining the limitations of the current study and areas for future research.
Citations: 4
Engineering a Twenty-First Century Reading Comprehension Assessment System Utilizing Scenario-Based Assessment Techniques
IF 1.7 Q1 Social Sciences Pub Date : 2020-01-02 DOI: 10.1080/15305058.2018.1551224
J. Sabatini, T. O’Reilly, Jonathan P. Weeks, Zuowei Wang
The construct of reading comprehension has changed significantly in the twenty-first century; however, some test designs have not evolved sufficiently to capture these changes. Specifically, the nature of literacy sources and skills required has changed (wrought primarily by widespread use of digital technologies). Modern theories of comprehension and discourse processes have been developed to accommodate these changes, and the learning sciences have followed suit. These influences have significant implications for how we think about the development of comprehension proficiency across grades. In this paper, we describe a theoretically driven, developmentally sensitive assessment system based on a scenario-based assessment paradigm, and present evidence for its feasibility and psychometric soundness.
Citations: 21
The (Non)Impact of Differential Test Taker Engagement on Aggregated Scores
IF 1.7 Q1 Social Sciences Pub Date : 2020-01-02 DOI: 10.1080/15305058.2019.1605999
S. Wise, J. Soland, Y. Bo
Disengaged test taking tends to be most prevalent with low-stakes tests. This has led to questions about the validity of aggregated scores from large-scale international assessments such as PISA and TIMSS, as previous research has found a meaningful correlation between the mean engagement and mean performance of countries. The current study, using data from the computer-based version of the PISA-Based Test for Schools, examined the distortive effects of differential engagement on aggregated school-level scores. The results showed that, although there was considerable differential engagement among schools, the school means were highly stable due to two factors. First, any distortive effects of disengagement in a school were diluted by a high proportion of the students exhibiting no non-effortful behavior. Second, and most interestingly, disengagement produced both positive and negative distortion of individual student scores, which tended to cancel out much of the net distortive effect on the school’s mean.
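The canceling effect described in the second point can be illustrated with a stylized simulation (not the paper's data or design): if disengagement perturbs a minority of individual scores both upward and downward, the school mean barely moves. All numbers below are invented.

```python
import random

random.seed(7)

# True scores for one school of 200 students
true_scores = [random.gauss(500, 50) for _ in range(200)]

# 15% of students are "disengaged"; their observed scores are perturbed
# in either direction (rapid guessing usually hurts, but lucky guesses
# can also inflate a score).
observed = [
    t + (random.gauss(0, 30) if random.random() < 0.15 else 0.0)
    for t in true_scores
]

true_mean = sum(true_scores) / len(true_scores)
obs_mean = sum(observed) / len(observed)
print(round(abs(true_mean - obs_mean), 2))  # small: distortions largely cancel
```

Because only a minority of scores are perturbed and the perturbations pull in both directions, the aggregate shift is far smaller than any individual distortion.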
Citations: 26
ITC Guidelines for the Large-Scale Assessment of Linguistically and Culturally Diverse Populations
IF 1.7 Q1 Social Sciences Pub Date : 2019-10-02 DOI: 10.1080/15305058.2019.1631024
M. Oliveri
These guidelines describe considerations relevant to the assessment of test takers within or across countries or regions that are linguistically or culturally diverse. The guidelines were developed by a committee of experts to inform test developers, psychometricians, test users, and test administrators about fairness issues in support of the fair and valid assessment of linguistically or culturally diverse populations. They are meant to apply to most, if not all, aspects of the development, administration, scoring, and use of assessments, and are intended to supplement other existing professional standards or guidelines for testing and assessment. Specifically, these guidelines focus on the types of adaptations and considerations to use when developing, reviewing, and interpreting items and test scores from tests administered to linguistically or culturally diverse populations. Other guidelines, such as the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) or the Guidelines for Best Practice in Cross-Cultural Surveys (Survey Research Center, 2016), may also be relevant to testing linguistically and culturally diverse populations.
{"title":"ITC Guidelines for the Large-Scale Assessment of Linguistically and Culturally Diverse Populations","authors":"M. Oliveri","doi":"10.1080/15305058.2019.1631024","DOIUrl":"https://doi.org/10.1080/15305058.2019.1631024","url":null,"abstract":"These guidelines describe considerations relevant to the assessment of test takers in or across countries or regions that are linguistically or culturally diverse. The guidelines were developed by a committee of experts to help inform test developers, psychometricians, test users, and test administrators about fairness issues in support of the fair and valid assessment of linguistically or culturally diverse populations. They are meant to apply to most, if not all, aspects of the development, administration, scoring, and use of assessments; and are intended to supplement other existing professional standards or guidelines for testing and assessment. That is, these guidelines focus on the types of adaptations and considerations to use when developing, reviewing, and interpreting items and test scores from tests administered to culturally and linguistically or culturally diverse populations. 
Other guidelines such as the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) or Guidelines for Best Practice in Cross-Cultural Surveys (Survey Research Center, 2016) may also be relevant to testing linguistically and culturally diverse populations.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2019-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1631024","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49265430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30