
International Journal of Testing: Latest Publications

Using Evidence-Centered Design to Support the Development of Culturally and Linguistically Sensitive Collaborative Problem-Solving Assessments
IF 1.7 Q1 Social Sciences Pub Date: 2019-01-29 DOI: 10.1080/15305058.2018.1543308
M. Oliveri, René Lawless, R. Mislevy
Collaborative problem solving (CPS) ranks among the top five most critical skills necessary for college graduates to meet workforce demands (Hart Research Associates, 2015). It is also deemed a critical skill for educational success (Beaver, 2013). It thus deserves more prominence in the suite of courses and subjects assessed in K-16. Such inclusion, however, requires improvements in the conceptualization, design, and analysis of CPS, challenging us to think about assessing these skills differently from the current focus on assessing individuals' substantive knowledge. In this article, we discuss an Evidence-Centered Design approach to assessing CPS in a culturally and linguistically diverse educational environment. We demonstrate how a sociocognitive perspective can be used to conceptualize and model possible linguistic and/or cultural differences between populations at key stages of assessment development, including assessment conceptualization and design, to help reduce construct-irrelevant differences when assessing complex constructs with diverse populations.
Citations: 15
Assessment of University Students’ Critical Thinking: Next Generation Performance Assessment
IF 1.7 Q1 Social Sciences Pub Date: 2019-01-24 DOI: 10.1080/15305058.2018.1543309
R. Shavelson, O. Zlatkin‐Troitschanskaia, K. Beck, Susanne Schmidt, Julián P. Mariño
Following employers’ criticisms and recent societal developments, policymakers and educators have called for students to develop a range of generic skills such as critical thinking (“twenty-first century skills”). So far, such skills have typically been assessed by student self-reports or with multiple-choice tests. An alternative approach is criterion-sampling measurement, which leads to performance assessments built from “criterion” tasks drawn from the real-world situations in which students are being educated, both within and across academic or professional domains. One current project, iPAL (the international Performance Assessment of Learning), consolidates previous research and focuses on next-generation performance assessments. In this paper, we present iPAL’s assessment framework and show how it guides the development of such performance assessments, exemplify these assessments with a concrete task, and provide preliminary evidence of reliability and validity, which allows us to draw initial implications for further test design and development.
Citations: 65
An Examination of Different Methods of Setting Cutoff Values in Person Fit Research
IF 1.7 Q1 Social Sciences Pub Date: 2019-01-02 DOI: 10.1080/15305058.2018.1464010
A. Mousavi, Ying Cui, Todd Rogers
This simulation study evaluates four different methods of setting cutoff values for person fit assessment: (a) using fixed cutoff values, either drawn from theoretical distributions of person fit statistics or chosen arbitrarily by researchers in the literature; (b) using a specific percentile rank of the empirical sampling distribution of person fit statistics computed from simulated fitting responses; (c) using the bootstrap to estimate cutoff values from the empirical sampling distribution of person fit statistics computed from simulated fitting responses; and (d) using p-value methods to identify misfitting responses conditional on ability level. Snijders’ statistic (2001) was chosen as an index with a known theoretical distribution, and van der Flier’s U3 (1982) and Sijtsma’s HT coefficient (1986) as indices with unknown theoretical distributions. According to the simulation results, different methods of setting cutoff values tend to produce different Type I error and detection rates, indicating that it is critical to select an appropriate method for setting cutoff values in person fit research.
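To make methods (b) and (c) concrete, here is a minimal Python sketch under simplifying assumptions: fitting responses are simulated from a Rasch model with assumed item difficulties, the standardized log-likelihood person-fit statistic lz is computed with the generating abilities (a real application would use estimated abilities, and possibly other statistics such as U3), and a cutoff is taken from the empirical sampling distribution either directly as a percentile or via the bootstrap.

```python
import numpy as np

rng = np.random.default_rng(7)
b = rng.normal(0, 1, 30)                     # assumed item difficulties

def simulate(theta, b, rng):
    """Model-fitting responses under a Rasch model."""
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int), p

def lz(resp, p):
    """Standardized log-likelihood person-fit statistic."""
    ll = np.sum(resp * np.log(p) + (1 - resp) * np.log(1 - p), axis=1)
    mean = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p), axis=1)
    var = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2, axis=1)
    return (ll - mean) / np.sqrt(var)

theta = rng.normal(0, 1, 5000)               # fitting examinees
resp, p = simulate(theta, b, rng)
stats = lz(resp, p)                          # low values signal misfit

# Method (b): cutoff at a chosen percentile of the empirical distribution.
cut_b = np.percentile(stats, 5)

# Method (c): bootstrap that percentile to stabilize the cutoff estimate.
cut_c = np.mean([np.percentile(rng.choice(stats, stats.size), 5)
                 for _ in range(1000)])
print(f"percentile cutoff {cut_b:.3f}, bootstrap cutoff {cut_c:.3f}")
```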
Citations: 4
A Comparison of the Relative Performance of Four IRT Models on Equating Passage-Based Tests
IF 1.7 Q1 Social Sciences Pub Date: 2018-12-13 DOI: 10.1080/15305058.2018.1530239
Kyung Yong Kim, Euijin Lim, Won‐Chan Lee
For passage-based tests, items that belong to a common passage often violate the local independence assumption of unidimensional item response theory (UIRT). In this case, ignoring local item dependence (LID) and estimating item parameters using a UIRT model can be problematic: doing so may yield inaccurate parameter estimates, which, in turn, can affect the results of equating. Under the random groups design, the main purpose of this article was to compare the relative performance of the three-parameter logistic (3PL), graded response (GR), bifactor, and testlet models in equating passage-based tests when various degrees of passage-induced LID were present. Simulation results showed that the testlet model produced the most accurate equating results, followed by the bifactor model. The 3PL model worked as well as the bifactor and testlet models when the degree of LID was low but returned less accurate equating results than the two multidimensional models as the degree of LID increased. Among the four models, the polytomous GR model provided the least accurate equating results.
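As a concrete illustration of where passage-induced LID comes from, the sketch below (Python, with illustrative parameter values only, not the article's design) generates responses under a testlet model: a person-by-passage effect gamma enters every item of a passage, and its standard deviation controls the degree of LID that a unidimensional model would ignore.

```python
import numpy as np

rng = np.random.default_rng(11)
n_persons, n_testlets, items_per = 2000, 5, 6
n_items = n_testlets * items_per
testlet_of = np.repeat(np.arange(n_testlets), items_per)  # item -> passage map

a = rng.lognormal(0, 0.3, n_items)        # discriminations (assumed)
b = rng.normal(0, 1, n_items)             # difficulties (assumed)
c = np.full(n_items, 0.2)                 # 3PL-style guessing floor (assumed)
theta = rng.normal(0, 1, n_persons)
lid_sd = 0.8                              # larger -> stronger within-passage LID
gamma = rng.normal(0, lid_sd, (n_persons, n_testlets))

# Testlet model: the passage effect shifts every item in that passage.
logit = a * (theta[:, None] - b[None, :] - gamma[:, testlet_of])
p = c + (1 - c) / (1 + np.exp(-logit))
resp = (rng.random(p.shape) < p).astype(int)
print(resp.shape, resp.mean().round(3))
```

Fitting a unidimensional 3PL to such data forces the shared within-passage variation into theta, which is the mechanism behind the equating inaccuracies the study reports at higher LID levels.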
Citations: 3
Test Instructions Do Not Moderate the Indirect Effect of Perceived Test Importance on Test Performance in Low-Stakes Testing Contexts
IF 1.7 Q1 Social Sciences Pub Date: 2018-10-02 DOI: 10.1080/15305058.2017.1396466
S. Finney, Aaron J. Myers, C. Mathers
Assessment specialists expend a great deal of energy to promote valid inferences from test scores gathered in low-stakes testing contexts. Given the indirect effect of perceived test importance on test performance via examinee effort, assessment practitioners have manipulated test instructions with the goal of increasing perceived test importance. Importantly, no studies have investigated the impact of test instructions on this indirect effect. In the current study, students were randomly assigned to one of three test instruction conditions intended to increase test relevance while keeping the test low-stakes to examinees. Test instructions did not impact average perceived test importance, examinee effort, or test performance. Furthermore, the indirect relationship between importance and performance via effort was not moderated by instructions. Thus, the effect of perceived test importance on test scores via expended effort appears consistent across different messages regarding the personal relevance of the test to examinees. The main implication for testing practice is that the effect of instructions may be negligible when reflective of authentic low-stakes test score use. Future studies should focus on uncovering instructions that increase the value of performance to the examinee yet remain truthful regarding score use.
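As a hedged sketch of the kind of analysis described (not the authors' actual analysis), the fragment below estimates the indirect effect of perceived importance on performance via effort with two OLS regressions and a percentile bootstrap; the simulated data and effect sizes are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
importance = rng.normal(0, 1, n)                  # perceived importance
effort = 0.5 * importance + rng.normal(0, 1, n)   # mediator
performance = 0.4 * effort + rng.normal(0, 1, n)  # outcome

def ols(y, X):
    """Return intercept and slopes for y ~ X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def indirect(imp, eff, perf):
    a = ols(eff, imp[:, None])[1]                  # effort ~ importance
    b = ols(perf, np.column_stack([eff, imp]))[1]  # performance ~ effort + importance
    return a * b

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    boot.append(indirect(importance[idx], effort[idx], performance[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect a*b, 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

Running this within each instruction condition and bootstrapping the difference between the condition-specific estimates is one way to probe whether instructions moderate the indirect effect.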
Citations: 14
Investigating the Reliability of the Sentence Verification Technique
IF 1.7 Q1 Social Sciences Pub Date: 2018-09-20 DOI: 10.1080/15305058.2018.1497636
Amanda M Marcotte, Francis Rick, C. Wells
Reading comprehension plays an important role in achievement across all academic domains. The purpose of this study is to describe the sentence verification technique (SVT; Royer, Hastings, & Hook, 1979) as an alternative method of assessing reading comprehension that can be used with a variety of texts and across diverse populations and educational contexts. Additionally, this study makes a unique contribution to the extant literature on the SVT by investigating the precision of the instrument across proficiency levels. Data were gathered from a sample of 464 fourth-grade students from the Northeast region of the United States. Reliability was estimated using one-, two-, three-, and four-passage test forms. Two or three passages provided sufficient reliability. The conditional reliability analyses revealed that SVT test scores were reliable for readers with average to below-average proficiency, but did not provide reliable information for students who were very poor or very strong readers.
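A small sketch of the length-reliability tradeoff the study examines: the Spearman-Brown prophecy formula projects the reliability of a form lengthened by a factor k from a single-passage estimate. The single-passage value used below is a hypothetical figure, not one reported in the study.

```python
def spearman_brown(rho_1: float, k: int) -> float:
    """Projected reliability of a test lengthened by factor k."""
    return k * rho_1 / (1 + (k - 1) * rho_1)

rho_single = 0.55  # hypothetical single-passage SVT reliability
for k in range(1, 5):
    print(f"{k} passage(s): projected reliability = {spearman_brown(rho_single, k):.2f}")
```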
Citations: 2
Item Parameter Drift in Context Questionnaires from International Large-Scale Assessments
IF 1.7 Q1 Social Sciences Pub Date: 2018-09-14 DOI: 10.1080/15305058.2018.1481852
HyeSun Lee, K. Geisinger
The purpose of the current study was to examine the impact of item parameter drift (IPD) occurring in context questionnaires from an international large-scale assessment and to determine the most appropriate way to address IPD. Focusing on psychometric and educational research contexts where scores from context questionnaires composed of polytomous items are employed to classify examinees, the study investigated the impact of IPD on the estimation of questionnaire scores and on classification accuracy with five manipulated factors: the length of the questionnaire, the proportion of items exhibiting IPD, the direction of IPD, the magnitude of IPD, and three decisions about how to handle IPD. The results indicated that IPD occurring in a short context questionnaire substantially affected the accuracy of score estimation and of examinee classification. Classification accuracy decreased considerably, especially at the lowest and highest categories of a trait. Unlike recommendations in the educational testing literature, the current study demonstrated that keeping items exhibiting IPD, and removing them only for the transformation, was appropriate when IPD occurred in relatively short context questionnaires. Using 2011 TIMSS data from Iran, an applied example demonstrated how the provided guidance supports appropriate decisions about IPD.
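As a hedged sketch of the kind of manipulation described, the fragment below injects uniform drift into the step difficulties of two items on a short polytomous (GPCM-type) questionnaire and measures how agreement in three-category sum-score classification degrades. All parameter values, the drift size, and the cut points are illustrative assumptions rather than the study's conditions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, J, K = 2000, 6, 4                          # persons, items, categories
theta = rng.normal(0, 1, n)
a = rng.lognormal(0, 0.2, J)                  # discriminations (assumed)
steps = np.sort(rng.normal(0, 1, (J, K - 1)), axis=1)
u = rng.random((n, J, 1))                     # common draws for both conditions

def gpcm_sample(theta, a, steps, u):
    """Sample GPCM responses by inverse-CDF using the supplied uniforms."""
    cum = np.cumsum(a[None, :, None] * (theta[:, None, None] - steps[None, :, :]),
                    axis=2)
    logits = np.concatenate([np.zeros((len(theta), steps.shape[0], 1)), cum], axis=2)
    p = np.exp(logits - logits.max(axis=2, keepdims=True))
    p /= p.sum(axis=2, keepdims=True)
    return (p.cumsum(axis=2) < u).sum(axis=2)

clean = gpcm_sample(theta, a, steps, u)
drifted_steps = steps.copy()
drifted_steps[:2] += 0.5                      # 2 of 6 items drift 0.5 logits harder
drifted = gpcm_sample(theta, a, drifted_steps, u)

cuts = [6, 12]                                # assumed sum-score cut points
classify = lambda scores: np.digitize(scores.sum(axis=1), cuts)
agreement = (classify(clean) == classify(drifted)).mean()
print(f"classification agreement under drift: {agreement:.3f}")
```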
Citations: 2
Investigating the Comparability of Examination Difficulty Using Comparative Judgement and Rasch Modelling
IF 1.7 Q1 Social Sciences Pub Date: 2018-09-14 DOI: 10.1080/15305058.2018.1486316
Stephen D. Holmes, M. Meadows, I. Stockford, Qingping He
The relationship between the expected and actual difficulty of items on six mathematics question papers designed for 16-year-olds in England was investigated through paired comparison by experts and testing with students. A variant of the Rasch model was applied to the comparison data to establish a scale of expected difficulty. In testing, the papers were taken by 2933 students using an equivalent-groups design, allowing the actual difficulty of the items to be placed on the same measurement scale. The expected difficulty derived using the comparative judgement approach and the actual difficulty derived from the test data were reasonably strongly correlated. This suggests that comparative judgement may be an effective way to investigate the comparability of difficulty of examinations. The approach could potentially serve as a proxy for pretesting high-stakes tests in situations where pretesting is not feasible for security or other reasons.
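A minimal sketch of the scaling step, under assumed data: expert judgements of "which item is harder" can be fit with a Bradley-Terry model, the pairwise Rasch-family model commonly used in comparative judgement, here by plain gradient ascent on the log-likelihood. The simulated judgements and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(13)
n_items, n_judgements = 40, 2000
true_d = rng.normal(0, 1, n_items)        # latent expected difficulty (assumed)

# Random pairings; win = 1 when item i is judged harder than item j.
i_idx = rng.integers(0, n_items, n_judgements)
j_idx = rng.integers(0, n_items, n_judgements)
keep = i_idx != j_idx
i_idx, j_idx = i_idx[keep], j_idx[keep]
p_true = 1 / (1 + np.exp(true_d[j_idx] - true_d[i_idx]))
win = (rng.random(i_idx.size) < p_true).astype(float)

d = np.zeros(n_items)
for _ in range(1500):                     # gradient ascent on the log-likelihood
    p = 1 / (1 + np.exp(d[j_idx] - d[i_idx]))
    grad = np.zeros(n_items)
    np.add.at(grad, i_idx, win - p)
    np.add.at(grad, j_idx, p - win)
    d += 0.02 * grad
    d -= d.mean()                         # anchor the scale at mean zero

print("correlation with generating scale:", round(np.corrcoef(d, true_d)[0, 1], 3))
```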
Citations: 3
Analyzing Job Analysis Data Using Mixture Rasch Models
IF 1.7 Q1 Social Sciences Pub Date: 2018-09-14 DOI: 10.1080/15305058.2018.1481853
Adam E. Wyse
An important piece of validity evidence to support the use of credentialing exams comes from performing a job analysis of the profession. One common job analysis method is the task inventory method, where people working in the field are surveyed using rating scales about the tasks thought necessary to safely and competently perform the job. This article describes how mixture Rasch models can be used to analyze these data, and how results from these analyses can help to identify whether different groups of people may be responding to job tasks differently. Three examples from different credentialing programs illustrate scenarios that can be found when applying mixture Rasch models to job analysis data. Discussion of what these results may imply for the development of credentialing exams and other analyses of job analysis data is provided.
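A hedged sketch of the central computation in such an analysis: the posterior probability that a respondent belongs to each latent class in a two-class mixture Rasch model, obtained by marginalizing the Rasch likelihood over a quadrature grid on theta. The class-specific difficulties, class proportions, and dichotomized task data are assumptions for illustration; a real analysis would estimate them with EM or MCMC, and job-analysis ratings would typically be polytomous.

```python
import numpy as np

rng = np.random.default_rng(17)
n_items, n_persons = 20, 500
b_class = np.stack([rng.normal(0, 1, n_items),
                    rng.normal(0, 1, n_items)])   # assumed class difficulties
pi = np.array([0.6, 0.4])                         # assumed class proportions

# Fake task data, dichotomized as "performed / not performed".
resp = rng.integers(0, 2, (n_persons, n_items))

nodes = np.linspace(-4, 4, 41)                    # quadrature grid on theta
weights = np.exp(-0.5 * nodes**2)
weights /= weights.sum()

def marginal_lik(resp, b):
    """Per-person likelihood, theta integrated out under N(0, 1)."""
    p = 1 / (1 + np.exp(-(nodes[:, None] - b[None, :])))    # (Q, J)
    ll = resp @ np.log(p).T + (1 - resp) @ np.log(1 - p).T  # (N, Q)
    return np.exp(ll) @ weights                             # (N,)

lik = np.stack([marginal_lik(resp, b_class[c]) for c in range(2)], axis=1)
post = pi * lik
post /= post.sum(axis=1, keepdims=True)           # E-step responsibilities
print(post[:5].round(3))
```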
Citations: 5
A Polytomous Model of Cognitive Diagnostic Assessment for Graded Data
IF 1.7 Q1 Social Sciences Pub Date: 2018-07-03 DOI: 10.1080/15305058.2017.1396465
Dongbo Tu, Chanjin Zheng, Yan Cai, Xuliang Gao, Daxun Wang
Pursuing the line of difference models in IRT (Thissen & Steinberg, 1986), this article proposes a new cognitive diagnostic model for graded/polytomous data based on the deterministic-input, noisy-"and"-gate (DINA) model (Haertel, 1989; Junker & Sijtsma, 2001), named the DINA model for graded data (DINA-GD). We investigated the performance of full Bayesian estimation of the proposed model. In the simulation, the attribute classification accuracy and item parameter recovery of the DINA-GD model were examined. The results indicated that the proposed model yielded acceptable correct attribute classification rates for examinees and good item parameter recovery. In addition, a real-data example illustrates the application of the new model to graded data, that is, polytomously scored items.
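For readers unfamiliar with the base model, here is a minimal sketch of the deterministic-input, noisy-"and"-gate (DINA) response kernel that DINA-GD extends to graded data: examinees who master all attributes an item requires succeed with probability 1 - slip, while all others succeed only by guessing. The Q-matrix and slip/guess values are illustrative assumptions; the graded extension itself is not reproduced here.

```python
import numpy as np

Q = np.array([[1, 0, 1],          # item-by-attribute Q-matrix (assumed)
              [0, 1, 0],
              [1, 1, 0]])
slip = np.array([0.10, 0.15, 0.20])
guess = np.array([0.20, 0.25, 0.10])

def dina_prob(alpha, Q, slip, guess):
    """P(correct) per item for attribute profile alpha under DINA."""
    # Gate: 1 only when the profile covers every attribute the item requires.
    eta = np.all(alpha[None, :] >= Q, axis=1).astype(float)
    return eta * (1 - slip) + (1 - eta) * guess

alpha = np.array([1, 0, 1])       # masters attributes 1 and 3
print(dina_prob(alpha, Q, slip, guess))   # -> [0.9, 0.25, 0.1]
```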
Citations: 10