
Latest articles from Journal of applied measurement

Computer Adaptive Test Stopping Rules Applied to The Flexilevel Shoulder Functioning Test.
Pub Date : 2019-01-01
Trenton J Combs, Kyle W English, Barbara G Dodd, Hyeon-Ah Kang

Computerized adaptive testing (CAT) is an attractive alternative to traditional paper-and-pencil testing because it can provide accurate trait estimates while administering fewer items than a linear test form. A stopping rule is an important factor in determining an assessment's efficiency. This simulation compares three variable-length stopping rules (standard error [SE] of .3, minimum information [MI] of .7, and change in trait [CT] of .02) with and without a maximum number of items (20) imposed. We use fixed-length criteria of 10 and 20 items, corresponding to two versions of a linear assessment, as a comparison. The MI rules resulted in longer assessments with more biased trait estimates than the other rules. The CT rule resulted in more biased estimates at the higher end of the trait scale and larger standard errors. The SE rules performed well across the trait scale in terms of both measurement precision and efficiency.
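The three variable-length rules can be sketched as per-item stopping checks. This is a minimal illustration of the decision logic described in the abstract, not the authors' simulation code; the function name and argument names are hypothetical.

```python
def should_stop(rule, *, se=None, best_remaining_info=None, theta_change=None,
                n_administered=0, se_cut=0.3, mi_cut=0.7, ct_cut=0.02,
                max_items=None):
    """Return True when a variable-length CAT should stop under one rule.

    'SE': stop once the standard error of the trait estimate falls below
          se_cut (.3 in the study);
    'MI': stop once the most informative remaining item offers less than
          mi_cut (.7) information at the current trait estimate;
    'CT': stop once the trait estimate changes by less than ct_cut (.02)
          between successive items.
    max_items (20 in the study), when set, caps test length for any rule.
    """
    if max_items is not None and n_administered >= max_items:
        return True
    if rule == 'SE':
        return se is not None and se < se_cut
    if rule == 'MI':
        return best_remaining_info is not None and best_remaining_info < mi_cut
    if rule == 'CT':
        return theta_change is not None and abs(theta_change) < ct_cut
    raise ValueError(f"unknown rule: {rule}")
```

In a CAT loop this check would run after each administered item, with the trait estimate, its standard error, and the information of the best remaining item updated first.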

Journal of applied measurement, 20(1), 66-78.
Citations: 0
Expected Values for Category-To-Measure and Measure-To-Category Statistics: A Simulation Study.
Pub Date : 2019-01-01
Eivind Kaspersen

There are many sources of evidence for a well-functioning rating scale. Two of these sources are analyses of measure-to-category and category-to-measure statistics. An absolute cut-value of 40% has been suggested for these statistics. However, no evidence exists in the literature that this value is appropriate. Thus, this paper discusses the results of simulation studies that examined the expected values in different contexts. The study concludes that the static cut-value of 40% should be replaced with expected values for measure-to-category and category-to-measure analyses.

Journal of applied measurement, 20(2), 146-153.
Citations: 0
Loevinger on Unidimensional Tests with Reference to Guttman, Rasch, and Wright.
Pub Date : 2019-01-01
Mark H Stone, A Jackson Stenner

Loevinger's specifications for a unidimensional test are discussed. The implications are reviewed using commentary from Guttman's and Rasch's specification for specific objectivity. A large population is sampled to evaluate the implications of this approach in light of Wright's early presentation regarding data analysis. The results of this analysis show the sample follows the specifications of Loevinger and those of Rasch for a unidimensional test.

Journal of applied measurement, 20(2), 123-133.
Citations: 0
Cross-Cultural Comparisons of School Leadership using Rasch Measurement.
Pub Date : 2019-01-01
Sijia Zhang, Stefanie A Wind

School leadership influences school conditions and organizational climate; these conditions in turn impact student outcomes. Accordingly, examining differences in principals' perceptions of leadership activities within and across countries may provide insight into achievement differences. The major purpose of this study was to explore differences in the relative difficulty of principals' leadership activities across four countries that reflect Asian and North American national contexts: (1) Hong Kong SAR, (2) Chinese Taipei, (3) the United States, and (4) Canada. We also sought to illustrate the use of Rasch measurement theory as a modern measurement approach to exploring the psychometric properties of a leadership survey, with a focus on differential item functioning. We applied a rating scale formulation of the Many-facet Rasch model to principals' responses to the Leadership Activities Scale in order to examine the degree to which the overall ordering of leadership activities was invariant across the four countries. Overall, the results suggested that there were significant differences in the difficulty ordering of leadership activities across countries, and that these differences were most pronounced between the two continents. Implications are discussed for research and practice.

Journal of applied measurement, 20(2), 167-183.
Citations: 0
Lucky Guess? Applying Rasch Measurement Theory to Grade 5 South African Mathematics Achievement Data.
Pub Date : 2019-01-01
Sarah Bansilal, Caroline Long, Andrea Juan

The use of multiple-choice items in assessments, in the interest of increased efficiency, brings associated challenges, notably the phenomenon of guessing. The purpose of this study is to use Rasch measurement theory to investigate the extent of guessing in a sample of responses taken from the Trends in International Mathematics and Science Study (TIMSS) 2015. A method of checking the extent of guessing in test data, a tailored analysis, is applied to data from a sample of 2,188 learners on a subset of items. The analysis confirms prior research showing that as the difficulty of an item increases, the probability of guessing also increases. An outcome of the tailored analysis is that items at the high-proficiency end of the continuum increase in difficulty. Because item difficulties are estimated as relatively lower than they would be without guessing, learner proficiency at the higher end is underestimated while the achievement of learners with lower proficiency is overestimated. Hence, it is important that finer analysis of systemic data takes guessing into account, so that more nuanced information can be obtained to inform subsequent cycles of education planning.
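The claim that the probability of guessing rises with item difficulty can be illustrated with a toy Rasch-plus-random-guessing mixture. This is an illustrative assumption, not the authors' tailored analysis: a learner answers from knowledge with the Rasch probability, and otherwise guesses with chance success g.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct (knowledge-based) response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def share_correct_from_guessing(theta, b, g=0.25):
    """Under the mixture -- answer if you know, otherwise guess with success
    probability g (0.25 for 4-option multiple choice) -- the fraction of
    observed correct responses that are lucky guesses."""
    p_know = rasch_p(theta, b)
    p_correct = p_know + (1.0 - p_know) * g
    return (1.0 - p_know) * g / p_correct

# For a fixed learner (theta = 0), harder items yield a larger guessing share:
easy = share_correct_from_guessing(theta=0.0, b=-2.0)  # roughly 0.03
hard = share_correct_from_guessing(theta=0.0, b=2.0)   # roughly 0.65
```

Because correct guesses concentrate on hard items, those items look easier than they are, which is the distortion the tailored analysis is designed to expose.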

Journal of applied measurement, 20(2), 206-220.
Citations: 0
Accuracy and Utility of the AUDIT-C with Adolescent Girls and Young Women (AGYW) Who Engage in HIV Risk Behaviors in South Africa.
Pub Date : 2019-01-01
Tracy Kline, Corina Owens, Courtney Peasant Bonner, Tara Carney, Felicia A Browne, Wendee M Wechsberg

Hazardous drinking is a risk factor associated with sexual risk, gender-based violence, and HIV transmission in South Africa. Consequently, sound and appropriate measurement of drinking behavior is critical to determining what constitutes hazardous drinking. Many research studies use internal consistency estimates as the determining factor in psychometric assessment; however, deeper assessments are needed to best define a measurement tool. Rasch methodology was used to evaluate a shorter version of the Alcohol Use Disorders Identification Test, the AUDIT-C, in a sample of adolescent girls and young women (AGYW) who use alcohol and other drugs in South Africa (n = 100). Investigations of operational response range, item fit, sensitivity, and response option usage provide a richer picture of AUDIT-C functioning than internal consistency alone in women who are vulnerable to hazardous drinking and therefore at risk of HIV. Analyses indicate that the AUDIT-C does not adequately measure this specialized population, and that more validation is needed to determine if the AUDIT-C should continue to be used in HIV prevention intervention studies focused on vulnerable adolescent girls and young women.

Journal of applied measurement, 20(1), 112-122. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10961932/pdf/
Citations: 0
Examining Differential Item Functioning in the Household Food Insecurity Scale: Does Participation in SNAP Affect Measurement Invariance?
Pub Date : 2019-01-01
Victoria T Tanaka, George Engelhard, Matthew P Rabbitt

The Household Food Security Survey Module (HFSSM) is a scale used by the U.S. Department of Agriculture to measure the severity of food insecurity experienced by U.S. households. In this study, measurement invariance of the HFSSM is examined across households based on participation in the Supplemental Nutrition Assistance Program (SNAP). Households with children who responded to the HFSSM in 2015 and 2016 (N = 3,931) are examined. The Rasch model is used to analyze differential item functioning (DIF) related to SNAP participation. Analyses suggest a small difference in reported food insecurity between SNAP and non-SNAP participants (27% versus 23% respectively). However, the size and direction of the DIF mitigates the impact on overall estimates of household food insecurity. Person-fit indices suggest that the household aberrant response rate is 6.6% and the number of misfitting households is comparable for SNAP (6.80%) and non-SNAP participants (6.30%). Implications for research and policy related to food insecurity are discussed.

Journal of applied measurement, 20(1), 100-111.
Citations: 0
The Effects of Probability Threshold Choice on an Adjustment for Guessing using the Rasch Model.
Pub Date : 2019-01-01
Glenn Thomas Waterbury, Christine E DeMars

This paper investigates a strategy for accounting for correct guessing with the Rasch model, which we call the Guessing Adjustment. This strategy involves the identification of all person/item encounters where the probability of a correct response is below a specified threshold. These responses are converted to missing data and the calibration is conducted a second time. This simulation study focuses on the effects of different probability thresholds across varying conditions of sample size, amount of correct guessing, and item difficulty. Bias, standard errors, and root mean squared errors (RMSE) were calculated within each condition. Larger probability thresholds were generally associated with reductions in bias and increases in standard errors. Across most conditions, the reduction in bias was more impactful than the decrease in precision, as reflected by the RMSE. The Guessing Adjustment is an effective means for reducing the impact of correct guessing, and the choice of probability threshold matters.
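The masking step of the Guessing Adjustment can be sketched directly from the description above. This is a sketch under assumptions (first-pass trait and difficulty estimates are taken as given; the helper names are hypothetical), not the paper's estimation code:

```python
import math

def rasch_p(theta, b):
    """Rasch probability of success for trait theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def guessing_adjustment_mask(responses, thetas, difficulties, threshold=0.2):
    """Build the second-pass data for the Guessing Adjustment: set to None
    (missing) every person/item encounter whose first-pass probability of a
    correct response is below `threshold`, so lucky correct guesses cannot
    distort the recalibrated estimates."""
    return [
        [None if rasch_p(theta, b) < threshold else x
         for b, x in zip(difficulties, row)]
        for theta, row in zip(thetas, responses)
    ]

# Toy illustration: the third item (difficulty 2.0) sits far above both
# first-pass trait estimates, so those encounters are flagged as likely guesses.
masked = guessing_adjustment_mask(
    responses=[[1, 1, 1], [1, 0, 1]],
    thetas=[0.0, -1.0],
    difficulties=[-1.0, 0.0, 2.0],
)
# masked == [[1, 1, None], [1, 0, None]]
```

The calibration would then be rerun on `masked`, treating the None entries as not administered.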

Journal of applied measurement, 20(1), 1-12.
Citations: 0
Quantifying Item Invariance for the Selection of the Least Biased Assessment.
Pub Date : 2019-01-01
W Holmes Finch, Brian F French, Maria E Hernandez Finch

An important aspect of educational and psychological measurement and evaluation of individuals is the selection of scales with appropriate evidence of reliability and validity for inferences and uses of the scores for the population of interest. One aspect of validity is the degree to which a scale fairly assesses the construct(s) of interest for members of different subgroups within the population. Typically, this issue is addressed statistically through assessment of differential item functioning (DIF) of individual items, or differential bundle functioning (DBF) of sets of items. When selecting an assessment to use for a given application (e.g., measuring intelligence), or which form of an assessment to use in a given instance, researchers need to consider the extent to which the scales work with all members of the population. Little research has examined methods for comparing the amount or magnitude of DIF/DBF present in two assessments when deciding which assessment to use. The current simulation study examines six different statistics for this purpose. Results show that a method based on the random effects item response theory model may be optimal for instrument comparisons, particularly when the assessments being compared are not of the same length.

Journal of applied measurement, 20(1), 13-26.
Citations: 0
Examining Rater Judgements in Music Performance Assessment using Many-Facets Rasch Rating Scale Measurement Model.
Pub Date : 2019-01-01
Pey Shin Ooi, George Engelhard

The fairness of raters in music performance assessment has become an important concern in the field of music. The assessment of students' music performance depends in a fundamental way on rater judgements. The quality of rater judgements is crucial to provide fair, meaningful and informative assessments of music performance. There are many external factors that can influence the quality of rater judgements. Previous research has used different measurement models to examine the quality of rater judgements (e.g., generalizability theory). There are limitations with the previous analysis methods that are based on classical test theory and its extensions. In this study, we use modern measurement theory (Rasch measurement theory) to examine the quality of rater judgements. The many-facets Rasch rating scale model is employed to investigate the extent of rater-invariant measurement in the context of music performance assessments related to university degrees in Malaysia (159 students rated by 24 raters). We examine the rating scale structure, the severity levels of the raters, and the judged difficulty of the items. We also examine the interaction effects across musical instrument subgroups (keyboard, strings, woodwinds, brass, percussion, and vocal). The results suggest that there were differences in severity levels among the raters. The results of this study also suggest that raters had different severity levels when rating different musical instrument subgroups. The implications for research, theory and practice in the assessment of music performance are included in this paper.
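The many-facets Rasch rating scale model decomposes each rating's adjacent-category log-odds into student ability, item difficulty, rater severity, and a category threshold. A minimal sketch of those category probabilities (an illustration of the model form, not the authors' estimation code; function and parameter names are hypothetical):

```python
import math

def mfrm_category_probs(theta, item_difficulty, rater_severity, thresholds):
    """Category probabilities for one rating under a many-facets Rasch rating
    scale model, where log(P_k / P_{k-1}) = theta - item_difficulty
    - rater_severity - tau_k for adjacent categories k-1 and k."""
    logits = [0.0]  # cumulative log-odds relative to the lowest category
    for tau in thresholds:
        logits.append(logits[-1] + theta - item_difficulty - rater_severity - tau)
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A more severe rater shifts probability mass toward lower score categories
# for the same student and item.
lenient = mfrm_category_probs(theta=0.5, item_difficulty=0.0,
                              rater_severity=-1.0, thresholds=[-0.5, 0.5])
severe = mfrm_category_probs(theta=0.5, item_difficulty=0.0,
                             rater_severity=1.0, thresholds=[-0.5, 0.5])
```

Rater-invariant measurement holds when, after adjusting for severity this way, the ordering of students does not depend on which rater they happened to draw.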

Examining Rater Judgements in Music Performance Assessment using Many-Facets Rasch Rating Scale Measurement Model.
Pey Shin Ooi, George Engelhard
Journal of applied measurement, 2019, 20(1), 79-99.