Detecting Rater Bias in Mixed-Format Assessments
Stefanie A. Wind, Yuan Ge
Pub Date: 2024-02-20 | DOI: 10.1080/15366367.2023.2173468
Mixed-format assessments made up of multiple-choice (MC) items and constructed response (CR) items that are scored using rater judgments include unique psychometric considerations. When these item ...
{"title":"Detecting Rater Bias in Mixed-Format Assessments","authors":"Stefanie A. Wind, Yuan Ge","doi":"10.1080/15366367.2023.2173468","DOIUrl":"https://doi.org/10.1080/15366367.2023.2173468","url":null,"abstract":"Mixed-format assessments made up of multiple-choice (MC) items and constructed response (CR) items that are scored using rater judgments include unique psychometric considerations. When these item ...","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139910185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring Construct Measures Using Rasch Models and Discretization Methods to Analyze Existing Continuous Data
Chen Qiu, Michael R. Peabody, Kelly D. Bradley
Pub Date: 2024-02-20 | DOI: 10.1080/15366367.2023.2210358
It is meaningful to create a comprehensive score to extract information from mass continuous data when they measure the same latent concept. Therefore, this study adopts the logic of psychometrics ... (See the illustrative discretization sketch below.)
{"title":"Exploring Construct Measures Using Rasch Models and Discretization Methods to Analyze Existing Continuous Data","authors":"Chen Qiu, Michael R. Peabody, Kelly D. Bradley","doi":"10.1080/15366367.2023.2210358","DOIUrl":"https://doi.org/10.1080/15366367.2023.2210358","url":null,"abstract":"It is meaningful to create a comprehensive score to extract information from mass continuous data when they measure the same latent concept. Therefore, this study adopts the logic of psychometrics ...","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139910226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Examination of the Linking Error Currently Used in PISA
Alexander Robitzsch, Oliver Lüdtke
Pub Date: 2024-02-20 | DOI: 10.1080/15366367.2023.2198915
Educational large-scale assessment (LSA) studies like the Programme for International Student Assessment (PISA) provide important information about trends in the performance of educational indicators...
{"title":"An Examination of the Linking Error Currently Used in PISA","authors":"Alexander Robitzsch, Oliver Lüdtke","doi":"10.1080/15366367.2023.2198915","DOIUrl":"https://doi.org/10.1080/15366367.2023.2198915","url":null,"abstract":"Educational large-scale assessment (LSA) studies like the program for international student assessment (PISA) provide important information about trends in the performance of educational indicators...","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139948159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Choosing Between the Bi-Factor and Second-Order Factor Models: A Direct Test Using Latent Variable Modeling
Tenko Raykov, Lisa Calvocoressi, Randall E. Schumacker
Pub Date: 2024-02-20 | DOI: 10.1080/15366367.2023.2173547
This paper is concerned with the process of selecting between the increasingly popular bi-factor model and the second-order factor model in measurement research. It is indicated that in certain set...
{"title":"Choosing Between the Bi-Factor and Second-Order Factor Models: A Direct Test Using Latent Variable Modeling","authors":"Tenko Raykov, Lisa Calvocoressi, Randall E. Schumacker","doi":"10.1080/15366367.2023.2173547","DOIUrl":"https://doi.org/10.1080/15366367.2023.2173547","url":null,"abstract":"This paper is concerned with the process of selecting between the increasingly popular bi-factor model and the second-order factor model in measurement research. It is indicated that in certain set...","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139910283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Cronbach’s Coefficient Alpha and Testing Its Identity to Scale Reliability: A Direct Bayesian Confirmatory Factor Analysis Procedure
Tenko Raykov, George Marcoulides, James Anthony, Natalja Menold
Pub Date: 2024-02-20 | DOI: 10.1080/15366367.2023.2201963
A Bayesian statistics-based approach is discussed that can be used for direct evaluation of the popular Cronbach’s coefficient alpha as an internal consistency index for multiple-component measurin... (See the illustrative coefficient alpha sketch below.)
{"title":"Evaluating Cronbach’s Coefficient Alpha and Testing Its Identity to Scale Reliability: A Direct Bayesian Confirmatory Factor Analysis Procedure","authors":"Tenko Raykov, George Marcoulides, James Anthony, Natalja Menold","doi":"10.1080/15366367.2023.2201963","DOIUrl":"https://doi.org/10.1080/15366367.2023.2201963","url":null,"abstract":"A Bayesian statistics-based approach is discussed that can be used for direct evaluation of the popular Cronbach’s coefficient alpha as an internal consistency index for multiple-component measurin...","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139910380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Closing Reporting Gaps: A Comparison of Methods for Estimating Unreported Subgroup Achievement on NAEP
David Bamat
Pub Date: 2024-02-20 | DOI: 10.1080/15366367.2023.2173467
The National Assessment of Educational Progress (NAEP) program only reports state-level subgroup results if it samples at least 62 students identifying with the subgroup. Since some subgroups const...
{"title":"Closing Reporting Gaps: A Comparison of Methods for Estimating Unreported Subgroup Achievement on NAEP","authors":"David Bamat","doi":"10.1080/15366367.2023.2173467","DOIUrl":"https://doi.org/10.1080/15366367.2023.2173467","url":null,"abstract":"The National Assessment of Educational Progress (NAEP) program only reports state-level subgroup results if it samples at least 62 students identifying with the subgroup. Since some subgroups const...","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139979674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integration of Historical Data for the Analysis of Multiple Assessment Studies
Katerina M. Marcoulides
Pub Date: 2023-07-03 | DOI: 10.1080/15366367.2022.2115250
Integrative data analyses have recently been shown to be an effective tool for researchers interested in synthesizing datasets from multiple studies in order to draw statistical or substantive conclusions. The actual process of integrating the different datasets depends on the availability of some common measures or items reflecting the same studied constructs. However, exactly how many common items are needed to effectively integrate multiple datasets has to date not been determined. This study evaluated the effect of using different numbers of common items in integrative data analysis applications. The study used simulations based on realistic data integration settings in which the number of common item sets was varied. The results provided insight concerning the optimal number of common item sets needed to safeguard estimation precision. The practical implications of these findings, in view of past research in the psychometric literature concerning the necessary number of common item sets, are also discussed. (See the illustrative common-item linking sketch below.)
{"title":"Integration of Historical Data for the Analysis of Multiple Assessment Studies","authors":"Katerina M. Marcoulides","doi":"10.1080/15366367.2022.2115250","DOIUrl":"https://doi.org/10.1080/15366367.2022.2115250","url":null,"abstract":"ABSTRACT Integrative data analyses have recently been shown to be an effective tool for researchers interested in synthesizing datasets from multiple studies in order to draw statistical or substantive conclusions. The actual process of integrating the different datasets depends on the availability of some common measures or items reflecting the same studied constructs. However, exactly how many common items are needed to effectively integrate multiple datasets has to date not been determined. This study evaluated the effect of using different numbers of common items in integrative data analysis applications. The study used simulations based on realistic data integration settings in which the number of common item sets was varied. The results provided insight concerning the optimal numbers of common items sets to safeguard estimation precision. The practical implications of these findings in view of past research in the psychometric literature concerning the necessary number of common item sets are also discussed.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2023-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82907670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Person Misfit and Person Reliability in Rating Scale Measures: The Role of Response Styles
Tongtong Zou, D. Bolt
Pub Date: 2023-07-03 | DOI: 10.1080/15366367.2022.2114243
Person misfit and person reliability indices in item response theory (IRT) can play an important role in evaluating the validity of a test or survey instrument at the respondent level. Prior empirical comparisons of these indices have been applied to binary item response data and suggest that the two types of indices return very similar results. In this paper, however, we demonstrate an important applied distinction between these methods when applied to polytomously scored rating scale items, namely their varying sensitivities to response style tendencies. Using several empirical datasets, we illustrate settings in which these indices are in one case highly correlated and in two other cases weakly correlated. In the datasets showing a weak correlation between indices, the primary distinction appears to be due to the effects of response style behavior: respondents whose response styles are less common (e.g., a disproportionate selection of the midpoint response) are found to misfit according to Drasgow et al.'s person misfit index but often show high levels of reliability from a person reliability perspective; just the opposite frequently occurs for respondents who over-select the rating scale extremes. It is suggested that person misfit reporting should be supplemented with an evaluation of person reliability to best understand the validity of measurement at the respondent level when using IRT models with rating scale measures. (See the illustrative person-fit sketch below.)
{"title":"Person Misfit and Person Reliability in Rating Scale Measures: The Role of Response Styles","authors":"Tongtong Zou, D. Bolt","doi":"10.1080/15366367.2022.2114243","DOIUrl":"https://doi.org/10.1080/15366367.2022.2114243","url":null,"abstract":"ABSTRACT Person misfit and person reliability indices in item response theory (IRT) can play an important role in evaluating the validity of a test or survey instrument at the respondent level. Prior empirical comparisons of these indices have been applied to binary item response data and suggest that the two types of indices return very similar results. In this paper, however, we demonstrate an important applied distinction between these methods when applied to polytomously-scored rating scale items, namely their varying sensitivities to response style tendencies. Using several empirical datasets, we illustrate settings in which these indices are in one case highly correlated and in two other cases weakly correlated. In the datasets showing a weak correlation between indices, the primary distinction appears due to the effects of response style behavior, whereby respondents whose response styles are less common (e.g. a disproportionate selection of the midpoint response) are found to misfit using Drasgow et al’s person misfit index, but often show high levels of reliability from a person reliability perspective; just the opposite frequently occurs for respondents that over-select the rating scale extremes. It is suggested that person misfit reporting should be supplemented with an evaluation of person reliability to best understand the validity of measurement at the respondent level when using IRT models with rating scale measures.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2023-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73861512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Functional Data Analysis and Person Response Functions
Kyle T. Turner, G. Engelhard
Pub Date: 2023-07-03 | DOI: 10.1080/15366367.2022.2054130
The purpose of this study is to illustrate the use of functional data analysis (FDA) as a general methodology for analyzing person response functions (PRFs). Applications of FDA to psychometrics have included the estimation of item response functions and latent distributions, as well as differential item functioning. Although FDA has been suggested for modeling PRFs, there has been relatively little research stressing this application. FDA offers an approach for diagnosing person responses that may be due to guessing and other sources of within-person multidimensionality. PRFs provide graphical displays that can be used to highlight unusual response patterns and to identify persons who are not responding as expected to a set of test items. In addition to examining individual PRFs, functional clustering techniques can be used to identify subgroups of persons that may be exhibiting categories of misfit such as guessing. A small simulation study is conducted to illustrate how FDA can be used to identify persons exhibiting different levels of guessing behavior (5%, 10%, 15%, and 20%). The methodology is also applied to real data from a 3rd grade science assessment used in a southeastern state. FDA offers a promising methodology for evaluating whether or not meaningful scores have been obtained for a person. Typical indices of psychometric quality, such as standard errors of measurement and person fit indices, are not sufficient for representing certain types of aberrance in person response patterns. Nonparametric graphical methods for estimating PRFs that are based on FDA provide a rich source of validity evidence regarding the meaning and usefulness of each person's score. (See the illustrative person response function sketch below.)
{"title":"Functional Data Analysis and Person Response Functions","authors":"Kyle T. Turner, G. Engelhard","doi":"10.1080/15366367.2022.2054130","DOIUrl":"https://doi.org/10.1080/15366367.2022.2054130","url":null,"abstract":"ABSTRACT The purpose of this study is to illustrate the use of functional data analysis (FDA) as a general methodology for analyzing person response functions (PRFs). Applications of FDA to psychometrics have included the estimation of item response functions and latent distributions, as well as differential item functioning. Although FDA has been suggested for modeling PRFs, there has been relatively little research stressing this application. FDA offers an approach for diagnosing person responses that may be due to guessing and other sources of within-person multidimensionality. PRFs provide graphical displays that can be used to highlight unusual response patterns, and to identify persons that are not responding as expected to a set of test items. In addition to examining individual PRFs, functional clustering techniques can be used to identify subgroups of persons that may be exhibiting categories of misfit such as guessing. A small simulation study is conducted to illustrate how FDA can be used to identify persons exhibiting different levels of guessing behavior (5%, 10%, 15% and 20%). The methodology is also applied to real data from a 3rd grade science assessment used in a southeastern state. FDA offers a promising methodology for evaluating whether or not meaningful scores have been obtained for a person. Typical indices of psychometric quality, such as standard errors of measurement and person fit indices, are not sufficient for representing certain types of aberrance in person response patterns. Nonparametric graphical methods for estimating PRFs that are based FDA provide a rich source of validity evidence regarding the meaning and usefulness of each person’s score.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2023-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79466782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Concerto Software for Computerized Adaptive Testing – Free Version
Yiling Cheng
Pub Date: 2023-07-03 | DOI: 10.1080/15366367.2023.2187274
Computerized adaptive testing (CAT) offers an efficient and highly accurate method for estimating examinees' abilities. This article reviews the free version of the Concerto software for CAT, organizing the evaluation into three sections: software implementation, the Item Response Theory (IRT) features of CAT, and user experience. Overall, Concerto is an excellent tool for researchers seeking to create computerized adaptive (or non-adaptive) tests, providing a robust platform with comprehensive IRT capabilities and a user-friendly interface. (See the illustrative item-selection sketch below.)
{"title":"Concerto Software for Computerized Adaptive Testing – Free Version","authors":"Yiling Cheng","doi":"10.1080/15366367.2023.2187274","DOIUrl":"https://doi.org/10.1080/15366367.2023.2187274","url":null,"abstract":"ABSTRACT Computerized adaptive testing (CAT) offers an efficient and highly accurate method for estimating examinees' abilities. In this article, the free version of Concerto Software for CAT was reviewed, dividing our evaluation into three sections: software implementation, the Item Response Theory (IRT) features of CAT, and user experience. Overall, Concerto is an excellent tool for researchers seeking to create computerized adaptive (or non-adaptive) tests, providing a robust platform with comprehensive IRT capabilities and a user-friendly interface.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2023-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84024894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}