Application of Network Analysis to Description and Prediction of Assessment Outcomes
Pub Date: 2022-07-03 | DOI: 10.1080/15366367.2021.1971024
James J. Thompson
ABSTRACT With the use of computerized testing, ordinary assessments can capture both answer accuracy and answer response time. For the Canadian Programme for the International Assessment of Adult Competencies (PIAAC) numeracy and literacy subtests, person ability, person speed, question difficulty, question time intensity, fluency (rate), person fluency (skill), question fluency (load), pace (rank of response time within question), and person pace were assessed. Undirected Gaussian graphical model networks, estimated from partial correlations among these measures, were used to predict each measure from its neighboring nodes. The population-based model extrapolated well to individual person estimations. Finally, the “training” Canadian model generalized with minor differences to four other English-speaking PIAAC assessments (USA, Great Britain, Ireland, and New Zealand). Thus, the undirected network approach provides a heuristic that is both descriptive and predictive. However, the model is not causal and can be taken as an example of “mutualism.”
{"title":"Application of Network Analysis to Description and Prediction of Assessment Outcomes","authors":"James J. Thompson","doi":"10.1080/15366367.2021.1971024","DOIUrl":"https://doi.org/10.1080/15366367.2021.1971024","url":null,"abstract":"ABSTRACT With the use of computerized testing, ordinary assessments can capture both answer accuracy and answer response time. For the Canadian Programme for the International Assessment of Adult Competencies (PIAAC) numeracy and literacy subtests, person ability, person speed, question difficulty, question time intensity, fluency (rate), person fluency (skill), question fluency (load), pace (rank of response time within question), and person pace were assessed. Undirected Gaussian Graphical Model networks of the measures based on partial correlations were predictive of the measures as nodes. The population-based model extrapolated well to individual person estimations. Finally, it was shown that the “training” Canadian model generalized with minor differences to four other English-speaking PIAAC assessments (USA, Great Britain, Ireland, and New Zealand). Thus, the undirected network approach provides a heuristic that is both descriptive and predictive. However, the model is not causal and can be taken as an example of “mutualism.”","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":"3 1","pages":"121 - 138"},"PeriodicalIF":1.0,"publicationDate":"2022-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88826420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Think-aloud Interviews to Examine a Clinically Oriented Performance Assessment Rubric
Pub Date: 2022-07-03 | DOI: 10.1080/15366367.2021.1991742
M. Roduta Roberts, Chad M. Gotch, Megan Cook, Karin Werther, I. Chao
ABSTRACT Performance-based assessment is a common approach to assess the development and acquisition of practice competencies among health professions students. Judgments related to the quality of performance are typically operationalized as ratings against success criteria specified within a rubric. The extent to which the rubric is understood, interpreted, and applied by assessors is critical to support valid score interpretations and their subsequent use. Therefore, the purpose of this study was to examine evidence to support a scoring inference related to assessor ratings on a clinically oriented performance-based examination. Think-aloud data showed that rubric dimensions generally informed assessors’ ratings, but specific performance descriptors were rarely invoked. These findings support revisions to the rubric (e.g., less subjective, rating-scale language) and highlight tensions and implications of using rubrics for student evaluation and making decisions in a learning context.
{"title":"Using Think-aloud Interviews to Examine a Clinically Oriented Performance Assessment Rubric","authors":"M. Roduta Roberts, Chad M. Gotch, Megan Cook, Karin Werther, I. Chao","doi":"10.1080/15366367.2021.1991742","DOIUrl":"https://doi.org/10.1080/15366367.2021.1991742","url":null,"abstract":"ABSTRACT Performance-based assessment is a common approach to assess the development and acquisition of practice competencies among health professions students. Judgments related to the quality of performance are typically operationalized as ratings against success criteria specified within a rubric. The extent to which the rubric is understood, interpreted, and applied by assessors is critical to support valid score interpretations and their subsequent use. Therefore, the purpose of this study was to examine evidence to support a scoring inference related to assessor ratings on a clinically oriented performance-based examination. Think-aloud data showed that rubric dimensions generally informed assessors’ ratings, but specific performance descriptors were rarely invoked. These findings support revisions to the rubric (e.g., less subjective, rating-scale language) and highlight tensions and implications of using rubrics for student evaluation and making decisions in a learning context.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":"22 1","pages":"139 - 150"},"PeriodicalIF":1.0,"publicationDate":"2022-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87509730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rater Connections and the Detection of Bias in Performance Assessment
Pub Date: 2022-04-03 | DOI: 10.1080/15366367.2021.1942672
Stefanie A. Wind
ABSTRACT In many performance assessments, one or two raters from the complete rater pool score each performance, resulting in a sparse rating design, where there are limited observations of each rater relative to the complete sample of students. Although sparse rating designs can be constructed to facilitate estimation of student achievement, the relatively limited observations of each rater can pose challenges for identifying raters who may exhibit scoring idiosyncrasies specific to individual examinees or subgroups of examinees, such as differential rater functioning (DRF; i.e., rater bias). In particular, when raters who exhibit DRF are directly connected to other raters who exhibit the same type of DRF, there is limited information with which to detect this effect. On the other hand, if raters who exhibit DRF are connected to raters who do not exhibit DRF, this effect may be more readily detected. In this study, a simulation is used to systematically examine the degree to which the nature of connections among raters who exhibit common DRF patterns in sparse rating designs impacts the sensitivity of DRF indices. The use of additional “monitoring ratings” and variable rater assignment to student performances are considered as strategies to improve DRF detection in sparse designs. The results indicate that the nature of connections among DRF raters has a substantial impact on the sensitivity of DRF indices, and that monitoring ratings and variable rater assignment to student performances can improve DRF detection.
{"title":"Rater Connections and the Detection of Bias in Performance Assessment","authors":"Stefanie A. Wind","doi":"10.1080/15366367.2021.1942672","DOIUrl":"https://doi.org/10.1080/15366367.2021.1942672","url":null,"abstract":"ABSTRACT In many performance assessments, one or two raters from the complete rater pool scores each performance, resulting in a sparse rating design, where there are limited observations of each rater relative to the complete sample of students. Although sparse rating designs can be constructed to facilitate estimation of student achievement, the relatively limited observations of each rater can pose challenges for identifying raters who may exhibit scoring idiosyncrasies specific to individual or subgroups of examinees, such as differential rater functioning (DRF; i.e., rater bias). In particular, when raters who exhibit DRF are directly connected to other raters who exhibit the same type of DRF, there is limited information with which to detect this effect. On the other hand, if raters who exhibit DRF are connected to raters who do not exhibit DRF, this effect may be more readily detected. In this study, a simulation is used to systematically examine the degree to which the nature of connections among raters who exhibit common DRF patterns in sparse rating designs impacts the sensitivity of DRF indices. The use of additional “monitoring ratings” and variable rater assignment to student performances are considered as strategies to improve DRF detection in sparse designs. The results indicate that the nature of connections among DRF raters has a substantial impact on the sensitivity of DRF indices, and that monitoring ratings and variable rater assignment to student performances can improve DRF detection.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":"26 1","pages":"91 - 106"},"PeriodicalIF":1.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82602033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Comparison of Estimation Methods for the Four-Parameter Logistic Item Response Theory Model
Pub Date: 2022-04-03 | DOI: 10.1080/15366367.2021.1897398
Ö. K. Kalkan
ABSTRACT The four-parameter logistic (4PL) Item Response Theory (IRT) model has recently been reconsidered in the literature due to advances in statistical modeling software and recent developments in the estimation of 4PL IRT model parameters. The current simulation study evaluated the performance of expectation-maximization (EM), Quasi-Monte Carlo EM (QMCEM), and Metropolis-Hastings Robbins-Monro (MH-RM) estimation methods for the item parameters of the 4PL IRT model under the manipulated study conditions, including the number of factors, the correlation between factors, and test length. The results indicated that no single method could be recommended as the most accurate of the three estimation algorithms for 4PL item parameters across all study conditions. However, the MH-RM algorithm can be suggested for 4PL item parameter estimation when the number of factors is 2 or 3. In addition, longer test lengths (n = 48) may be preferred over shorter test lengths (n = 24), as all three algorithms provide better item parameter estimates at the longer test length.
{"title":"The Comparison of Estimation Methods for the Four-Parameter Logistic Item Response Theory Model","authors":"Ö. K. Kalkan","doi":"10.1080/15366367.2021.1897398","DOIUrl":"https://doi.org/10.1080/15366367.2021.1897398","url":null,"abstract":"ABSTRACT The four-parameter logistic (4PL) Item Response Theory (IRT) model has recently been reconsidered in the literature due to the advances in the statistical modeling software and the recent developments in the estimation of the 4PL IRT model parameters. The current simulation study evaluated the performance of expectation-maximization (EM), Quasi-Monte Carlo EM (QMCEM), and Metropolis-Hastings Robbins-Monro (MH-RM) estimation methods for the item parameters in the 4PL IRT model under the manipulated study conditions, including the number of factors, the correlation between factors, and test length. The results indicated that there was no method to be recommended as the best one among the three estimation algorithms for the estimation of 4PL item parameters accurately across all study conditions. However, using the MH-RM algorithm for 4PL model item parameter estimation can be suggested when the number of factors is 2 or 3. In addition, it may be advised to prefer long test lengths rather than shorter test lengths (n = 24), as three algorithms provide better item parameter estimates at long test lengths (n = 48).","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":"7 1","pages":"73 - 90"},"PeriodicalIF":1.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87961872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Now in JMP® Pro: Structural Equation Modeling
Pub Date: 2022-04-03 | DOI: 10.1080/15366367.2022.2094847
Measurement: Interdisciplinary Research and Perspectives, p. 1.
Sample Size Requirements for Parameter Recovery in the 4-Parameter Logistic Model
Pub Date: 2022-04-03 | DOI: 10.1080/15366367.2021.1934805
Ismail Cuhadar
ABSTRACT In practice, some test items may display misfit at the upper asymptote of the item characteristic curve due to distraction, anxiety, or carelessness by the test takers (i.e., the slipping effect). Conventional item response theory (IRT) models do not take the slipping effect into consideration, which may violate the model-fit assumption in IRT. The 4-parameter logistic model (4PLM) includes a parameter for the misfit at the upper asymptote. Although the 4PLM has received more attention from researchers in recent years, there are few studies on its sample size requirements in the literature. The current study investigated the sample size requirements for parameter recovery in the 4PLM with a systematic simulation study design. Results indicated that the item parameters in the 4PLM can be estimated accurately when the sample size is at least 4,000, and the person parameters, excluding the extreme ends of the ability scale, can be estimated accurately with a sample size of at least 750.
{"title":"Sample Size Requirements for Parameter Recovery in the 4-Parameter Logistic Model","authors":"Ismail Cuhadar","doi":"10.1080/15366367.2021.1934805","DOIUrl":"https://doi.org/10.1080/15366367.2021.1934805","url":null,"abstract":"ABSTRACT In practice, some test items may display misfit at the upper-asymptote of item characteristic curve due to distraction, anxiety, or carelessness by the test takers (i.e., the slipping effect). The conventional item response theory (IRT) models do not take the slipping effect into consideration, which may violate the model fit assumption in IRT. The 4-parameter logistic model (4PLM) includes a parameter for the misfit at the upper-asymptote. Although the 4PLM took more attention by researchers in recent years, there are a few studies on the sample size requirements for the 4PLM in the literature. The current study investigated the sample size requirements for the parameter recovery in the 4PLM with a systematic simulation study design. Results indicated that the item parameters in the 4PLM can be estimated accurately when the sample size is at least 4000, and the person parameters, excluding the extreme ends of the ability scale, can be estimated accurately for the conditions with a sample size of at least 750.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":"39 1","pages":"57 - 72"},"PeriodicalIF":1.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77196196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using “metaSEM” Package in R
Pub Date: 2022-04-03 | DOI: 10.1080/15366367.2021.1991759
C. Hoi, R. Schumacker
ABSTRACT Over the last few decades, researchers have shown increasing interest in synthesizing data using the meta-analysis approach. While this method has provided new insights to the literature from findings drawn from secondary data, scholars in psychology and methodology have proposed integrating meta-analysis with structural equation modeling. In this vein, meta-analytic structural equation modeling (MASEM) with the two-step structural equation modeling (TSSEM) approach has been developed and implemented in the metaSEM package for R. Since its release in 2015, the metaSEM package, as well as the TSSEM approach, has been continually updated and modified. To promote its use, this study provides a software review of the metaSEM package and its code on the R platform. R code, figures, and initial interpretation of results are provided.
{"title":"Using “metaSEM” Package in R","authors":"C. Hoi, R. Schumacker","doi":"10.1080/15366367.2021.1991759","DOIUrl":"https://doi.org/10.1080/15366367.2021.1991759","url":null,"abstract":"ABSTRACT Over the last few decades, researchers have increased interests in synthesizing data using the meta-analysis approach. While this method has been able to provide new insights to the literature with findings drawn from secondary data, scholars in the field of Psychology and Methodology have been proposing the integration of meta-analysis with structural equation modeling approach. In this vein, the method of meta-analytic structural equation modeling (MASEM) with the two-step structural equation modeling (TSSEM) approach have been developed, corresponding with the metaSEM package for the use in R statistic package. Ever since its development in 2015, the metaSEM package as well as the TSSEM approach have still been constantly updated and modified. In order to promote the use, this study aims at providing a software review for the metaSEM package and its codes on the R platform. R codes, figures, as well as initial results interpretations are provided.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":"133 1","pages":"111 - 119"},"PeriodicalIF":1.0,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90704439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Handling Extreme Scores in Vertically Scaled Fixed-Length Computerized Adaptive Tests
Pub Date: 2022-01-02 | DOI: 10.1080/15366367.2021.1977583
Adam E. Wyse, J. McBride
ABSTRACT A common practical challenge is how to assign ability estimates to all-incorrect and all-correct response patterns when using item response theory (IRT) models and maximum likelihood estimation (MLE), since ability estimates for these types of responses equal −∞ or +∞. This article uses a simulation study and data from an operational K-12 computerized adaptive test (CAT) to compare how well several alternatives – including Bayesian maximum a posteriori (MAP) estimators, various MLE-based methods, and assigning constants – work as strategies for computing ability estimates for extreme scores in vertically scaled fixed-length Rasch-based CATs. Results suggested that the MLE-based methods, MAP estimators with prior standard deviations of 4 and above, and assigning constants achieved the desired outcomes: they produced finite ability estimates for all-correct and all-incorrect responses that were more extreme than the MLE values of students who got one item correct or one item incorrect, and more extreme than the difficulty of the items students saw during the CAT. Additional analyses showed that some methods can change in how much they differ in magnitude and variability from the MLE comparison values or the b values of the CAT items for all-correct versus all-incorrect responses and across grades. Specific discussion is given to how one may select a strategy for assigning ability estimates to extreme scores in vertically scaled fixed-length CATs that employ the Rasch model.
{"title":"Handling Extreme Scores in Vertically Scaled Fixed-Length Computerized Adaptive Tests","authors":"Adam E. Wyse, J. Mcbride","doi":"10.1080/15366367.2021.1977583","DOIUrl":"https://doi.org/10.1080/15366367.2021.1977583","url":null,"abstract":"ABSTRACT A common practical challenge is how to assign ability estimates to all incorrect and all correct response patterns when using item response theory (IRT) models and maximum likelihood estimation (MLE) since ability estimates for these types of responses equal −∞ or +∞. This article uses a simulation study and data from an operational K − 12 computerized adaptive test (CAT) to compare how well several alternatives – including Bayesian maximum a priori (MAP) estimators; various MLE based methods; and assigning constants – work as strategies for computing ability estimates for extreme scores in vertically scaled fixed-length Rasch-based CATs. Results suggested that the MLE-based methods, MAP estimators with prior standard deviations of 4 and above, and assigning constants achieved the desired outcomes of producing finite ability estimates for all correct and all incorrect responses that were more extreme than the MLE values of students that got one item correct or one item incorrect as well as being more extreme than the difficulty of the items students saw during the CAT. Additional analyses showed that it is possible for some methods to exhibit changes in how much they differ in magnitude and variability from the MLE comparison values or the b values of the CAT items for all correct versus all incorrect responses and across grades. Specific discussion is given to how one may select a strategy to assign ability estimates to extreme scores in vertically scaled fixed-length CATs that employ the Rasch model.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":"25 1","pages":"1 - 20"},"PeriodicalIF":1.0,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88487274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Measuring Adaptivity of an Adaptive Test
Pub Date: 2022-01-02 | DOI: 10.1080/15366367.2021.1922232
Zhongmin Cui
ABSTRACT Although many educational and psychological tests are labeled as computerized adaptive tests (CATs), not all tests show the same level of adaptivity – some tests might not adapt much because of various constraints imposed by test developers. Researchers have proposed indices to measure the amount of adaptation in an adaptive test. This article shows some limitations of the existing indices and proposes a new index of adaptivity. Its performance was evaluated in a simulation. The results show that the new index was able to overcome some of the limitations of the existing indices in the simulated scenarios.
{"title":"On Measuring Adaptivity of an Adaptive Test","authors":"Zhongmin Cui","doi":"10.1080/15366367.2021.1922232","DOIUrl":"https://doi.org/10.1080/15366367.2021.1922232","url":null,"abstract":"ABSTRACT Although many educational and psychological tests are labeled as computerized adaptive test (CAT), not all tests show the same level of adaptivity – some tests might not have much adaptation because of various constraints imposed by test developers. Researchers have proposed some indices to measure the amount of adaption for an adaptive test. This article shows some limitations of the existing indices. A new index of adaptivity is proposed in this article. Its performance was evaluated in a simulation. The results show that the new index was able to overcome some of the limitations of the existing indices in the simulated scenarios.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":"24 1","pages":"21 - 33"},"PeriodicalIF":1.0,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80780201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Psychometrics: An Introduction
Pub Date: 2022-01-02 | DOI: 10.1080/15366367.2021.1976089
A. Huggins-Manley
{"title":"Psychometrics: An Introduction","authors":"A. Huggins-Manley","doi":"10.1080/15366367.2021.1976089","DOIUrl":"https://doi.org/10.1080/15366367.2021.1976089","url":null,"abstract":"","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":"1 1","pages":"47 - 48"},"PeriodicalIF":1.0,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85589079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}