Tae Yeon Kwon, A. Corinne Huggins-Manley, Jonathan Templin, Mingying Zheng
In classroom assessments, examinees can often answer test items multiple times, resulting in sequential multiple-attempt data. Sequential diagnostic classification models (DCMs) have been developed for such data. As student learning processes may be aligned with a hierarchy of measured traits, this study aimed to develop a sequential hierarchical DCM (sequential HDCM), which combines a sequential DCM with the HDCM, and to investigate the classification accuracy of the model in the presence of hierarchies when multiple attempts are allowed in dynamic assessment. We investigated the model's impact on classification accuracy when hierarchical structures are correctly specified, misspecified, or overspecified. The results indicate that (1) a sequential HDCM accurately classified students as masters and nonmasters when the data had a hierarchical structure; (2) a sequential HDCM produced similar or slightly higher classification accuracy than a nonhierarchical sequential log-linear cognitive diagnosis model (LCDM) when the data had hierarchical structures; and (3) misspecification of the hierarchical structure resulted in lower classification accuracy when the misspecified model had fewer attribute profiles than the true model. We discuss limitations and make recommendations on using the proposed model in practice. This study provides practitioners with information about the possibilities for psychometric modeling of dynamic classroom assessment data.
{"title":"Modeling Hierarchical Attribute Structures in Diagnostic Classification Models with Multiple Attempts","authors":"Tae Yeon Kwon, A. Corinne Huggins-Manley, Jonathan Templin, Mingying Zheng","doi":"10.1111/jedm.12387","DOIUrl":"10.1111/jedm.12387","url":null,"abstract":"<p>In classroom assessments, examinees can often answer test items multiple times, resulting in sequential multiple-attempt data. Sequential diagnostic classification models (DCMs) have been developed for such data. As student learning processes may be aligned with a hierarchy of measured traits, this study aimed to develop a sequential hierarchical DCM (sequential HDCM), which combines a sequential DCM with the HDCM, and investigate classification accuracy of the model in the presence of hierarchies when multiple attempts are allowed in dynamic assessment. We investigated the model's impact on classification accuracy when hierarchical structures are correctly specified, misspecified, or overspecified. The results indicate that (1) a sequential HDCM accurately classified students as masters and nonmasters when the data had a hierarchical structure; (2) a sequential HDCM produced similar or slightly higher classification accuracy than nonhierarchical sequential LCDM when the data had hierarchical structures; and (3) the misspecification of the hierarchical structure of the data resulted in lower classification accuracy when the misspecified model had fewer attribute profiles than the true model. We discuss limitations and make recommendations on using the proposed model in practice. 
This study provides practitioners with information about the possibilities for psychometric modeling of dynamic classroom assessment data.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 2","pages":"198-218"},"PeriodicalIF":1.3,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140562989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
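The abstract above turns on how a hierarchy reduces the number of attribute profiles a model allows. As a minimal sketch, the snippet below enumerates the profiles permitted under a hypothetical three-attribute linear hierarchy. The function name, the prerequisite encoding, and the three-attribute example are illustrative assumptions and are not taken from the article.

```python
from itertools import product


def permissible_profiles(n_attributes, prerequisites):
    """Return the 0/1 attribute profiles that respect a hierarchy.

    prerequisites is a list of (a, b) pairs (0-based indices) meaning that
    attribute a must be mastered before attribute b can be mastered.
    This encoding is an assumption made for illustration only.
    """
    profiles = []
    for profile in product((0, 1), repeat=n_attributes):
        # A profile is allowed only if no attribute is mastered
        # without its prerequisite also being mastered.
        if all(profile[a] >= profile[b] for a, b in prerequisites):
            profiles.append(profile)
    return profiles


# No hierarchy: all 2^3 profiles are possible.
unstructured = permissible_profiles(3, [])

# Hypothetical linear hierarchy: attribute 0 -> attribute 1 -> attribute 2.
linear = permissible_profiles(3, [(0, 1), (1, 2)])

print(len(unstructured))  # 8
print(linear)  # [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
```

Under these assumptions, a model that imposes the linear hierarchy has only four latent classes instead of eight. This illustrates why a model that wrongly imposes a hierarchy has fewer attribute profiles than the true model and so cannot represent some examinees.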
Research has shown that multiple-indicator multiple-cause (MIMIC) models can result in inflated Type I error rates in detecting differential item functioning (DIF) when the assumption of equal latent variance is violated. This study explains how violating the equal variance assumption adversely impacts the detection of nonuniform DIF and how the problem can be addressed with a moderated nonlinear factor analysis (MNLFA) model estimated via a Bayesian approach, which overcomes the limitations of the restrictive assumption. The Bayesian MNLFA approach suggested in this study better controls Type I errors by freely estimating latent factor variances across groups. Our experiments with simulated data demonstrate that the Bayesian MNLFA (BMNFA) models outperform the existing MIMIC models in terms of Type I error control as well as parameter recovery. The results suggest that MNLFA models have the potential to be a superior choice to the existing MIMIC models, especially in situations where the assumption of equal latent variance is not likely to hold.
{"title":"A Bayesian Moderated Nonlinear Factor Analysis Approach for DIF Detection under Violation of the Equal Variance Assumption","authors":"Sooyong Lee, Suhwa Han, Seung W. Choi","doi":"10.1111/jedm.12388","DOIUrl":"10.1111/jedm.12388","url":null,"abstract":"<p>Research has shown that multiple-indicator multiple-cause (MIMIC) models can result in inflated Type I error rates in detecting differential item functioning (DIF) when the assumption of equal latent variance is violated. This study explains how the violation of the equal variance assumption adversely impacts the detection of nonuniform DIF and how it can be addressed through moderated nonlinear factor analysis (MNLFA) model via Bayesian estimation approach to overcome limitations from the restrictive assumption. The Bayesian MNLFA approach suggested in this study better control Type I errors by freely estimating latent factor variances across different groups. Our experimentation with simulated data demonstrates that the BMNFA models outperform the existing MIMIC models, in terms of Type I error control as well as parameter recovery. The results suggest that the MNLFA models have the potential to be a superior choice to the existing MIMIC models, especially in situations where the assumption of equal latent variance assumption is not likely to hold.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 2","pages":"303-324"},"PeriodicalIF":1.3,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140153862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multidimensional achievement tests have recently been gaining importance in educational and psychological measurement. For example, multidimensional diagnostic tests can help students determine which particular domain of knowledge they need to improve for better performance. To estimate the characteristics of candidate items (calibration) for future multidimensional achievement tests, we use optimal design theory. We generalize a previously developed exchange algorithm for optimal design computation to the multidimensional setting. We also develop an asymptotic theorem stating which item should be calibrated by examinees with extreme abilities. For several examples, we compute the optimal design numerically with the exchange algorithm. We see clear structures in these results and explain them using the asymptotic theorem. Moreover, we investigate the performance of the optimal design in a simulation study.
{"title":"Optimal Calibration of Items for Multidimensional Achievement Tests","authors":"Mahmood Ul Hassan, Frank Miller","doi":"10.1111/jedm.12386","DOIUrl":"10.1111/jedm.12386","url":null,"abstract":"<p>Multidimensional achievement tests are recently gaining more importance in educational and psychological measurements. For example, multidimensional diagnostic tests can help students to determine which particular domain of knowledge they need to improve for better performance. To estimate the characteristics of candidate items (calibration) for future multidimensional achievement tests, we use optimal design theory. We generalize a previously developed exchange algorithm for optimal design computation to the multidimensional setting. We also develop an asymptotic theorem saying which item should be calibrated by examinees with extreme abilities. For several examples, we compute the optimal design numerically with the exchange algorithm. We see clear structures in these results and explain them using the asymptotic theorem. Moreover, we investigate the performance of the optimal design in a simulation study.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 2","pages":"274-302"},"PeriodicalIF":1.3,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12386","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140153871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carmen Köhler, Lale Khorramdel, Artur Pokropek, Johannes Hartig
For assessment scales applied to different groups (e.g., students from different states; patients in different countries), multigroup differential item functioning (MG-DIF) needs to be evaluated in order to ensure that respondents with the same trait level but from different groups have equal response probabilities on a particular item. The current study compares two approaches for DIF detection: a multiple-group item response theory (MG-IRT) model and a generalized linear mixed model (GLMM). In the MG-IRT model approach, item parameters are constrained to be equal across groups and DIF is evaluated for each item in each group. In the GLMM, groups are treated as random, and item difficulties are modeled as correlated random effects with a joint multivariate normal distribution. Its nested structure allows the estimation of item difficulty variances and covariances at the group level. We use an excerpt from the PISA 2015 reading domain as an exemplary empirical investigation, and conduct a simulation study to compare the performance of the two approaches. Results from the empirical investigation show that the detection of countries with DIF is similar in both approaches. Results from the simulation study confirm this finding and indicate slight advantages of the MG-IRT model approach.
{"title":"DIF Detection for Multiple Groups: Comparing Three-Level GLMMs and Multiple-Group IRT Models","authors":"Carmen Köhler, Lale Khorramdel, Artur Pokropek, Johannes Hartig","doi":"10.1111/jedm.12384","DOIUrl":"10.1111/jedm.12384","url":null,"abstract":"<p>For assessment scales applied to different groups (e.g., students from different states; patients in different countries), multigroup differential item functioning (MG-DIF) needs to be evaluated in order to ensure that respondents with the same trait level but from different groups have equal response probabilities on a particular item. The current study compares two approaches for DIF detection: a multiple-group item response theory (MG-IRT) model and a generalized linear mixed model (GLMM). In the MG-IRT model approach, item parameters are constrained to be equal across groups and DIF is evaluated for each item in each group. In the GLMM, groups are treated as random, and item difficulties are modeled as correlated random effects with a joint multivariate normal distribution. Its nested structure allows the estimation of item difficulty variances and covariances at the group level. We use an excerpt from the PISA 2015 reading domain as an exemplary empirical investigation, and conduct a simulation study to compare the performance of the two approaches. Results from the empirical investigation show that the detection of countries with DIF is similar in both approaches. 
Results from the simulation study confirm this finding and indicate slight advantages of the MG-IRT model approach.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 2","pages":"325-344"},"PeriodicalIF":1.3,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12384","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139927955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
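The abstract above describes DIF as unequal response probabilities for respondents with the same trait level. The sketch below illustrates that definition with a Rasch model, chosen here only for simplicity. The article's MG-IRT model and GLMM are not reproduced, and the difficulty values and the size of the group-specific shift are hypothetical.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


theta = 0.0            # same trait level for both respondents
base_difficulty = 0.0  # hypothetical item difficulty in the reference group
dif_shift = 0.5        # hypothetical extra difficulty in the focal group

p_reference = rasch_probability(theta, base_difficulty)
p_focal = rasch_probability(theta, base_difficulty + dif_shift)

# Equal trait level but unequal success probability indicates DIF on this item.
print(f"reference group: {p_reference:.3f}")  # reference group: 0.500
print(f"focal group:     {p_focal:.3f}")      # focal group:     0.378
```

In this simplified setting, detecting DIF amounts to deciding whether the group-specific shift differs from zero for a given item and group.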
I propose two practical advances to the argument-based approach to validity: developing a living document and incorporating preregistration. First, I present a potential structure for the living document that includes an up-to-date summary of the validity argument. As the validation process may span multiple studies, the living document allows future users of the instrument to access the entire validity argument in one place. Second, I describe how preregistration can be incorporated in the argument-based approach. Specifically, I distinguish between two types of preregistration: preregistration of the argument and preregistration of validation studies. Preregistration of the argument is a single preregistration that is specified for the entire validation process. Here, the developer specifies interpretations, uses, and claims before collecting validity evidence. Preregistration of a validation study refers to preregistering a single validation study that aims to evaluate a set of claims. Here, the developer describes study components (e.g., research design, data collection, and data analysis) before collecting data. Both preregistration types have the potential to reduce the risk of bias (e.g., hindsight and confirmation biases) and to allow others to evaluate that risk and, hence, calibrate their confidence in the developer's evaluation of the validity argument.
{"title":"Argument-Based Approach to Validity: Developing a Living Document and Incorporating Preregistration","authors":"Daria Gerasimova","doi":"10.1111/jedm.12385","DOIUrl":"10.1111/jedm.12385","url":null,"abstract":"<p>I propose two practical advances to the argument-based approach to validity: developing a living document and incorporating preregistration. First, I present a potential structure for the living document that includes an up-to-date summary of the validity argument. As the validation process may span across multiple studies, the living document allows future users of the instrument to access the entire validity argument in one place. Second, I describe how preregistration can be incorporated in the argument-based approach. Specifically, I distinguish between two types of preregistration: preregistration of the argument and preregistration of validation studies. Preregistration of the argument is a single preregistration that is specified for the entire validation process. Here, the developer specifies interpretations, uses, and claims before collecting validity evidence. Preregistration of a validation study refers to preregistering a single validation study that aims to evaluate a set of claims. Here, the developer describes study components (e.g., research design, data collection, data analysis, etc.), before collecting data. 
Both preregistration types have the potential to reduce the risk of bias (e.g., hindsight and confirmation biases), as well as to allow others to evaluate the risk of bias and, hence, calibrate confidence, in the developer's evaluation of the validity argument.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 2","pages":"252-273"},"PeriodicalIF":1.3,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139837198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenchao Ma, Miguel A. Sorrel, Xiaoming Zhai, Yuan Ge
Most existing diagnostic models are developed to detect whether students have mastered a set of skills of interest, but few have focused on identifying what scientific misconceptions students possess. This article developed a general dual-purpose model for simultaneously estimating students' overall ability and the presence or absence of misconceptions. An expectation-maximization algorithm was developed to estimate the model parameters. A simulation study was conducted to evaluate to what extent the parameters can be accurately recovered under varied conditions. A set of real data in science education was also analyzed to examine the viability of the proposed model in practice.
{"title":"A Dual-Purpose Model for Binary Data: Estimating Ability and Misconceptions","authors":"Wenchao Ma, Miguel A. Sorrel, Xiaoming Zhai, Yuan Ge","doi":"10.1111/jedm.12383","DOIUrl":"10.1111/jedm.12383","url":null,"abstract":"<p>Most existing diagnostic models are developed to detect whether students have mastered a set of skills of interest, but few have focused on identifying what scientific misconceptions students possess. This article developed a general dual-purpose model for simultaneously estimating students' overall ability and the presence and absence of misconceptions. The expectation-maximization algorithm was developed to estimate the model parameters. A simulation study was conducted to evaluate to what extent the parameters can be accurately recovered under varied conditions. A set of real data in science education was also analyzed to examine the viability of the proposed model in practice.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 2","pages":"179-197"},"PeriodicalIF":1.3,"publicationDate":"2024-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139373800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The highly adaptive testing (HAT) design is introduced as an alternative test design for the Programme for International Student Assessment (PISA). The principle of HAT is to be as adaptive as possible when selecting items while accounting for PISA's nonstatistical constraints and addressing issues concerning PISA such as item position effects. HAT combines established methods from the field of computerized adaptive testing. It is implemented in R, and code is provided. HAT was compared to the PISA 2018 multistage design (MST) in a simulation study based on a factorial design with the independent variables response probability (RP; .50, .62), item pool optimality (PISA 2018, optimal), and ability level (low, medium, high). PISA-specific conditions regarding sample size, missing responses, and nonstatistical constraints were implemented. HAT clearly outperformed MST regarding test information, RMSE, and constraint management across ability groups, but it showed slightly weaker item exposure. Raising RP to .62 did not decrease test information much and is therefore a viable option to foster students’ test-taking experience with HAT. Test information for HAT was up to three times higher than for MST when using a hypothetical optimal item pool. In summary, HAT proved to be a promising and applicable test design for PISA.
{"title":"A Highly Adaptive Testing Design for PISA","authors":"Andreas Frey, Christoph König, Aron Fink","doi":"10.1111/jedm.12382","DOIUrl":"https://doi.org/10.1111/jedm.12382","url":null,"abstract":"The highly adaptive testing (HAT) design is introduced as an alternative test design for the Programme for International Student Assessment (PISA). The principle of HAT is to be as adaptive as possible when selecting items while accounting for PISA's nonstatistical constraints and addressing issues concerning PISA such as item position effects. HAT combines established methods from the field of computerized adaptive testing. It is implemented in R and code is provided. HAT was compared to the PISA 2018 multistage design (MST) in a simulation study based on a factorial design with the independent variables response probability (RP; .50, .62), item pool optimality (PISA 2018, optimal), and ability level (low, medium, high). PISA-specific conditions regarding sample size, missing responses, and nonstatistical constraints were implemented. HAT clearly outperformed MST regarding test information, RMSE, and constraint management across ability groups but it showed slightly weaker item exposure. Raising RP to .62 did not decrease test information much and is therefore a viable option to foster students’ test-taking experience with HAT. Test information for HAT was up to three times higher than for MST when using a hypothetical optimal item pool. 
Summarizing, HAT proved to be a promising and applicable test design for PISA.","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"13 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138539704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Highly Adaptive Testing Design for PISA","authors":"Andreas Frey, Christoph König, Aron Fink","doi":"10.1111/jedm.12382","DOIUrl":"10.1111/jedm.12382","url":null,"abstract":"<p>The highly adaptive testing (HAT) design is introduced as an alternative test design for the Programme for International Student Assessment (PISA). The principle of HAT is to be as adaptive as possible when selecting items while accounting for PISA's nonstatistical constraints and addressing issues concerning PISA such as item position effects. HAT combines established methods from the field of computerized adaptive testing. It is implemented in R and code is provided. HAT was compared to the PISA 2018 multistage design (MST) in a simulation study based on a factorial design with the independent variables response probability (RP; .50, .62), item pool optimality (PISA 2018, optimal), and ability level (low, medium, high). PISA-specific conditions regarding sample size, missing responses, and nonstatistical constraints were implemented. HAT clearly outperformed MST regarding test information, RMSE, and constraint management across ability groups but it showed slightly weaker item exposure. Raising RP to .62 did not decrease test information much and is therefore a viable option to foster students’ test-taking experience with HAT. Test information for HAT was up to three times higher than for MST when using a hypothetical optimal item pool. 
Summarizing, HAT proved to be a promising and applicable test design for PISA.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 3","pages":"415-437"},"PeriodicalIF":1.6,"publicationDate":"2023-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12382","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138539665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
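The HAT abstract reports that raising the response probability (RP) from .50 to .62 cost little test information. A rough way to see why is the item information of a Rasch item, which equals p(1 - p) when the examinee answers correctly with probability p. The sketch below assumes a Rasch model and a single item targeted exactly at the chosen RP. This is a simplification for illustration and not the study's actual model or simulation setup.

```python
def rasch_item_information(p_correct):
    """Fisher information of a Rasch item answered correctly with probability p."""
    return p_correct * (1.0 - p_correct)


info_rp50 = rasch_item_information(0.50)  # item targeted at RP = .50
info_rp62 = rasch_item_information(0.62)  # item targeted at RP = .62
relative_information = info_rp62 / info_rp50

print(f"{info_rp50:.4f} {info_rp62:.4f} {relative_information:.4f}")
# 0.2500 0.2356 0.9424
```

Under these assumptions, an item targeted at RP = .62 retains about 94% of the information of an item targeted at RP = .50, which is consistent in direction with the abstract's finding.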
Culturally responsive assessments have been proposed as potential tools to ensure equity and fairness for examinees from all backgrounds, including those from traditionally underserved or minoritized groups. However, these assessments are relatively new and, with few exceptions, are yet to be implemented on a large scale. Consequently, there is a lack of guidance on how one can compute comparable scores on various versions of these assessments. In this paper, the multigroup multidimensional Rasch model is repurposed for modeling data originating from various versions of a culturally responsive assessment and for analyzing such data to compute comparable scores. Two simulation studies are performed to evaluate the performance of the model for data simulated from hypothetical culturally responsive assessments and to find the conditions under which the computed scores are accurate. Recommendations are made for measurement practitioners interested in culturally responsive assessments.
{"title":"Computation and Accuracy Evaluation of Comparable Scores on Culturally Responsive Assessments","authors":"Sandip Sinharay, Matthew S. Johnson","doi":"10.1111/jedm.12381","DOIUrl":"10.1111/jedm.12381","url":null,"abstract":"<p>Culturally responsive assessments have been proposed as potential tools to ensure equity and fairness for examinees from all backgrounds including those from traditionally underserved or minoritized groups. However, these assessments are relatively new and, with few exceptions, are yet to be implemented in large scale. Consequently, there is a lack of guidance on how one can compute comparable scores on various versions of these assessments. In this paper, the multigroup multidimensional Rasch model is repurposed for modeling data originating from various versions of a culturally responsive assessment and for analyzing such data to compute comparable scores. Two simulation studies are performed to evaluate the performance of the model for data simulated from hypothetical culturally responsive assessments and to find the conditions under which the computed scores are accurate. Recommendations are made for measurement practitioners interested in culturally responsive assessments.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 1","pages":"5-46"},"PeriodicalIF":1.3,"publicationDate":"2023-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138539727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computation and Accuracy Evaluation of Comparable Scores on Culturally Responsive Assessments","authors":"Sandip Sinharay, Matthew S. Johnson","doi":"10.1111/jedm.12381","DOIUrl":"https://doi.org/10.1111/jedm.12381","url":null,"abstract":"Culturally responsive assessments have been proposed as potential tools to ensure equity and fairness for examinees from all backgrounds including those from traditionally underserved or minoritized groups. However, these assessments are relatively new and, with few exceptions, are yet to be implemented in large scale. Consequently, there is a lack of guidance on how data on how one can compute comparable scores on various versions of these assessments. In this paper, the multigroup multidimensional Rasch model is repurposed for modeling data originating from various versions of a culturally responsive assessment and for analyzing such data to compute comparable scores. Two simulation studies are performed to evaluate the performance of the model for data simulated from hypothetical culturally responsive assessments and to find the conditions under which the computed scores are accurate. Recommendations are made for measurement practitioners interested in culturally responsive assessments.","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"44 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138539685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}