Enacting a Process for Developing Culturally Relevant Classroom Assessments
Pub Date: 2023-05-25 | DOI: 10.1080/08957347.2023.2214652
Eowyn P. O’Dwyer, Jesse R. Sparks, Leslie Nabors Oláh
ABSTRACT A critical aspect of the development of culturally relevant classroom assessments is the design of tasks that affirm students’ racial and ethnic identities and community cultural practices. This paper describes the process we followed to build a shared understanding of what culturally relevant assessments are, to pursue ways of bringing more diverse voices and perspectives into the development process to generate new ideas and further our understanding, and finally to integrate those understandings and findings into the design of scenario-based tasks (ETS Testlets). We describe our engagement with the research literature and with employee-led affinity groups, students, and external consultants. In synthesizing their advice and feedback, we identified five design principles that scenario-based assessment developers can incorporate into their own work. These principles are then applied to the development of a scenario-based assessment task. Finally, we reflect on our process and the challenges we faced to inform future advancements in the field.
{"title":"Enacting a Process for Developing Culturally Relevant Classroom Assessments","authors":"Eowyn P. O’Dwyer, Jesse R. Sparks, Leslie Nabors Oláh","doi":"10.1080/08957347.2023.2214652","DOIUrl":"https://doi.org/10.1080/08957347.2023.2214652","url":null,"abstract":"ABSTRACT A critical aspect of the development of culturally relevant classroom assessments is the design of tasks that affirm students’ racial and ethnic identities and community cultural practices. This paper describes the process we followed to build a shared understanding of what culturally relevant assessments are, to pursue ways of bringing more diverse voices and perspectives into the development process to generate new ideas and further our understanding, and finally to integrate those understandings and findings into the design of scenario-based tasks (ETS Testlets). This paper describes our engagement with research literature and employee-led affinity groups, students, and external consultants. In synthesizing their advice and feedback, we identified five design principles that scenario-based assessment developers can incorporate into their own work. These principles are then applied to the development of a scenario-based assessment task. Finally, we reflect on our process and challenges faced to inform future advancements in the field.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"286 - 303"},"PeriodicalIF":1.5,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49204043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applying a Culturally Responsive Pedagogical Framework to Design and Evaluate Classroom Performance-Based Assessments in Hawai‘i
Pub Date: 2023-05-20 | DOI: 10.1080/08957347.2023.2214655
Carla M. Evans
ABSTRACT Previous writings focus on why centering assessment design around students’ cultural, social, and/or linguistic diversity is important and how performance-based assessment can support such aims. This article extends previous work by describing how a culturally responsive classroom assessment framework was created from a culturally responsive education (CRE) pedagogical framework. The goal of the framework was to guide the design and evaluation of curriculum-embedded, classroom performance assessments. Components discussed include: modification of evidence-centered design processes, teacher and/or student adaptation of construct-irrelevant aspects of task prompts, addition of cultural meaningfulness questions to think alouds, and revision of task quality review protocols to promote CRE design features. Future research is needed to explore the limitations of the framework as applied, and the extent to which students perceive that the classroom summative assessments designed in this way do indeed allow them to better show all they know and can do in ways related to their cultural, social, and/or linguistic identities.
{"title":"Applying a Culturally Responsive Pedagogical Framework to Design and Evaluate Classroom Performance-Based Assessments in Hawai‘i","authors":"Carla M. Evans","doi":"10.1080/08957347.2023.2214655","DOIUrl":"https://doi.org/10.1080/08957347.2023.2214655","url":null,"abstract":"ABSTRACT Previous writings focus on why centering assessment design around students’ cultural, social, and/or linguistic diversity is important and how performance-based assessment can support such aims. This article extends previous work by describing how a culturally responsive classroom assessment framework was created from a culturally responsive education (CRE) pedagogical framework. The goal of the framework was to guide the design and evaluation of curriculum-embedded, classroom performance assessments. Components discussed include: modification of evidence-centered design processes, teacher and/or student adaptation of construct irrelevant aspects of task prompts, addition of cultural meaningfulness questions to think alouds, and revision of task quality review protocols to promote CRE design features. Future research is needed to explore the limitations of the framework applied, and the extent to which students perceive the classroom summative assessments designed do indeed allow them to better show all they know and can do in ways related to their cultural, social, and/or linguistic identities.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"269 - 285"},"PeriodicalIF":1.5,"publicationDate":"2023-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46027666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Validity and Racial Justice in Educational Assessment
Pub Date: 2023-05-20 | DOI: 10.1080/08957347.2023.2214654
Josh Lederman
Abstract Because validity is central to assessment, until the concept of validity includes concern for racial justice, such matters will be seen as residing outside the “real” work of validation, rendering them powerless to count against the apparent scientific merit of a test. As the definition of validity has evolved, however, it holds great potential to centralize matters like racial (in)justice, positioning them as necessary validity evidence. This article reviews a history of debates over what validity should and shouldn’t encompass; we then look toward the more centralized stances on validity – the book series Standards and Educational Measurement – where we see that test use, and the social impact of test use, have been mounting concerns over the years within these publications. Finally, we explore Kane’s argument-based approach to validation, which I argue could elevate racial justice concerns by centralizing them within the very notion of what makes assessment valid or invalid.
{"title":"Validity and Racial Justice in Educational Assessment","authors":"Josh Lederman","doi":"10.1080/08957347.2023.2214654","DOIUrl":"https://doi.org/10.1080/08957347.2023.2214654","url":null,"abstract":"Abstract Given its centrality to assessment, until the concept of validity includes concern for racial justice, such matters will be seen as residing outside the “real” work of validation, rendering them powerless to count against the apparent scientific merit of the test. As the definition of validity has evolved, however, it holds great potential to centralize matters like racial (in)justice, positioning them as necessary validity evidence. This article reviews a history of debates over what validity should and shouldn’t encompass; we then look toward the more centralized stances on validity – the book series Standards and Educational Measurement – where we see that test use, and the social impact of test use, has been a mounting concern over the years within these publications. Finally, we explore Kane’s argument-based approach to validation, which I argue could impact racial justice concerns by centralizing them within the very notion of what makes assessment valid or invalid.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"242 - 254"},"PeriodicalIF":1.5,"publicationDate":"2023-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41535738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
College Admissions and Testing in a Time of Transformational Change
Pub Date: 2023-04-03 | DOI: 10.1080/08957347.2023.2201705
Ross E. Markle
{"title":"College Admissions and Testing in a Time of Transformational Change","authors":"Ross E. Markle","doi":"10.1080/08957347.2023.2201705","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201705","url":null,"abstract":"conversation","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"132 - 136"},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42510663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keeping Up the PACE: Evaluating Grade 8 Student Achievement Outcomes for New Hampshire’s Innovative Assessment System
Pub Date: 2023-04-03 | DOI: 10.1080/08957347.2023.2201700
Alexandra Lane Perez, Carla M. Evans
ABSTRACT New Hampshire’s Performance Assessment of Competency Education (PACE) innovative assessment system uses student scores from classroom performance assessments as well as other classroom tests for school accountability purposes. One concern is that not having annual state testing may weaken schools’ and teachers’ incentive to teach the full breadth of the state content standards. This study examined the effects of PACE on Grade 8 test scores after 5 years of implementation using propensity score matching followed by hierarchical linear modeling. The results suggest that PACE students perform about the same, on average, in mathematics and ELA as non-PACE students on the state assessment. There was no evidence of differential effects for students who had an individualized education program or who received free or reduced-price lunch (FRL). Findings for this limited sample suggest schools and teachers did not sacrifice the breadth of students’ opportunity to learn the state content standards while piloting a state performance assessment reform.
{"title":"Keeping Up the PACE: Evaluating Grade 8 Student Achievement Outcomes for New Hampshire’s Innovative Assessment System","authors":"Alexandra Lane Perez, Carla M. Evans","doi":"10.1080/08957347.2023.2201700","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201700","url":null,"abstract":"ABSTRACT New Hampshire’s Performance Assessment of Competency Education (PACE) innovative assessment system uses student scores from classroom performance assessments as well as other classroom tests for school accountability purposes. One concern is that not having annual state testing may incentivize schools and teachers away from teaching the breadth of the state content standards. This study examined the effects of PACE on Grade 8 test scores after 5 years of implementation using propensity score matching followed by hierarchical linear modeling. The results suggest that PACE students perform about the same, on average, in mathematics and ELA as non-PACE students on the state assessment. There was no evidence of differential effects for students who had an individualized education program or were granted FRL. Findings for this limited sample suggest schools and teachers did not sacrifice the breadth of students’ opportunity to learn the state content standards while piloting a state performance assessment reform.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"137 - 156"},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48459890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparing Drift Detection Methods for Accurate Rasch Equating in Different Sample Sizes
Pub Date: 2023-04-03 | DOI: 10.1080/08957347.2023.2201704
Sarah Alahmadi, Andrew T. Jones, Carol L. Barry, Beatriz Ibáñez
ABSTRACT Rasch common-item equating is often used in high-stakes testing to maintain equivalent passing standards across test administrations. If unaddressed, item parameter drift poses a major threat to the accuracy of Rasch common-item equating. We compared the performance of well-established and newly developed drift detection methods in small and large sample sizes, varying the proportion of test items used as anchor (common) items and the proportion of drifted anchors. In the simulated-data study, the most accurate equating was obtained in large-sample conditions with a small-moderate number of drifted anchors using the mINFIT/mOUTFIT methods. However, when any drift was present in small-sample conditions and when a large number of drifted anchors were present in large-sample conditions, all methods performed ineffectively. In the operational-data study, percent-correct standards and failure rates varied across the methods in the large-sample exam but not in the small-sample exam. Different recommendations for high- and low-volume testing programs are provided.
{"title":"Comparing Drift Detection Methods for Accurate Rasch Equating in Different Sample Sizes","authors":"Sarah Alahmadi, Andrew T. Jones, Carol L. Barry, Beatriz Ibáñez","doi":"10.1080/08957347.2023.2201704","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201704","url":null,"abstract":"ABSTRACT Rasch common-item equating is often used in high-stakes testing to maintain equivalent passing standards across test administrations. If unaddressed, item parameter drift poses a major threat to the accuracy of Rasch common-item equating. We compared the performance of well-established and newly developed drift detection methods in small and large sample sizes, varying the proportion of test items used as anchor (common) items and the proportion of drifted anchors. In the simulated-data study, the most accurate equating was obtained in large-sample conditions with a small-moderate number of drifted anchors using the mINFIT/mOUTFIT methods. However, when any drift was present in small-sample conditions and when a large number of drifted anchors were present in large-sample conditions, all methods performed ineffectively. In the operational-data study, percent-correct standards and failure rates varied across the methods in the large-sample exam but not in the small-sample exam. Different recommendations for high- and low-volume testing programs are provided.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"157 - 170"},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42571395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Group Generalizations of SIBTEST and Crossing-SIBTEST
Pub Date: 2023-04-03 | DOI: 10.1080/08957347.2023.2201703
R. P. Chalmers, Guoguo Zheng
ABSTRACT This article presents generalizations of the SIBTEST and crossing-SIBTEST statistics for differential item functioning (DIF) investigations involving more than two groups. After reviewing the original two-group setup for these statistics, a set of multigroup generalizations that support contrast matrices for joint tests of DIF is presented. To investigate the Type I error and power behavior of these generalizations, a Monte Carlo simulation study was conducted. Results indicated that the proposed generalizations are reasonably effective at recovering their respective population parameter definitions, maintain optimal Type I error control, have suitable power to detect uniform and non-uniform DIF, and in shorter tests are competitive with the generalized logistic regression and generalized Mantel–Haenszel tests for DIF.
{"title":"Multi-Group Generalizations of SIBTEST and Crossing-SIBTEST","authors":"R. P. Chalmers, Guoguo Zheng","doi":"10.1080/08957347.2023.2201703","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201703","url":null,"abstract":"ABSTRACT This article presents generalizations of SIBTEST and crossing-SIBTEST statistics for differential item functioning (DIF) investigations involving more than two groups. After reviewing the original two-group setup for these statistics, a set of multigroup generalizations that support contrast matrices for joint tests of DIF are presented. To investigate the Type I error and power behavior of these generalizations, a Monte Carlo simulation study was then explored. Results indicated that the proposed generalizations are reasonably effective at recovering their respective population parameter definitions, maintain optimal Type I error control, have suitable power to detect uniform and non-uniform DIF, and in shorter tests are competitive with the generalized logistic regression and generalized Mantel–Haenszel tests for DIF.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"171 - 191"},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44337447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tracking Ordinal Development of Skills with a Longitudinal DINA Model with Polytomous Attributes
Pub Date: 2023-04-03 | DOI: 10.1080/08957347.2023.2201702
P. Zhan, Yao-sen Liu, Zhaohui Yu, Yanfang Pan
ABSTRACT Many educational and psychological studies have shown that students generally develop step by step (i.e., ordinal development) toward a specific level. This study proposed a novel longitudinal learning diagnosis model with polytomous attributes to track students’ ordinal development in learning. Using the concept of polytomous attributes in the proposed model, the learning process for a specific skill, from non-mastery to mastery, can be divided into multiple ordinal steps in order to better characterize the learning trajectory. The results of an empirical study conducted to explore the performance of the proposed model indicated that it could adequately diagnose the ordinal development of skills in longitudinal assessments. A simulation study was also conducted to examine the estimation accuracy of general ability and the classification accuracy of attributes under the proposed model in different simulated conditions.
{"title":"Tracking Ordinal Development of Skills with a Longitudinal DINA Model with Polytomous Attributes","authors":"P. Zhan, Yao-sen Liu, Zhaohui Yu, Yanfang Pan","doi":"10.1080/08957347.2023.2201702","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201702","url":null,"abstract":"ABSTRACT Many educational and psychological studies have shown that the development of students is generally step-by-step (i.e. ordinal development) to a specific level. This study proposed a novel longitudinal learning diagnosis model with polytomous attributes to track students’ ordinal development in learning. Using the concept of polytomous attributes in the proposed model, the learning process of a specific skill, from non-mastery to mastery, can be divided into multiple ordinal steps in order to better characterize the learning trajectory. The results of an empirical study conducted to explore the performance of the proposed model indicated that it could adequately diagnose the ordinal development of skills in longitudinal assessments. A simulation study was also conducted to examine the estimation accuracy of general ability and the classification accuracy of attributes of the proposed model in different simulated conditions.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"99 - 114"},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47235911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measurement Invariance in Relation to First Language: An Evaluation of German Reading and Spelling Tests
Pub Date: 2023-04-03 | DOI: 10.1080/08957347.2023.2201701
L. Visser, Friederike Cartschau, Ariane von Goldammer, Janin Brandenburg, M. Timmerman, M. Hasselhorn, C. Mähler
ABSTRACT The growing number of children in primary schools in Germany who have German as their second language (L2) has raised questions about the fairness of performance assessment. Fair tests are a prerequisite for distinguishing between an L2 learning delay and a specific learning disability. We evaluated five commonly used reading and spelling tests for measurement invariance (MI) as a function of first language (German vs. other). Multi-group confirmatory factor analyses revealed strict MI for the Weingarten Basic Vocabulary Spelling Tests (WRTs) 3+ and 4+ and the Salzburger Reading (SLT) and Spelling (SRT) Tests, suggesting that these instruments are suitable for assessing reading and spelling skills regardless of first language. MI for A Reading Comprehension Test for First to Seventh Graders – 2nd Edition (ELFE II) was only partly strict, with unequal intercepts for the text subscale. We discuss the implications of this finding for assessing the reading performance of children with German as their L2.
{"title":"Measurement Invariance in Relation to First Language: An Evaluation of German Reading and Spelling Tests","authors":"L. Visser, Friederike Cartschau, Ariane von Goldammer, Janin Brandenburg, M. Timmerman, M. Hasselhorn, C. Mähler","doi":"10.1080/08957347.2023.2201701","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201701","url":null,"abstract":"ABSTRACT The growing number of children in primary schools in Germany who have German as their second language (L2) has raised questions about the fairness of performance assessment. Fair tests are a prerequisite for distinguishing between L2 learning delay and a specific learning disability. We evaluated five commonly used reading and spelling tests for measurement invariance (MI) as a function of first language (German vs. other). Multi-group confirmatory factor analyses revealed strict MI for the Weingarten Basic Vocabulary Spelling Tests (WRTs) 3+ and 4+ and the Salzburger Reading (SLT) and Spelling (SRT) Tests, suggesting these instruments are suitable for assessing reading and spelling skills regardless of first language. The MI for A Reading Comprehension Test for First to Seventh Graders – 2nd Edition (ELFE II) was partly strict with unequal intercepts for the text subscale. We discuss the implications of this finding for assessing reading performance of children with L2.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"115 - 131"},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"59806259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Census-Level, Multi-Grade Analysis of the Association Between Testing Time, Breaks, and Achievement
Pub Date: 2023-01-02 | DOI: 10.1080/08957347.2023.2172019
David Rutkowski, Leslie Rutkowski, Dubravka Svetina Valdivia, Yusuf Canbolat, Stephanie Underhill
ABSTRACT Several states in the US have removed time limits on their state assessments. In Indiana, where this study takes place, the state assessment is untimed during the testing window and allows unlimited breaks during the testing session. Using grade 3 and 8 math and English state assessment data, we focus on the time used for testing and examine whether students who take more time tend to outperform their peers. We also examine whether the number of breaks students take is associated with student achievement scores. Findings suggest that even in an untimed setting, there remains a strong association between time spent on the assessment and achievement at both the student and school level. The number of breaks, on the other hand, shows little to no association with achievement after controlling for time. The paper concludes with a discussion of the policy implications of the findings.
{"title":"A Census-Level, Multi-Grade Analysis of the Association Between Testing Time, Breaks, and Achievement","authors":"david. rutkowski, Leslie Rutkowski, Dubravka Svetina Valdivia, Yusuf Canbolat, Stephanie Underhill","doi":"10.1080/08957347.2023.2172019","DOIUrl":"https://doi.org/10.1080/08957347.2023.2172019","url":null,"abstract":"ABSTRACT Several states in the US have removed time limits on their state assessments. In Indiana, where this study takes place, the state assessment is both untimed during the testing window and allows unlimited breaks during the testing session. Using grade 3 and 8 math and English state assessment data, in this paper we focus on time used for testing and examine whether students who take more time tend to outperform their peers. Further, we also examine if the number of breaks students take is associated with student achievement scores. Findings suggest that even in an untimed setting, there remains a strong association between time spent on the assessment and achievement at both the student and school level. The number of breaks, on the other hand, show little to no association with achievement after controlling for time. The paper concludes with a discussion of the policy implications of the findings.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"14 - 30"},"PeriodicalIF":1.5,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41397365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}