Beyond Agreement: Exploring Rater Effects in Large-Scale Mixed Format Assessments
Stefanie A. Wind, Wenjing Guo
Pub Date: 2021-08-17 | DOI: 10.1080/10627197.2021.1962277
Educational Assessment, vol. 26, pp. 264–283
ABSTRACT Scoring procedures for the constructed-response (CR) items in large-scale mixed-format educational assessments often involve checks for rater agreement or rater reliability. Although these analyses are important, researchers have documented rater effects that persist despite rater training and that are not always detected in rater agreement and reliability analyses, such as severity/leniency, centrality/extremism, and biases. Left undetected, these effects pose threats to fairness. We illustrate how rater effects analyses can be incorporated into scoring procedures for large-scale mixed-format assessments. We used data from the National Assessment of Educational Progress (NAEP) to illustrate relatively simple analyses that can provide insight into patterns of rater judgment that may warrant additional attention. Our results suggested that the NAEP raters exhibited generally defensible psychometric properties, while also exhibiting some idiosyncrasies that could inform scoring procedures. Similar procedures could be used operationally to inform the interpretation and use of rater judgments in large-scale mixed-format assessments.
Investigating the Use of Assessment Data by Primary School Teachers: Insights from a Large-scale Survey in Ireland
Vasiliki Pitsia, Anastasios Karakolidis, P. Lehane
Pub Date: 2021-07-03 | DOI: 10.1080/10627197.2021.1917358
Educational Assessment, vol. 26, pp. 145–162
ABSTRACT Evidence suggests that the quality of teachers’ instructional practices can be improved when these are informed by relevant assessment data. Drawing on a sample of 1,300 primary school teachers in Ireland, this study examined the extent to which teachers use standardized test results for instructional purposes as well as the role of several factors in predicting this use. Specifically, the study analyzed data from a cross-sectional survey that gathered information about teachers’ use of, experiences with, and attitudes toward assessment data from standardized tests. After taking other teacher and school characteristics into consideration, the analysis revealed that teachers with more positive attitudes toward standardized tests and those who were often engaged in some form of professional development on standardized testing tended to use assessment data to inform their teaching more frequently. Based on the findings, policy and practice implications are discussed.
Using Full-information Item Analysis to Improve Item Quality
T. Haladyna, Michael C. Rodriguez
Pub Date: 2021-07-03 | DOI: 10.1080/10627197.2021.1946390
Educational Assessment, vol. 26, pp. 198–211
ABSTRACT Full-information item analysis provides item developers and reviewers comprehensive empirical evidence of item quality, including option response frequency, point-biserial index (PBI) for distractors, mean scores of respondents selecting each option, and option trace lines. The multi-serial index (MSI) is introduced as a more informative item-total correlation, accounting for variable distractor performance. The overall item PBI is empirically compared to the MSI. For items from an operational mathematics and reading test, poorly performing distractors are systematically removed to recompute the MSI, indicating improvements in item quality. Case studies for specific items with different characteristics are described to illustrate a variety of outcomes, focused on improving item discrimination. Full-information item analyses are presented for each case study item, providing clear examples of interpretation and use of item analyses. A summary of recommendations for item analysts is provided.
The Impact of Disengaged Test Taking on a State’s Accountability Test Results
S. Wise, Sukkeun Im, Jay Lee
Pub Date: 2021-07-03 | DOI: 10.1080/10627197.2021.1956897
Educational Assessment, vol. 26, pp. 163–174
ABSTRACT This study investigated test-taking engagement on the Spring 2019 administration of a large-scale state summative assessment. Through the identification of rapid-guessing behavior – which is a validated indicator of disengagement – the percentage of Grade 8 test events with meaningful amounts of rapid guessing was 5.5% in mathematics, 6.7% in English Language Arts (ELA), and 3.5% in science. Disengagement rates on the state summative test were also found to vary materially across gender, ethnicity, Individualized Educational Plan (IEP) status, Limited English Proficient (LEP) status, free and reduced lunch (FRL) status, and disability status. However, school mean performance, proficiency rates, and relative ranking were only minimally affected by disengagement. Overall, results of this study indicate that disengagement has a material impact on individual state summative test scores, though its impact on score aggregations may be relatively minor.
Assessing Quality of Teaching from Different Perspectives: Measurement Invariance across Teachers and Classes
G. Krammer, Barbara Pflanzl, Gerlinde Lenske, Johannes Mayr
Pub Date: 2021-04-03 | DOI: 10.1080/10627197.2020.1858785
Educational Assessment, vol. 26, pp. 88–103
ABSTRACT Comparing teachers’ self-assessments to classes’ assessments of quality of teaching can offer insights for educational research and be a valuable resource for teachers’ continuous professional development. However, quality of teaching needs to be measured in the same way across perspectives for this comparison to be meaningful. We used data from 622 teachers who self-assessed aspects of quality of teaching and from their classes (12,229 students), who assessed the same aspects. Perspectives were compared with measurement invariance analyses. Teachers and classes agreed on the average level of instructional clarity but disagreed over the teacher-student relationship and performance monitoring, suggesting that mean differences across perspectives may not be as consistent as the literature claims. Results showed a nonuniform measurement bias for only one item of instructional clarity, while measurement of the other aspects was directly comparable. We conclude that comparing teachers’ and classes’ perspectives on aspects of quality of teaching is viable.
Predicting Retention in Higher Education from high-stakes Exams or School GPA
M. Meeter, M. V. van Brederode
Pub Date: 2021-02-06 | DOI: 10.1080/10627197.2022.2130748
Educational Assessment, vol. 28, pp. 1–10
ABSTRACT The transition from secondary to tertiary education varies from country to country. In many countries, secondary school is concluded with high-stakes national exams, or high-stakes entry tests are used for admission to tertiary education. In other countries, secondary-school grade point average (GPA) is the determining factor. In the Netherlands, both play a role. With administrative data on close to 180,000 students, we investigated whether national exam scores or secondary school GPA was a better predictor of tertiary first-year retention. For both university education and higher professional education, secondary school GPA was the better predictor of retention, to the extent that national exam scores did not explain any additional variance. Moreover, for students who failed their exam, were held back by the secondary school for an additional year, and entered tertiary education one year later, GPA in the year of failure remained as predictive as it was for students who had passed their exams and started tertiary education immediately. National exam scores, on the other hand, had no predictive value at all for these students. It is concluded that secondary school GPA measures aspects of student performance that are not included in high-stakes national exams but that are predictive of subsequent success in tertiary education.
Anchors Aweigh: How the Choice of Anchor Items Affects the Vertical Scaling of 3PL Data with the Rasch Model
Glenn Thomas Waterbury, Christine E. DeMars
Pub Date: 2021-01-20 | DOI: 10.1080/10627197.2020.1858782
Educational Assessment, vol. 26, pp. 175–197
ABSTRACT Vertical scaling is used to put tests of different difficulty onto a common metric. The Rasch model is often used to perform vertical scaling, despite its strict functional form. Few, if any, studies have examined anchor item choice when using the Rasch model to vertically scale data that do not fit the model. The purpose of this study was to investigate the implications of anchor item choice on bias in growth estimates when data do not fit the Rasch model. Data were generated with varying levels of true difference between grades and levels of the lower asymptote. When true growth or the lower asymptote were zero, estimates were unbiased and anchor item choice was not consequential. As true growth and the lower asymptote both increased, growth was underestimated and choice of anchor items had an impact. Easy anchor items led to less biased estimates of growth than hard anchor items.
Model meets reality: Validating a new behavioral measure for test-taking effort
Esther Ulitzsch, Christiane Penk, Matthias von Davier, S. Pohl
Pub Date: 2021-01-12 | DOI: 10.1080/10627197.2020.1858786
Educational Assessment, vol. 26, pp. 104–124
ABSTRACT Identifying and considering test-taking effort is of utmost importance for drawing valid inferences on examinee competency in low-stakes tests. Different approaches exist for doing so. The speed-accuracy+engagement model aims at identifying non-effortful test-taking behavior in terms of nonresponse and rapid guessing based on responses and response times. The model allows for identifying rapid-guessing behavior on the item-by-examinee level whilst jointly modeling the processes underlying rapid guessing and effortful responding. To assess whether the model indeed provides a valid measure of test-taking effort, we investigate (1) convergent validity with previously developed behavioral as well as self-report measures on guessing behavior and effort, (2) fit within the nomological network of test-taking motivation derived from expectancy-value theory, and (3) ability to detect differences between groups that can be assumed to differ in test-taking effort. Results suggest that the model captures central aspects of non-effortful test-taking behavior. While it does not cover the whole spectrum of non-effortful test-taking behavior, it provides a measure for some aspects of it, in a manner that is less subjective than self-reports. The article concludes with a discussion of implications for the development of behavioral measures of non-effortful test-taking behavior.
Do They See What I See? Toward a Better Understanding of the 7Cs Framework of Teaching Effectiveness
S. Phillips, Ronald F. Ferguson, Jacob F. S. Rowley
Pub Date: 2021-01-05 | DOI: 10.1080/10627197.2020.1858784
Educational Assessment, vol. 26, pp. 69–87
ABSTRACT School systems are increasingly incorporating student perceptions of teaching effectiveness into educator accountability systems. Using Tripod’s 7Cs™ Framework of Teaching Effectiveness, this study examines key issues in validating student perception data for use in this manner. Analyses examine the internal structure of 7Cs scores and the extent to which scores predict key criteria. Results offer the first empirical evidence that 7Cs scores capture seven distinct dimensions of teaching effectiveness even as they also confirm prior research concluding 7Cs scores are largely unidimensional. At the same time, results demonstrate a modest relationship between 7Cs scores and teacher self-assessments of their own effectiveness. Together, findings suggest 7Cs scores can be used to collect meaningful information about over-arching effectiveness. However, additional evidence is warranted before giving 7Cs scores as much weight in high-stakes contexts as value-added test-score gains or expert classroom observations.
The Effect of Linguistic Factors on Assessment of English Language Learners’ Mathematical Ability: A Differential Item Functioning Analysis
Stephanie Buono, E. Jang
Pub Date: 2020-12-17 | DOI: 10.1080/10627197.2020.1858783
Educational Assessment, vol. 26, pp. 125–144
ABSTRACT Increasing linguistic diversity in classrooms has led researchers to examine the validity and fairness of standardized achievement tests, specifically concerning whether test score interpretations are free of bias and whether score use is fair for all students. This study examined whether mathematics achievement test items that contain complex language function differently between two language subgroups: native English speakers (EL1, n = 1,000) and English language learners (ELL, n = 1,000). Confirmatory differential item functioning (DIF) analyses using SIBTEST were performed on 28 mathematics assessment items. Eleven items were identified as having complex language features, and DIF analyses revealed that seven of these items (63%) favored EL1s over ELLs. Effect sizes were moderate (between 0.05 and 0.10) for six items and marginal (below 0.05) for one item. This paper discusses validity issues with math achievement test items assessing ELLs and calls for careful test development and instructional accommodation in the classroom.