Robustness of Weighted Differential Item Functioning (DIF) Analysis: The Case of Mantel–Haenszel DIF Statistics
Ru Lu, Hongwen Guo, Neil J. Dorans
Two families of analysis methods can be used for differential item functioning (DIF) analysis. One family is DIF analysis based on observed scores, such as the Mantel–Haenszel (MH) DIF procedure and the standardized proportion-correct metric; the other is analysis based on latent ability, in which the statistic is a measure of departure from measurement invariance (DMI) for the two studied groups. Previous research has shown that DIF and DMI do not necessarily agree with each other. In practice, many operational testing programs use the MH DIF procedure to flag potential DIF items. Recently, weighted DIF statistics have been proposed, in which weighted sum scores serve as the matching variable and the weights are the item discrimination parameters. It has been shown theoretically and analytically that, given the item parameters, weighted DIF statistics can close the gap between DIF and DMI. The current study uses simulations to investigate empirically the robustness of weighted DIF statistics when item parameters must be estimated from data.
ETS Research Report Series, 2021(1), 1–23. https://doi.org/10.1002/ets2.12325 (published August 8, 2021)
Model Adequacy Checking for Applying Harmonic Regression to Assessment Quality Control
Jiahe Qian, Shuhong Li
In recent years, harmonic regression models have been applied to implement quality control for educational assessment data that span multiple administrations and display seasonality. As with other types of regression models, it is imperative that model adequacy checking and model fit evaluation be conducted appropriately. However, there has been no literature on how to perform a comprehensive model adequacy evaluation when applying harmonic regression models to sequential data with seasonality in the educational assessment field. This paper is intended to fill this gap with an illustration using real data from an English language assessment. Two types of cross-validation, leave-one-out and out-of-sample, were designed to measure prediction error and validate the models. Three types of R-squared statistics and various residual diagnostics were applied to check model adequacy and model fit.
ETS Research Report Series, 2021(1), 1–26. https://doi.org/10.1002/ets2.12327 (published August 8, 2021)
Equitable STEM Instruction and Assessment: Accessibility and Fairness Considerations for Special Populations
Danielle Guzman-Orth, Cary A. Supalo, Derrick W. Smith, Okhee Lee, Teresa King
The landscape for STEM instruction is rapidly shifting in the United States. Attention toward STEM instruction and assessment opportunities is increasing. All students must have opportunities to gain access to the STEM content and show what they know and are able to do. We caution that attention to fairness and accessibility is critical for students from special populations, particularly English learners and students with disabilities. Opportunities for equitable access to STEM instruction and assessment are diminished without accessibility. In this report, we use an assets-based perspective to discuss and reframe common misconceptions and challenges as opportunities. We argue that attention to accessibility at the onset of STEM instruction and assessment is the pivotal foundation for fair opportunities in STEM. We highlight key opportunities and conclude with recommendations for improved fairness and access in STEM.
{"title":"Equitable STEM Instruction and Assessment: Accessibility and Fairness Considerations for Special Populations","authors":"Danielle Guzman-Orth, Cary A. Supalo, Derrick W. Smith, Okhee Lee, Teresa King","doi":"10.1002/ets2.12324","DOIUrl":"10.1002/ets2.12324","url":null,"abstract":"<p>The landscape for STEM instruction is rapidly shifting in the United States. Attention toward STEM instruction and assessment opportunities is increasing. All students must have opportunities to gain access to the STEM content and show what they know and are able to do. We caution that attention to fairness and accessibility is critical for students from special populations, particularly English learners and students with disabilities. Opportunities for equitable access to STEM instruction and assessment are diminished without accessibility. In this report, we use an assets-based perspective to discuss and reframe common misconceptions and challenges as opportunities. We argue that attention to accessibility at the onset of STEM instruction and assessment is the pivotal foundation for fair opportunities in STEM. We highlight key opportunities and conclude with recommendations for improved fairness and access in STEM.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-16"},"PeriodicalIF":0.0,"publicationDate":"2021-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/ets2.12324","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45176062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assessing Mode Effects of At-Home Testing Without a Randomized Trial
Sooyeon Kim, Michael Walker
In this investigation, we used real data to assess potential differential effects associated with taking a test in a test center (TC) versus testing at home using remote proctoring (RP). We used a pseudo-equivalent groups (PEG) approach to examine group equivalence at the item level and the total score level. If our assumption holds that the PEG approach removes between-group ability differences (as measured by the test) reasonably well, then a plausible explanation for any systematic differences in performance between TC and RP groups that remain after applying the PEG approach would be the operation of test mode effects. At the item level, we compared item difficulties estimated using the PEG approach (i.e., adjusting only for ability differences between groups) to those estimated via delta equating (i.e., adjusting for any systematic differences between groups). All tests used in this investigation showed small, nonsystematic differences, providing evidence of trivial effects associated with at-home testing. At the total score level, we linked the RP group scores to the TC group scores after adjusting for group differences using demographic covariates. We then compared the resulting RP group conversion to the original TC group conversion (the criterion in this study). The magnitude of differences between the RP conversion and the TC conversion was small, leading to the same pass/fail decision for most RP examinees. The present analyses seem to suggest little to no mode effects for the tests used in this investigation.
ETS Research Report Series, 2021(1), 1–21. https://doi.org/10.1002/ets2.12323 (published July 20, 2021)
The Effects of Extended Planning Time on Candidates' Performance, Processes, and Strategy Use in the Lecture Listening-Into-Speaking Tasks of the TOEFL iBT® Test
Chihiro Inoue, Daniel M. K. Lam
This study investigated the effects of two planning time conditions (i.e., the operational length [20 s] and an extended length [90 s]) for the lecture listening-into-speaking tasks of the TOEFL iBT® test for candidates at different proficiency levels. Seventy international students based in universities and language schools in the United Kingdom (35 at a lower level; 35 at a higher level) participated in the study. The effects of the different lengths of planning time were examined in terms of (a) the scores given by ETS-certified raters; (b) the quality of the speaking performances, characterized by accurately reproduced idea units and measures of complexity, accuracy, and fluency; and (c) self-reported use of cognitive and metacognitive processes and strategies during listening, planning, and speaking. The analyses found neither a statistically significant main effect of the length of planning time nor an interaction between planning time and proficiency on the scores or on the quality of the speaking performances. Significantly more engagement was reported for several cognitive and metacognitive processes and strategies under the extended planning time, which suggests enhanced cognitive validity of the task. However, the increased engagement in planning did not lead to any measurable improvement in the scores. Therefore, in the interest of practicality, the results of this study provide justification for the operational length of planning time for the lecture listening-into-speaking tasks in the speaking section of the TOEFL iBT test.
ETS Research Report Series, 2021(1), 1–32. https://doi.org/10.1002/ets2.12322 (published June 21, 2021)
Identifying Teachers' Needs for Results From Interim Unit Assessments
Priya Kannan, Andrew D. Bryant, Shiyi Shao, E. Caroline Wylie
Interim assessments have been defined variously in different contexts and can be used for predictive or instructional purposes. In this paper, we present results from a study in which we evaluated reporting needs for interim assessments designed for instructional purposes and intended to be used at the end of defined curriculum units. Results from such unit assessments should help teachers determine gaps in student understanding and inform ongoing instructional decision-making. Our goal was to determine whether learning progressions (LPs) could serve as the cognitive lens through which teachers evaluate how their students' understanding of key constructs improves through periodic unit assessments. Therefore, we used the LP framework in mathematics and the key practices (KP) framework for English language arts (ELA) to design preliminary teacher report mock-ups for these unit assessments. Within a utilization-oriented evaluation framework, we conducted six needs-assessment focus groups with elementary and middle school mathematics (n = 12) and ELA (n = 11) teachers to evaluate the extent to which they find results presented within the LP and KP frameworks understandable and useful for their instructional practice. Results from the focus groups show the types of information teachers seek from unit assessment reports, the extent to which teachers are familiar with the LP and KP frameworks, their interpretations (including confusions) of the information presented in the preliminary mock-ups, and their additional needs for unit assessment reports to be instructionally useful.
{"title":"Identifying Teachers' Needs for Results From Interim Unit Assessments","authors":"Priya Kannan, Andrew D. Bryant, Shiyi Shao, E. Caroline Wylie","doi":"10.1002/ets2.12320","DOIUrl":"10.1002/ets2.12320","url":null,"abstract":"<p>Interim assessments have been defined variously in different contexts and can be used for predictive purposes or instructional purposes. In this paper, we present results from a study where we evaluated reporting needs for interim assessments designed for instructional purposes and intended to be used at the end of defined curriculum units. Results from such unit assessments should help teachers determine gaps in student understanding and inform ongoing instructional decision-making. Our goal was to determine if learning progressions (LPs) could serve as the cognitive lens through which teachers can evaluate how their students' understanding of key constructs improves through periodic unit assessments. Therefore, we used the LP framework in mathematics and the key practices (KP) framework for English language arts (ELA) to design preliminary teacher report mock-ups for these unit assessments. Within a utilization-oriented evaluation framework, we conducted six needs-assessment focus groups with elementary and middle school mathematics (<i>n</i> = 12) and ELA (<i>n</i> = 11) teachers to specifically evaluate the extent to which they find results presented within the LP and KP frameworks understandable and useful for their instructional practice. Results from the focus groups show teachers' overall needs for types of information sought from unit assessment reports, the extent to which teachers are familiar with the LP and KP frameworks, their interpretations (including confusions) of the information presented in the preliminary mock-ups, and their additional needs for reports from unit assessments to be instructionally useful.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-39"},"PeriodicalIF":0.0,"publicationDate":"2021-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/ets2.12320","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46194713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Development and Evaluation of Interactional Competence Elicitor for Oral Language Assessments
Evgeny Chukharev-Hudilainen, Gary J. Ockey
This paper describes the development and evaluation of Interaction Competence Elicitor (ICE), a spoken dialog system (SDS) for the delivery of a paired oral discussion task in the context of language assessment. The purpose of ICE is to sustain a topic-specific conversation with a test taker in order to elicit discourse that can later be judged to assess the test taker's oral language ability, including interactional competence. The development of ICE is reported in detail to provide guidance for future developers of similar systems. The performance of ICE is evaluated in two ways: (a) by analyzing system errors that occur at different stages of the natural language processing (NLP) pipeline, in terms of both their preventability and their impact on downstream stages of the pipeline, and (b) by analyzing questionnaire and semistructured interview data to establish test takers' experience with the system. Findings suggest that ICE was robust in 90% of the dialog turns it produced, and test takers noted both positive and negative aspects of communicating with the system as opposed to a human interlocutor. We conclude that this prototype system lays important groundwork for the development and use of specialized SDSs in the assessment of oral communication, including interactional competence.
ETS Research Report Series, 2021(1), 1–20. https://doi.org/10.1002/ets2.12319 (published April 14, 2021)
Career and Technical Education as a Conduit for Skilled Technical Careers: A Targeted Research Review and Framework for Future Research
Sara Haviland, Steven Robbins
Workforce development and career and technical education (CTE) have long provided reliable pathways to middle-skill jobs and a gateway to the middle class. Given recent changes in middle-skill jobs, the education landscape, and federal policy priorities, the role of CTE in U.S. education is evolving rapidly, encompassing a broader range of education, and practice is changing ahead of research. The first part of this report provides an overview of the current state of CTE in the United States, as well as the state of CTE research, and presents an argument for a broader definition of CTE that incorporates workforce development through postsecondary institutions. The second part provides operational definitions and typologies to facilitate future research. Our aim is to build a research framework for CTE that is grounded in a normative path through CTE: getting in (preparation and recruitment), getting through (retention and skill acquisition), getting out (completion and initial hire), and getting on (career progression). Key challenges and priorities for future research are discussed.
ETS Research Report Series, 2021(1), 1–42. https://doi.org/10.1002/ets2.12318 (published April 11, 2021)
Certified to Evaluate: Exploring Administrator Accuracy and Beliefs in Teacher Observation
Nathan Jones, Courtney Bell, Yi Qi, Jennifer Lewis, David Kirui, Leslie Stickler, Amanda Redash
The teacher observation systems in use in all 50 states require administrators to learn to score their teachers' instruction accurately and reliably. Although the literature on observation systems is growing, relatively few studies have examined the outcomes of training designed to develop administrators' scoring accuracy or administrators' perceptions of that training. The focus of this study is therefore on administrators' efforts to become accurate and reliable scorers within the context of a comprehensive teacher evaluation reform. The study was conducted during the year-long training and implementation of a new observation system as part of a large urban district's teacher evaluation reform. It brings together data on the outcomes of the district training (results on a certification exercise completed by all administrators in the district) with two sources of data on administrators' perceptions and beliefs: fall and spring surveys of nearly 300 administrators and longitudinal interviews with a subsample of 24 administrators. Taken together, these data allowed us to investigate administrators' responses to training and to low-stakes practice with the observation process over 1 year. At the end of initial training, administrators reported high levels of learning, particularly in domains aligned with the focus of training. Over the year, administrators reported increased facility with the routines of conducting observations, but they still expressed learning needs, many related to the content of the observation framework. However, results from the training certification test suggested lower than desired levels of accuracy and reliability; administrators regularly did not agree with each other or with master raters. The certification results suggest that even with a significant investment in administrator learning, there was more to be learned and mastered. If we hope for teacher evaluation to lead to the types of changes in teaching and learning that reformers have envisioned, policymakers and practitioners alike will need to devote time and resources to supporting administrator learning, both in initial training and throughout administrators' use of observation systems in practice.
{"title":"Certified to Evaluate: Exploring Administrator Accuracy and Beliefs in Teacher Observation","authors":"Nathan Jones, Courtney Bell, Yi Qi, Jennifer Lewis, David Kirui, Leslie Stickler, Amanda Redash","doi":"10.1002/ets2.12316","DOIUrl":"10.1002/ets2.12316","url":null,"abstract":"<p>The observation systems being used in all 50 states require administrators to learn to accurately and reliably score their teachers' instruction using standardized observation systems. Although the literature on observation systems is growing, relatively few studies have examined the outcomes of trainings focused on developing administrators' accuracy using observation systems and the administrators' perceptions of that training. Therefore, the focus of this study is on examining administrators' efforts to become accurate and reliable within the context of a comprehensive teacher evaluation reform. This study was conducted during the year-long training and implementation of a new observation system in the context of a large urban district's teacher evaluation reform. The study brings together data on the outcomes of the district training—results on a certification exercise from all administrators in the district—with two sources of data on administrators' perceptions and beliefs. Specifically, we collected fall and spring survey data from nearly 300 administrators and longitudinal interview data from a subsample of 24 administrators. Taken together, these data allowed us to investigate administrators' responses to training and low-stakes practice using the observation process over 1 year. At the end of initial training, administrators reported high levels of learning, particularly in domains aligned with the focus of training. Over the year, administrators reported increased facility with the routines of conducting observations, but they still expressed learning needs, many related to the content of the observation framework. However, results from the training certification test suggested lower than desired levels of accuracy and reliability; administrators regularly did not agree with each other or with master raters. The certification test results suggested that even with a significant investment in administrator learning, there was more to be learned and mastered. If we hope for teacher evaluation to lead to the types of changes in teaching and learning that reformers have envisioned, policymakers and practitioners alike will need to devote time and resources to supporting administrator learning in initial training and throughout administrator use in practice.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-20"},"PeriodicalIF":0.0,"publicationDate":"2021-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/ets2.12316","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42458081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Researching Academic Reading in Two Contrasting English as a Medium of Instruction Contexts at a University Level
Nathaniel Owen, Prithvi N. Shrestha, Anna Kristina Hultgren
This project examined academic reading in two contrasting English as a medium of instruction (EMI) university settings in Nepal and Sweden and the unique challenges facing students who are studying in a language other than their primary language. The motivation for the project was to explore the role of high-stakes testing in EMI contexts and the implications for the design of the TOEFL iBT® test. We employed a sequential mixed-methods approach to gather substantive and authentic qualitative data from stakeholders immersed in EMI settings. A small sample of students (Nepal = 19, Sweden = 9) were asked to complete reading logs over a period of 3 weeks so we could determine the types of texts and the reading load associated with diverse EMI settings. Additionally, a larger cohort of students from each setting (Nepal = 69, Sweden = 60) completed questionnaires examining academic reading demands, reading skills, and practices. Students who completed the questionnaires also completed the reading section of the TOEFL iBT test, as well as a TOEFL® family of tests suitability questionnaire, so we could consider the suitability of the TOEFL iBT test for EMI contexts. Following test completion, a series of semistructured interviews (Nepal = 21, Sweden = 23) focused more closely on students' perspectives on the reading demands of their academic contexts and on the suitability of the reading section of the TOEFL iBT test for making claims about readiness to study in EMI contexts. Our findings revealed that different EMI contexts have different standards of high and low academic reading proficiency and that these differences may arise from differences in the educational experiences of the respective cohorts. The findings offer important new insights into academic reading and assessment in EMI contexts. Students in EMI contexts are sensitive to violations of expectations regarding test-taking experiences (face validity). The study has implications for the design of test tasks, which should consider local, contextual varieties of English.
{"title":"Researching Academic Reading in Two Contrasting English as a Medium of Instruction Contexts at a University Level","authors":"Nathaniel Owen, Prithvi N. Shrestha, Anna Kristina Hultgren","doi":"10.1002/ets2.12317","DOIUrl":"10.1002/ets2.12317","url":null,"abstract":"<p>This project examined academic reading in two contrasting English as a medium of instruction (EMI) university settings in Nepal and Sweden and the unique challenges facing students who are studying in a language other than their primary language. The motivation for the project was to explore the role of high-stakes testing in EMI contexts and the implications for the design of the <i>TOEFL iBT</i>® test. We employed a sequential mixed-methods approach to gather substantive and authentic qualitative data from stakeholders immersed in EMI settings. A small sample of students (Nepal = 19, Sweden = nine) were asked to complete reading logs over a period of 3 weeks so we could determine the types of texts and reading load associated with diverse EMI settings. Additionally, a larger cohort of students from each setting (Nepal = 69, Sweden = 60) completed questionnaires examining academic reading demands, reading skills, and practices. Students who completed the questionnaires also completed the reading section of the TOEFL iBT test. The same students also completed a <i>TOEFL</i>® family of tests suitability questionnaire so we could consider the suitability of the TOEFL iBT test for EMI contexts. Following test completion, a series of semistructured interviews (Nepal = 21, Sweden = 23) focused more closely on students' perspectives of reading demands in their academic contexts and the suitability of the reading section of the TOEFL iBT test to make claims about readiness to study in EMI contexts. Our findings revealed that different EMI contexts have different standards of high and low academic reading proficiency and that these differences may occur due to differences in educational experiences of the respective cohorts. The findings offer important new insights into academic reading and assessment in EMI contexts. Students in EMI contexts are sensitive to violations of expectations regarding test-taking experiences (face validity). The study has implications for the design of test tasks, which should consider local, contextual varieties of English.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-28"},"PeriodicalIF":0.0,"publicationDate":"2021-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/ets2.12317","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42450371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}