Sugene Cho-Baker, Harrison J. Kell, Daniel Fishtein
The career gains of obtaining a graduate degree are well established, but individuals from lower socioeconomic status (SES) and underrepresented demographic backgrounds have persistently been disadvantaged in earning those degrees. We aim to contribute to research on enhancing access, diversity, and equity in graduate education by providing insights into what motivates individuals to pursue a graduate education across demographic and socioeconomic backgrounds. Using survey data collected from GRE® test takers at two time points and exploratory structural equation modeling, we explore the factors that individuals consider to be important for pursuing graduate education and selecting graduate programs, along with subsequent application and acceptance outcomes. We identified three factors considered in deciding to pursue graduate school and six factors considered in selecting graduate school programs. Those who aimed to apply to graduate school for professional development considered an extensive set of factors in selecting programs. The factors considered varied by gender, ethnicity/race, and SES. These factors further varied in the extent to which they predicted graduate school application and acceptance outcomes.
{"title":"Factors Considered in Graduate School Decision-Making: Implications for Graduate School Application and Acceptance","authors":"Sugene Cho-Baker, Harrison J. Kell, Daniel Fishtein","doi":"10.1002/ets2.12348","DOIUrl":"10.1002/ets2.12348","url":null,"abstract":"<p>The career gains of obtaining a graduate degree are well established, but those from lower socioeconomic status (SES) and underrepresented demographic backgrounds have persistently been disadvantaged in earning those degrees. We aim to contribute to research on enhancing access, diversity, and equity to graduate education by providing insights into what motivates individuals to pursue a graduate education across demographic and socioeconomic backgrounds. Using survey data collected from <i>GRE</i>® test takers at two time points and exploratory structural equation modeling, we explore the factors that individuals consider to be important for pursuing graduate education and selecting graduate programs, along with subsequent application and acceptance outcomes. We identified three factors considered in deciding to pursue graduate school and six factors considered in selecting graduate school programs. Those who aimed to apply to graduate school for professional development considered an extensive set of factors in selecting programs. The factors considered varied by gender, ethnicity/race, and SES. These factors further varied in the extent to which they predicted graduate school application and acceptance outcomes.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-18"},"PeriodicalIF":0.0,"publicationDate":"2022-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12348","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44481902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Synthetically generated speech (SGS) has become an integral part of our oral communication in a wide variety of contexts. It can be generated instantly at a low cost and allows precise control over multiple aspects of output, all of which can be highly appealing to second language (L2) assessment developers who have traditionally relied upon human voice actors for recording audio materials. Nevertheless, SGS is not widely used in L2 assessments. One major concern in this use case lies in its potential impact on test-taker performance: Would the use of SGS (as opposed to human voice actor recordings) change how test takers respond to an item? In this study, we investigated the impact of using SGS as stimuli for English L2 listening assessment items on test-taker performance. The data came from a pilot administration of multiple new task types and included 653 test takers' responses to two versions of the same 13 items, differing only in their listening stimuli: one version used human voice actor recordings, and the other used SGS files. Multifaceted comparisons of test takers' responses across the two versions showed that they elicited remarkably comparable performance. This comparability provides strong empirical evidence for the use of SGS as a viable alternative to human voice actor recordings in the immediate domain of L2 assessment as well as in related domains such as learning material and research instrument development.
{"title":"The Impact of Using Synthetically Generated Listening Stimuli on Test-Taker Performance: A Case Study With Multiple-Choice, Single-Selection Items","authors":"Ikkyu Choi, Jiyun Zu","doi":"10.1002/ets2.12347","DOIUrl":"10.1002/ets2.12347","url":null,"abstract":"<p>Synthetically generated speech (SGS) has become an integral part of our oral communication in a wide variety of contexts. It can be generated instantly at a low cost and allows precise control over multiple aspects of output, all of which can be highly appealing to second language (L2) assessment developers who have traditionally relied upon human voice actors for recording audio materials. Nevertheless, SGS is not widely used in L2 assessments. One major concern in this use case lies in its potential impact on test-taker performance: Would the use of SGS (as opposed to using human voice actor recordings) change how test takers respond to an item? In this study, we investigated using SGS as stimuli for English L2 listening assessment items on test-taker performance. The data came from a pilot administration of multiple new task types and included 653 test takers' responses to two versions of the same 13 items, differing only in terms of their listening stimuli: a version using human voice actor recordings and the other version with SGS files. Multifaceted comparisons between test takers' responses across the two versions showed that the two versions elicited remarkably comparable performance. The comparability provides strong empirical evidence for the use of SGS as a viable alternative for human voice actor recordings in the immediate domain of L2 assessment as well as related domains such as learning material and research instrument development.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-14"},"PeriodicalIF":0.0,"publicationDate":"2022-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12347","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48532571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kevin M. Williams, Michelle P. Martin-Raugh, Jennifer E. Lentini
Researchers and practitioners in postsecondary and workplace settings recognize the value of noncognitive constructs in predicting academic and vocational success but also perceive that many students or employees are lacking in these areas. In turn, there is increased interest in interventions designed to enhance these constructs. We provide an empirically informed theory of change (ToC) that describes the inputs, mechanisms, and outputs of noncognitive construct interventions (NCIs). The components that inform this ToC include specific relevant constructs that are amenable to intervention, intervention content and mechanisms of change, methodological considerations, moderators of program efficacy, recommendations for evaluating NCIs, and suggested outcomes. Ultimately, NCIs should provide benefits to individuals, institutions, and society at large while also advancing our scientific understanding of how noncognitive constructs can be improved.
{"title":"Improving Noncognitive Constructs for Career Readiness and Success: A Theory of Change for Postsecondary, Workplace, and Research Applications","authors":"Kevin M. Williams, Michelle P. Martin-Raugh, Jennifer E. Lentini","doi":"10.1002/ets2.12346","DOIUrl":"10.1002/ets2.12346","url":null,"abstract":"<p>Researchers and practitioners in postsecondary and workplace settings recognize the value of noncognitive constructs in predicting academic and vocational success but also perceive that many students or employees are lacking in these areas. In turn, there is increased interest in interventions designed to enhance these constructs. We provide an empirically informed theory of change (ToC) that describes the inputs, mechanisms, and outputs of noncognitive construct interventions (NCIs). The components that inform this ToC include specific relevant constructs that are amenable to intervention, intervention content and mechanisms of change, methodological considerations, moderators of program efficacy, recommendations for evaluating NCIs, and suggested outcomes. In turn, NCIs should provide benefits to individuals, institutions, and society at large and also advance our scientific understanding of this important phenomenon.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-11"},"PeriodicalIF":0.0,"publicationDate":"2022-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12346","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45303138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Margarita Olivera-Aguilar, Hee-Sun Lee, Amy Pallant, Vinetha Belur, Matthew Mulholland, Ou Lydia Liu
This study uses a computerized formative assessment system that provides automated scoring and feedback to help students write scientific arguments in a climate change curriculum. We compared the effect of contextualized versus generic automated feedback on students' explanations of scientific claims and attributions of uncertainty to those claims. Classes were randomly assigned to the contextualized feedback condition (227 students from 11 classes) or to the generic feedback condition (138 students from 9 classes). The results indicate that the formative assessment helped students improve both their explanation and uncertainty attribution scores, with larger gains in the uncertainty attribution scores. Although the contextualized feedback was associated with higher final scores, this effect was moderated by the number of revisions made, the initial score, and gender. We discuss how the results might be related to students' familiarity with writing scientific explanations versus uncertainty attributions at school.
{"title":"Comparing the Effect of Contextualized Versus Generic Automated Feedback on Students' Scientific Argumentation","authors":"Margarita Olivera-Aguilar, Hee-Sun Lee, Amy Pallant, Vinetha Belur, Matthew Mulholland, Ou Lydia Liu","doi":"10.1002/ets2.12344","DOIUrl":"https://doi.org/10.1002/ets2.12344","url":null,"abstract":"<p>This study uses a computerized formative assessment system that provides automated scoring and feedback to help students write scientific arguments in a climate change curriculum. We compared the effect of contextualized versus generic automated feedback on students' explanations of scientific claims and attributions of uncertainty to those claims. Classes were randomly assigned to the contextualized feedback condition (227 students from 11 classes) or to the generic feedback condition (138 students from 9 classes). The results indicate that the formative assessment helped students improve their scores in both explanation and uncertainty scores, but larger score gains were found in the uncertainty attribution scores. Although the contextualized feedback was associated with higher final scores, this effect was moderated by the number of revisions made, the initial score, and gender. We discuss how the results might be related to students' familiarity with writing scientific explanations versus uncertainty attributions at school.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-14"},"PeriodicalIF":0.0,"publicationDate":"2022-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12344","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"109171717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Katrina Roohr, Margarita Olivera-Aguilar, Jennifer Bochenek, Vinetha Belur
The United States continues to be a top destination for international students pursuing an advanced degree. Some information about the characteristics of international students applying to graduate programs in the United States is available, but little is known about how these characteristics are related to test-taker performance on graduate admissions tests and how performance may be related to graduate program characteristics. The purpose of this study was to investigate different patterns of performance of international test takers from four cultural regions and two large countries (China and India) on both the GRE® test and the TOEFL® test and the relationship with demographic and graduate program characteristics. Using finite mixture modeling, we investigated the most common GRE and TOEFL score profiles for international students intending to pursue a graduate program within the United States; evaluated the demographic and college-level factors related to the profiles; and evaluated whether the profiles were differentially associated with gender, intended field of study, and intended degree level. Results showed the following broad patterns: (a) Most countries and cultural regions, except for the Middle East, had three or four latent profiles representing low, medium, and high scores on the GRE and TOEFL sections; (b) two high-performing profiles were found in Confucian Asia, one with higher GRE Quantitative Reasoning scores and the other with higher scores on GRE Verbal and TOEFL; (c) regardless of profile, test takers from China performed highest on the GRE Quantitative Reasoning section as compared to other GRE and TOEFL section scores; (d) in general, students in the lower-performing profiles were more likely to have taken the TOEFL and GRE multiple times; (e) regardless of country or cultural region, men were represented more than women overall and across most of the profiles; and (f) test takers showed a preference for science-, technology-, engineering-, and mathematics-based fields and master's degrees, although this varied across country and cultural region. Implications for future research are discussed.
{"title":"Exploring GRE® and TOEFL® Score Profiles of International Students Intending to Pursue a Graduate Degree in the United States","authors":"Katrina Roohr, Margarita Olivera-Aguilar, Jennifer Bochenek, Vinetha Belur","doi":"10.1002/ets2.12343","DOIUrl":"https://doi.org/10.1002/ets2.12343","url":null,"abstract":"<p>The United States continues to be a top destination for international students pursuing an advanced degree. Some information about the characteristics of international students applying to graduate programs in the United States is available, but little is known about how these characteristics are related to test taker performance on graduate admissions tests and how performance may be related to graduate program characteristics. The purpose of this study was to investigate different patterns of performance of international test takers from four cultural regions and two large countries (China and India) on both the <i>GRE</i>® test and the <i>TOEFL</i>® test and the relationship with demographic and graduate program characteristics. Using finite mixture modeling, we investigated the most common score profiles using GRE and TOEFL for international students intending to pursue a graduate program within the United States; evaluated the demographic and college-level factors related to the profiles; and evaluated whether the profiles were differentially associated with gender, intended field of study, and intended degree level. Results showed the following broad patterns of results: (a) Most countries and cultural regions, except for the Middle East, had three or four latent profiles representing low, medium, and high scores on the GRE and TOEFL sections; (b) two high-performing profiles were found in Confucian Asia, one with higher GRE Quantitative Reasoning scores and the other with higher scores on GRE Verbal and TOEFL; (c) regardless of profile, test takers from China performed highest on the GRE Quantitative Reasoning section as compared to other GRE and TOEFL section scores; (d) in general, there was a relationship with students in the lower performing profiles taking the TOEFL and GRE multiple times; (e) regardless of country or cultural region, men were represented more than women overall and across most of the profiles; and (f) test takers showed a preference for science-, technology-, engineering-, and mathematics-based fields and master's degrees, but this varied across country and cultural region. Implications for future research are discussed.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-27"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12343","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"109172488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hongwen Guo, Joseph A. Rios, Guangming Ling, Zhen Wang, Lin Gu, Zhitong Yang, Lydia O. Liu
Different variants of the selected-response (SR) item type have been developed for various reasons (e.g., simulating realistic situations, examining critical-thinking and/or problem-solving skills). Generally, the variants of the SR item format are more complex than traditional multiple-choice (MC) items, which may make them more challenging to test takers and thus may discourage test engagement on low-stakes assessments. Low test-taking effort has been shown to distort test scores and thereby diminish score validity. We used data collected from a large-scale assessment to investigate how variants of the SR item format may affect test properties and test engagement. Results show that the studied variants of the SR item format were generally harder and more time consuming than the traditional MC item format, but they showed no negative impact on test-taking effort. However, item position had a dominant influence on nonresponse rates and rapid-guessing rates in a cumulative fashion, even though the effect sizes were relatively small in the studied data.
{"title":"Influence of Selected-Response Format Variants on Test Characteristics and Test-Taking Effort: An Empirical Study","authors":"Hongwen Guo, Joseph A. Rios, Guangming Ling, Zhen Wang, Lin Gu, Zhitong Yang, Lydia O. Liu","doi":"10.1002/ets2.12345","DOIUrl":"10.1002/ets2.12345","url":null,"abstract":"<p>Different variants of the selected-response (SR) item type have been developed for various reasons (i.e., simulating realistic situations, examining critical-thinking and/or problem-solving skills). Generally, the variants of SR item format are more complex than the traditional multiple-choice (MC) items, which may be more challenging to test takers and thus may discourage their test engagement on low-stakes assessments. Low test-taking effort has been shown to distort test scores and thereby diminish score validity. We used data collected from a large-scale assessment to investigate how variants of the SR item format may impact test properties and test engagement. Results show that the studied variants of SR item format were generally harder and more time consuming compared to the traditional MC item format, but they did not show negative impact on test-taking effort. However, item position had a dominating influence on nonresponse rates and rapid-guessing rates in a cumulative fashion, even though the effect sizes were relatively small in the studied data.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-20"},"PeriodicalIF":0.0,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12345","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47274215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this report, we demonstrate the use of differential response time (DRT) methodology, an extension of differential item functioning methodology, for examining differences in how students from different backgrounds engage with assessment tasks. We analyze response time data from a digitally delivered mathematics assessment to examine timing differences between English language learner (ELL) and non-ELL student groups. When the groups were matched on total sum scores for the studied test form, ELLs spent significantly more time on most items than non-ELLs who performed similarly on the form. When the groups were matched on total response time, ELL students spent significantly more time on items in the first half of the form but less time on items in the second half. This research demonstrates the usefulness of DRT methodology in gaining insights about the differential engagement of students with assessment tasks.
{"title":"Comparing Test-Taking Behaviors of English Language Learners (ELLs) to Non-ELL Students: Use of Response Time in Measurement Comparability Research","authors":"Hongwen Guo, Kadriye Ercikan","doi":"10.1002/ets2.12340","DOIUrl":"10.1002/ets2.12340","url":null,"abstract":"<p>In this report, we demonstrate use of differential response time (DRT) methodology, an extension of differential item functioning methodology, for examining differences in how students from different backgrounds engage with assessment tasks. We analyze response time data from a digitally delivered mathematics assessment to examine timing differences between English language learner (ELL) and non-ELL student groups. When matched on the total sum scores of the studied item form, results showed that ELLs spent a significantly longer time on most items compared to the non-ELLs who performed similarly on the test form. When matched on the total response time, results showed that ELL students spent a significantly longer time on items in the first half of the form but a shorter time on items in the second half. This research demonstrates the usefulness of DRT methodology in gaining insights about the differential engagement of students with assessment tasks.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-15"},"PeriodicalIF":0.0,"publicationDate":"2021-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12340","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48342715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Agreement statistics and measures of prediction accuracy are often used to assess the quality of two measures of a construct. Agreement statistics are appropriate for measures that are supposed to be interchangeable, whereas prediction accuracy statistics are appropriate for situations where one variable is the target and the other variables are predictors. Using bivariate normality assumptions, we analytically examine the impact of categorization of a continuous variable and mean/sigma scaling on different measures of agreement and different measures of prediction accuracy. We vary the degree of relationship (squared correlation) between two continuous measures of a construct and the degree to which these measures are reduced to fewer and fewer categories (categorization). The main findings include that (a) categorization influences all the statistics investigated, (b) the correlation between the continuous variables affects the values of the statistics, and (c) scaling a prediction of a target variable to have the same mean and variability as the target increases agreement (according to Cohen's kappa and quadratic weighted kappa) but does so at the expense of prediction accuracy. The implications of these results for scoring of essays by humans or machines are also discussed.
{"title":"Impact of Categorization and Scaling on Classification Agreement and Prediction Accuracy Statistics","authors":"Wei Wang, Neil J. Dorans","doi":"10.1002/ets2.12339","DOIUrl":"10.1002/ets2.12339","url":null,"abstract":"<p>Agreement statistics and measures of prediction accuracy are often used to assess the quality of two measures of a construct. Agreement statistics are appropriate for measures that are supposed to be interchangeable, whereas prediction accuracy statistics are appropriate for situations where one variable is the target and the other variables are predictors. Using bivariate normality assumptions, we analytically examine the impact of categorization of a continuous variable and mean/sigma scaling on different measures of agreement and different measures of prediction accuracy. We vary the degree of relationship (squared correlation) between two continuous measures of a construct and the degree to which these measures are reduced to fewer and fewer categories (categorization). The main findings include that (a) categorization influences all the statistics investigated, (b) the correlation between the continuous variables affects the values of the statistics, and (c) scaling a prediction of a target variable to have the same mean and variability as the target increases agreement (according to Cohen's kappa and quadratic weighted kappa) but does so at the expense of prediction accuracy. The implications of these results for scoring of essays by humans or machines are also discussed.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-20"},"PeriodicalIF":0.0,"publicationDate":"2021-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12339","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44853401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shoko Sasayama, Pablo Garcia Gomez, John M. Norris
This report describes the development of efficient second language (L2) writing assessment tasks designed specifically for low-proficiency learners of English to be included in the TOEFL® Essentials™ test. Based on the can-do descriptors of the Common European Framework of Reference for Languages for the A1 through B1 levels of proficiency, four task types were identified as prototypical candidate writing tasks for the target test-taker population (i.e., adolescent and adult low-proficiency English learners): (a) Describe a Photo, (b) Write a Review, (c) Chat With a Friend, and (d) Write an E-mail. These task types were also considered efficient within the design of the test in that they had the potential to be accessible to low-proficiency learners and to elicit sufficient spontaneous writing for assessment purposes within a short period of time. In the current study, eight assessment tasks, two for each task type, were developed and piloted with 169 A1–B1 learners of English from Japan and Colombia. The findings revealed that the Describe a Photo and Write an E-mail tasks performed best at eliciting substantial language use and emphasizing distinct performance attributes, both characteristics needed for efficiently measuring test takers' writing proficiency and for discriminating among proficiency levels at the lower end of the spectrum. The report concludes by highlighting some observations on L2 writing assessment task design for low-proficiency learners of English.
{"title":"Designing Efficient L2 Writing Assessment Tasks for Low-Proficiency Learners of English","authors":"Shoko Sasayama, Pablo Garcia Gomez, John M. Norris","doi":"10.1002/ets2.12341","DOIUrl":"10.1002/ets2.12341","url":null,"abstract":"<p>This report describes the development of efficient second language (L2) writing assessment tasks designed specifically for low-proficiency learners of English to be included in the <i>TOEFL® Essentials™</i> test. Based on the can-do descriptors of the Common European Framework of Reference for Languages for the A1 through B1 levels of proficiency, four task types were identified to be prototypical candidate writing tasks for the target test-taker population (i.e., adolescent and adult low-proficiency English learners). Those four task types included: (a) Describe a Photo, (b) Write a Review, (c) Chat With a Friend, and (d) Write an E-mail. These task types were also considered efficient in the framework of the test in that they had the potential to be accessible to low-proficiency learners and to elicit sufficient spontaneous writing for assessment purposes within a short period of time. In the current study, eight assessment tasks, two for each task type, were developed and piloted with 169 A1–B1 learners of English from Japan and Colombia. The findings revealed that the Describe a Photo and Write an E-mail tasks performed the best in eliciting substantial language use and emphasizing distinct performance attributes, both characteristics needed for efficiently measuring test takers' writing proficiency as well as discriminating among proficiency levels at the lower end of the spectrum. The report concludes by highlighting some observations on L2 writing assessment task design for low-proficiency learners of English.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-31"},"PeriodicalIF":0.0,"publicationDate":"2021-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12341","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46674905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The elicited imitation task (EIT), in which language learners listen to a series of spoken sentences and repeat each one verbatim, is a commonly used measure of language proficiency in second language acquisition research. The TOEFL® Essentials™ test includes an EIT as a holistic measure of speaking proficiency, referred to as the "Listen and Repeat" task type. In this report, we describe the design considerations that informed the development of the EIT for TOEFL Essentials. We also report the results of a series of investigations conducted during the prototyping and pilot phases of test development, undertaken with the goal of confirming task design specifications, evaluating scoring performance, and obtaining initial validity evidence to support score interpretation and use of the EIT in the TOEFL Essentials test. We found that task design variables generally performed as expected. The length of the input sentence was strongly associated with performance (Pearson r = .88), consistent with the construct measured by the EIT, while task variables not directly related to the EIT construct (e.g., graphics, speaker accent, and response time) did not affect performance. Scorers drawn from TOEFL iBT test raters were able to score responses consistently, with over 98% exact or adjacent interrater agreement on a 6-point scale, and scores on the pilot version of the EIT were highly reliable (Cronbach's α = .93 on the 15-item pilot version). Correlations between EIT scores and other measures were generally as expected: Correlations with other speaking tasks were high (.78–.84), while correlations with other language measures were somewhat lower (.73 for writing, .68 for listening, and .57 for reading). The correlation with an independent measure of holistic language proficiency (a C-test) was moderately high (.69), as expected. We discuss the study findings in terms of the TOEFL Essentials test validity argument and point out limitations of the current results along with future research needs. Overall, we believe that the findings provide initial support to warrant the use of the EIT as operationalized in the TOEFL Essentials test.
{"title":"Developing an Innovative Elicited Imitation Task for Efficient English Proficiency Assessment","authors":"Larry Davis, John Norris","doi":"10.1002/ets2.12338","DOIUrl":"10.1002/ets2.12338","url":null,"abstract":"<p>The elicited imitation task (EIT), in which language learners listen to a series of spoken sentences and repeat each one verbatim, is a commonly used measure of language proficiency in second language acquisition research. The <i>TOEFL</i>® <i>Essentials</i>™ test includes an EIT as a holistic measure of speaking proficiency, referred to as the “Listen and Repeat” task type. In this report, we describe the design considerations that informed the development of the EIT for TOEFL Essentials. We also report the results of a series of investigations conducted during the prototyping and pilot phases of test development, which were undertaken with the goal of confirming task design specifications, evaluating scoring performance, and obtaining initial validity evidence to support score interpretation and use of the EIT in the TOEFL Essentials test. We found that task design variables generally performed as expected. The length of input sentence was strongly associated with performance (Pearson <i>r</i> = .88), consistent with the construct measured by the EIT, while other task variables not directly related to the EIT construct did not impact performance (e.g., graphics, speaker accent, and response time). Scorers drawn from TOEFL iBT test raters were able to score responses consistently with over 98% exact or adjacent interrater agreement on a 6-point scale, and scores on the pilot version of the EIT were highly reliable (Cronbach's α = .93 on the 15-item pilot version). Correlations between EIT scores and other measures were generally as expected: Correlations with other speaking tasks were high (.78–.84) and slightly to somewhat lower for other language measures (.73 for writing, .68 for listening, and .57 for reading). Correlation with an independent measure of holistic language proficiency (C-test) was moderately high (.69), as expected. We discuss the study findings in terms of the TOEFL Essentials test validity argument and point out limitations to the current results along with future research needs. Overall, we believe that the findings provide initial support to warrant the use of the EIT as operationalized in the TOEFL Essentials test.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-30"},"PeriodicalIF":0.0,"publicationDate":"2021-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12338","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47986785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}