Sugene Cho-Baker, Harrison J. Kell, Daniel Fishtein
The career gains of obtaining a graduate degree are well established, but individuals from lower socioeconomic status (SES) and underrepresented demographic backgrounds have persistently been disadvantaged in earning those degrees. We aim to contribute to research on enhancing access, diversity, and equity in graduate education by providing insights into what motivates individuals to pursue a graduate education across demographic and socioeconomic backgrounds. Using survey data collected from GRE® test takers at two time points and exploratory structural equation modeling, we explore the factors that individuals consider to be important for pursuing graduate education and selecting graduate programs, along with subsequent application and acceptance outcomes. We identified three factors considered in deciding to pursue graduate school and six factors considered in selecting graduate school programs. Those who aimed to apply to graduate school for professional development considered an extensive set of factors in selecting programs. The factors considered varied by gender, ethnicity/race, and SES. These factors further varied in the extent to which they predicted graduate school application and acceptance outcomes.
{"title":"Factors Considered in Graduate School Decision-Making: Implications for Graduate School Application and Acceptance","authors":"Sugene Cho-Baker, Harrison J. Kell, Daniel Fishtein","doi":"10.1002/ets2.12348","DOIUrl":"10.1002/ets2.12348","url":null,"abstract":"<p>The career gains of obtaining a graduate degree are well established, but those from lower socioeconomic status (SES) and underrepresented demographic backgrounds have persistently been disadvantaged in earning those degrees. We aim to contribute to research on enhancing access, diversity, and equity to graduate education by providing insights into what motivates individuals to pursue a graduate education across demographic and socioeconomic backgrounds. Using survey data collected from <i>GRE</i>® test takers at two time points and exploratory structural equation modeling, we explore the factors that individuals consider to be important for pursuing graduate education and selecting graduate programs, along with subsequent application and acceptance outcomes. We identified three factors considered in deciding to pursue graduate school and six factors considered in selecting graduate school programs. Those who aimed to apply to graduate school for professional development considered an extensive set of factors in selecting programs. The factors considered varied by gender, ethnicity/race, and SES. These factors further varied in the extent to which they predicted graduate school application and acceptance outcomes.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-18"},"PeriodicalIF":0.0,"publicationDate":"2022-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12348","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44481902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Synthetically generated speech (SGS) has become an integral part of our oral communication in a wide variety of contexts. It can be generated instantly at a low cost and allows precise control over multiple aspects of output, all of which can be highly appealing to second language (L2) assessment developers who have traditionally relied upon human voice actors for recording audio materials. Nevertheless, SGS is not widely used in L2 assessments. One major concern in this use case lies in its potential impact on test-taker performance: Would the use of SGS (as opposed to human voice actor recordings) change how test takers respond to an item? In this study, we investigated the impact of using SGS as stimuli for English L2 listening assessment items on test-taker performance. The data came from a pilot administration of multiple new task types and included 653 test takers' responses to two versions of the same 13 items, differing only in their listening stimuli: one version used human voice actor recordings, and the other used SGS files. Multifaceted comparisons of test takers' responses across the two versions showed that they elicited remarkably comparable performance. This comparability provides strong empirical evidence for the use of SGS as a viable alternative to human voice actor recordings in the immediate domain of L2 assessment as well as in related domains such as learning material and research instrument development.
{"title":"The Impact of Using Synthetically Generated Listening Stimuli on Test-Taker Performance: A Case Study With Multiple-Choice, Single-Selection Items","authors":"Ikkyu Choi, Jiyun Zu","doi":"10.1002/ets2.12347","DOIUrl":"10.1002/ets2.12347","url":null,"abstract":"<p>Synthetically generated speech (SGS) has become an integral part of our oral communication in a wide variety of contexts. It can be generated instantly at a low cost and allows precise control over multiple aspects of output, all of which can be highly appealing to second language (L2) assessment developers who have traditionally relied upon human voice actors for recording audio materials. Nevertheless, SGS is not widely used in L2 assessments. One major concern in this use case lies in its potential impact on test-taker performance: Would the use of SGS (as opposed to using human voice actor recordings) change how test takers respond to an item? In this study, we investigated using SGS as stimuli for English L2 listening assessment items on test-taker performance. The data came from a pilot administration of multiple new task types and included 653 test takers' responses to two versions of the same 13 items, differing only in terms of their listening stimuli: a version using human voice actor recordings and the other version with SGS files. Multifaceted comparisons between test takers' responses across the two versions showed that the two versions elicited remarkably comparable performance. The comparability provides strong empirical evidence for the use of SGS as a viable alternative for human voice actor recordings in the immediate domain of L2 assessment as well as related domains such as learning material and research instrument development.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-14"},"PeriodicalIF":0.0,"publicationDate":"2022-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12347","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48532571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kevin M. Williams, Michelle P. Martin-Raugh, Jennifer E. Lentini
Researchers and practitioners in postsecondary and workplace settings recognize the value of noncognitive constructs in predicting academic and vocational success but also perceive that many students or employees are lacking in these areas. In turn, there is increased interest in interventions designed to enhance these constructs. We provide an empirically informed theory of change (ToC) that describes the inputs, mechanisms, and outputs of noncognitive construct interventions (NCIs). The components that inform this ToC include specific relevant constructs that are amenable to intervention, intervention content and mechanisms of change, methodological considerations, moderators of program efficacy, recommendations for evaluating NCIs, and suggested outcomes. Ultimately, NCIs should provide benefits to individuals, institutions, and society at large while also advancing our scientific understanding of how noncognitive constructs can be improved.
{"title":"Improving Noncognitive Constructs for Career Readiness and Success: A Theory of Change for Postsecondary, Workplace, and Research Applications","authors":"Kevin M. Williams, Michelle P. Martin-Raugh, Jennifer E. Lentini","doi":"10.1002/ets2.12346","DOIUrl":"10.1002/ets2.12346","url":null,"abstract":"<p>Researchers and practitioners in postsecondary and workplace settings recognize the value of noncognitive constructs in predicting academic and vocational success but also perceive that many students or employees are lacking in these areas. In turn, there is increased interest in interventions designed to enhance these constructs. We provide an empirically informed theory of change (ToC) that describes the inputs, mechanisms, and outputs of noncognitive construct interventions (NCIs). The components that inform this ToC include specific relevant constructs that are amenable to intervention, intervention content and mechanisms of change, methodological considerations, moderators of program efficacy, recommendations for evaluating NCIs, and suggested outcomes. In turn, NCIs should provide benefits to individuals, institutions, and society at large and also advance our scientific understanding of this important phenomenon.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-11"},"PeriodicalIF":0.0,"publicationDate":"2022-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12346","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45303138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Margarita Olivera-Aguilar, Hee-Sun Lee, Amy Pallant, Vinetha Belur, Matthew Mulholland, Ou Lydia Liu
This study uses a computerized formative assessment system that provides automated scoring and feedback to help students write scientific arguments in a climate change curriculum. We compared the effect of contextualized versus generic automated feedback on students' explanations of scientific claims and attributions of uncertainty to those claims. Classes were randomly assigned to the contextualized feedback condition (227 students from 11 classes) or to the generic feedback condition (138 students from 9 classes). The results indicate that the formative assessment helped students improve both their explanation and uncertainty attribution scores, with larger gains in the uncertainty attribution scores. Although the contextualized feedback was associated with higher final scores, this effect was moderated by the number of revisions made, the initial score, and gender. We discuss how the results might be related to students' familiarity with writing scientific explanations versus uncertainty attributions at school.
{"title":"Comparing the Effect of Contextualized Versus Generic Automated Feedback on Students' Scientific Argumentation","authors":"Margarita Olivera-Aguilar, Hee-Sun Lee, Amy Pallant, Vinetha Belur, Matthew Mulholland, Ou Lydia Liu","doi":"10.1002/ets2.12344","DOIUrl":"https://doi.org/10.1002/ets2.12344","url":null,"abstract":"<p>This study uses a computerized formative assessment system that provides automated scoring and feedback to help students write scientific arguments in a climate change curriculum. We compared the effect of contextualized versus generic automated feedback on students' explanations of scientific claims and attributions of uncertainty to those claims. Classes were randomly assigned to the contextualized feedback condition (227 students from 11 classes) or to the generic feedback condition (138 students from 9 classes). The results indicate that the formative assessment helped students improve their scores in both explanation and uncertainty scores, but larger score gains were found in the uncertainty attribution scores. Although the contextualized feedback was associated with higher final scores, this effect was moderated by the number of revisions made, the initial score, and gender. We discuss how the results might be related to students' familiarity with writing scientific explanations versus uncertainty attributions at school.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-14"},"PeriodicalIF":0.0,"publicationDate":"2022-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12344","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"109171717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Katrina Roohr, Margarita Olivera-Aguilar, Jennifer Bochenek, Vinetha Belur
The United States continues to be a top destination for international students pursuing an advanced degree. Some information about the characteristics of international students applying to graduate programs in the United States is available, but little is known about how these characteristics are related to test-taker performance on graduate admissions tests and how performance may be related to graduate program characteristics. The purpose of this study was to investigate different patterns of performance of international test takers from four cultural regions and two large countries (China and India) on both the GRE® test and the TOEFL® test and the relationship with demographic and graduate program characteristics. Using finite mixture modeling, we investigated the most common GRE and TOEFL score profiles for international students intending to pursue a graduate program within the United States; evaluated the demographic and college-level factors related to the profiles; and evaluated whether the profiles were differentially associated with gender, intended field of study, and intended degree level. Results showed the following broad patterns: (a) Most countries and cultural regions, except for the Middle East, had three or four latent profiles representing low, medium, and high scores on the GRE and TOEFL sections; (b) two high-performing profiles were found in Confucian Asia, one with higher GRE Quantitative Reasoning scores and the other with higher scores on GRE Verbal and TOEFL; (c) regardless of profile, test takers from China performed highest on the GRE Quantitative Reasoning section as compared to other GRE and TOEFL section scores; (d) in general, students in the lower-performing profiles were more likely to have taken the TOEFL and GRE multiple times; (e) regardless of country or cultural region, men were represented more than women overall and across most of the profiles; and (f) test takers showed a preference for science-, technology-, engineering-, and mathematics-based fields and master's degrees, although this varied across country and cultural region. Implications for future research are discussed.
{"title":"Exploring GRE® and TOEFL® Score Profiles of International Students Intending to Pursue a Graduate Degree in the United States","authors":"Katrina Roohr, Margarita Olivera-Aguilar, Jennifer Bochenek, Vinetha Belur","doi":"10.1002/ets2.12343","DOIUrl":"https://doi.org/10.1002/ets2.12343","url":null,"abstract":"<p>The United States continues to be a top destination for international students pursuing an advanced degree. Some information about the characteristics of international students applying to graduate programs in the United States is available, but little is known about how these characteristics are related to test taker performance on graduate admissions tests and how performance may be related to graduate program characteristics. The purpose of this study was to investigate different patterns of performance of international test takers from four cultural regions and two large countries (China and India) on both the <i>GRE</i>® test and the <i>TOEFL</i>® test and the relationship with demographic and graduate program characteristics. Using finite mixture modeling, we investigated the most common score profiles using GRE and TOEFL for international students intending to pursue a graduate program within the United States; evaluated the demographic and college-level factors related to the profiles; and evaluated whether the profiles were differentially associated with gender, intended field of study, and intended degree level. Results showed the following broad patterns of results: (a) Most countries and cultural regions, except for the Middle East, had three or four latent profiles representing low, medium, and high scores on the GRE and TOEFL sections; (b) two high-performing profiles were found in Confucian Asia, one with higher GRE Quantitative Reasoning scores and the other with higher scores on GRE Verbal and TOEFL; (c) regardless of profile, test takers from China performed highest on the GRE Quantitative Reasoning section as compared to other GRE and TOEFL section scores; (d) in general, there was a relationship with students in the lower performing profiles taking the TOEFL and GRE multiple times; (e) regardless of country or cultural region, men were represented more than women overall and across most of the profiles; and (f) test takers showed a preference for science-, technology-, engineering-, and mathematics-based fields and master's degrees, but this varied across country and cultural region. Implications for future research are discussed.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-27"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12343","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"109172488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hongwen Guo, Joseph A. Rios, Guangming Ling, Zhen Wang, Lin Gu, Zhitong Yang, Lydia O. Liu
Different variants of the selected-response (SR) item type have been developed for various reasons (e.g., simulating realistic situations, examining critical-thinking and/or problem-solving skills). Generally, the variants of the SR item format are more complex than traditional multiple-choice (MC) items, which may make them more challenging to test takers and thus may discourage test engagement on low-stakes assessments. Low test-taking effort has been shown to distort test scores and thereby diminish score validity. We used data collected from a large-scale assessment to investigate how variants of the SR item format may affect test properties and test engagement. Results show that the studied variants of the SR item format were generally harder and more time consuming than the traditional MC item format, but they showed no negative impact on test-taking effort. However, item position had a dominant influence on nonresponse rates and rapid-guessing rates in a cumulative fashion, even though the effect sizes were relatively small in the studied data.
{"title":"Influence of Selected-Response Format Variants on Test Characteristics and Test-Taking Effort: An Empirical Study","authors":"Hongwen Guo, Joseph A. Rios, Guangming Ling, Zhen Wang, Lin Gu, Zhitong Yang, Lydia O. Liu","doi":"10.1002/ets2.12345","DOIUrl":"10.1002/ets2.12345","url":null,"abstract":"<p>Different variants of the selected-response (SR) item type have been developed for various reasons (i.e., simulating realistic situations, examining critical-thinking and/or problem-solving skills). Generally, the variants of SR item format are more complex than the traditional multiple-choice (MC) items, which may be more challenging to test takers and thus may discourage their test engagement on low-stakes assessments. Low test-taking effort has been shown to distort test scores and thereby diminish score validity. We used data collected from a large-scale assessment to investigate how variants of the SR item format may impact test properties and test engagement. Results show that the studied variants of SR item format were generally harder and more time consuming compared to the traditional MC item format, but they did not show negative impact on test-taking effort. However, item position had a dominating influence on nonresponse rates and rapid-guessing rates in a cumulative fashion, even though the effect sizes were relatively small in the studied data.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2022 1","pages":"1-20"},"PeriodicalIF":0.0,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12345","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47274215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this report, we demonstrate the use of differential response time (DRT) methodology, an extension of differential item functioning methodology, for examining differences in how students from different backgrounds engage with assessment tasks. We analyze response time data from a digitally delivered mathematics assessment to examine timing differences between English language learner (ELL) and non-ELL student groups. When the groups were matched on total sum scores for the studied test form, ELLs spent significantly more time on most items than non-ELLs who performed similarly on the form. When the groups were matched on total response time, ELL students spent significantly more time on items in the first half of the form but less time on items in the second half. This research demonstrates the usefulness of DRT methodology in gaining insights about the differential engagement of students with assessment tasks.
{"title":"Comparing Test-Taking Behaviors of English Language Learners (ELLs) to Non-ELL Students: Use of Response Time in Measurement Comparability Research","authors":"Hongwen Guo, Kadriye Ercikan","doi":"10.1002/ets2.12340","DOIUrl":"10.1002/ets2.12340","url":null,"abstract":"<p>In this report, we demonstrate use of differential response time (DRT) methodology, an extension of differential item functioning methodology, for examining differences in how students from different backgrounds engage with assessment tasks. We analyze response time data from a digitally delivered mathematics assessment to examine timing differences between English language learner (ELL) and non-ELL student groups. When matched on the total sum scores of the studied item form, results showed that ELLs spent a significantly longer time on most items compared to the non-ELLs who performed similarly on the test form. When matched on the total response time, results showed that ELL students spent a significantly longer time on items in the first half of the form but a shorter time on items in the second half. This research demonstrates the usefulness of DRT methodology in gaining insights about the differential engagement of students with assessment tasks.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-15"},"PeriodicalIF":0.0,"publicationDate":"2021-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12340","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48342715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Agreement statistics and measures of prediction accuracy are often used to assess the quality of two measures of a construct. Agreement statistics are appropriate for measures that are supposed to be interchangeable, whereas prediction accuracy statistics are appropriate for situations where one variable is the target and the other variables are predictors. Using bivariate normality assumptions, we analytically examine the impact of categorization of a continuous variable and mean/sigma scaling on different measures of agreement and different measures of prediction accuracy. We vary the degree of relationship (squared correlation) between two continuous measures of a construct and the degree to which these measures are reduced to fewer and fewer categories (categorization). The main findings include that (a) categorization influences all the statistics investigated, (b) the correlation between the continuous variables affects the values of the statistics, and (c) scaling a prediction of a target variable to have the same mean and variability as the target increases agreement (according to Cohen's kappa and quadratic weighted kappa) but does so at the expense of prediction accuracy. The implications of these results for scoring of essays by humans or machines are also discussed.
{"title":"Impact of Categorization and Scaling on Classification Agreement and Prediction Accuracy Statistics","authors":"Wei Wang, Neil J. Dorans","doi":"10.1002/ets2.12339","DOIUrl":"10.1002/ets2.12339","url":null,"abstract":"<p>Agreement statistics and measures of prediction accuracy are often used to assess the quality of two measures of a construct. Agreement statistics are appropriate for measures that are supposed to be interchangeable, whereas prediction accuracy statistics are appropriate for situations where one variable is the target and the other variables are predictors. Using bivariate normality assumptions, we analytically examine the impact of categorization of a continuous variable and mean/sigma scaling on different measures of agreement and different measures of prediction accuracy. We vary the degree of relationship (squared correlation) between two continuous measures of a construct and the degree to which these measures are reduced to fewer and fewer categories (categorization). The main findings include that (a) categorization influences all the statistics investigated, (b) the correlation between the continuous variables affects the values of the statistics, and (c) scaling a prediction of a target variable to have the same mean and variability as the target increases agreement (according to Cohen's kappa and quadratic weighted kappa) but does so at the expense of prediction accuracy. The implications of these results for scoring of essays by humans or machines are also discussed.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-20"},"PeriodicalIF":0.0,"publicationDate":"2021-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12339","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44853401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shoko Sasayama, Pablo Garcia Gomez, John M. Norris
This report describes the development of efficient second language (L2) writing assessment tasks designed specifically for low-proficiency learners of English to be included in the TOEFL® Essentials™ test. Based on the can-do descriptors of the Common European Framework of Reference for Languages for the A1 through B1 levels of proficiency, four task types were identified as prototypical candidate writing tasks for the target test-taker population (i.e., adolescent and adult low-proficiency English learners): (a) Describe a Photo, (b) Write a Review, (c) Chat With a Friend, and (d) Write an E-mail. These task types were also considered efficient within the design of the test in that they had the potential to be accessible to low-proficiency learners and to elicit sufficient spontaneous writing for assessment purposes within a short period of time. In the current study, eight assessment tasks, two for each task type, were developed and piloted with 169 A1–B1 learners of English from Japan and Colombia. The findings revealed that the Describe a Photo and Write an E-mail tasks performed best at eliciting substantial language use and emphasizing distinct performance attributes, both characteristics needed for efficiently measuring test takers' writing proficiency and for discriminating among proficiency levels at the lower end of the spectrum. The report concludes by highlighting some observations on L2 writing assessment task design for low-proficiency learners of English.
{"title":"Designing Efficient L2 Writing Assessment Tasks for Low-Proficiency Learners of English","authors":"Shoko Sasayama, Pablo Garcia Gomez, John M. Norris","doi":"10.1002/ets2.12341","DOIUrl":"10.1002/ets2.12341","url":null,"abstract":"<p>This report describes the development of efficient second language (L2) writing assessment tasks designed specifically for low-proficiency learners of English to be included in the <i>TOEFL® Essentials™</i> test. Based on the can-do descriptors of the Common European Framework of Reference for Languages for the A1 through B1 levels of proficiency, four task types were identified to be prototypical candidate writing tasks for the target test-taker population (i.e., adolescent and adult low-proficiency English learners). Those four task types included: (a) Describe a Photo, (b) Write a Review, (c) Chat With a Friend, and (d) Write an E-mail. These task types were also considered efficient in the framework of the test in that they had the potential to be accessible to low-proficiency learners and to elicit sufficient spontaneous writing for assessment purposes within a short period of time. In the current study, eight assessment tasks, two for each task type, were developed and piloted with 169 A1–B1 learners of English from Japan and Colombia. The findings revealed that the Describe a Photo and Write an E-mail tasks performed the best in eliciting substantial language use and emphasizing distinct performance attributes, both characteristics needed for efficiently measuring test takers' writing proficiency as well as discriminating among proficiency levels at the lower end of the spectrum. The report concludes by highlighting some observations on L2 writing assessment task design for low-proficiency learners of English.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-31"},"PeriodicalIF":0.0,"publicationDate":"2021-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12341","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46674905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The elicited imitation task (EIT), in which language learners listen to a series of spoken sentences and repeat each one verbatim, is a commonly used measure of language proficiency in second language acquisition research. The TOEFL® Essentials™ test includes an EIT as a holistic measure of speaking proficiency, referred to as the "Listen and Repeat" task type. In this report, we describe the design considerations that informed the development of the EIT for TOEFL Essentials. We also report the results of a series of investigations conducted during the prototyping and pilot phases of test development, undertaken with the goal of confirming task design specifications, evaluating scoring performance, and obtaining initial validity evidence to support score interpretation and use of the EIT in the TOEFL Essentials test. We found that task design variables generally performed as expected. The length of the input sentence was strongly associated with performance (Pearson r = .88), consistent with the construct measured by the EIT, while task variables not directly related to the EIT construct (e.g., graphics, speaker accent, and response time) did not affect performance. Scorers drawn from TOEFL iBT test raters were able to score responses consistently, with over 98% exact or adjacent interrater agreement on a 6-point scale, and scores on the pilot version of the EIT were highly reliable (Cronbach's α = .93 on the 15-item pilot version). Correlations between EIT scores and other measures were generally as expected: Correlations with other speaking tasks were high (.78–.84), while correlations with other language measures were somewhat lower (.73 for writing, .68 for listening, and .57 for reading). The correlation with an independent measure of holistic language proficiency (a C-test) was moderately high (.69), as expected. We discuss the study findings in terms of the TOEFL Essentials test validity argument and point out limitations of the current results along with future research needs. Overall, we believe that the findings provide initial support to warrant the use of the EIT as operationalized in the TOEFL Essentials test.
{"title":"Developing an Innovative Elicited Imitation Task for Efficient English Proficiency Assessment","authors":"Larry Davis, John Norris","doi":"10.1002/ets2.12338","DOIUrl":"10.1002/ets2.12338","url":null,"abstract":"<p>The elicited imitation task (EIT), in which language learners listen to a series of spoken sentences and repeat each one verbatim, is a commonly used measure of language proficiency in second language acquisition research. The <i>TOEFL</i>® <i>Essentials</i>™ test includes an EIT as a holistic measure of speaking proficiency, referred to as the “Listen and Repeat” task type. In this report, we describe the design considerations that informed the development of the EIT for TOEFL Essentials. We also report the results of a series of investigations conducted during the prototyping and pilot phases of test development, which were undertaken with the goal of confirming task design specifications, evaluating scoring performance, and obtaining initial validity evidence to support score interpretation and use of the EIT in the TOEFL Essentials test. We found that task design variables generally performed as expected. The length of input sentence was strongly associated with performance (Pearson <i>r</i> = .88), consistent with the construct measured by the EIT, while other task variables not directly related to the EIT construct did not impact performance (e.g., graphics, speaker accent, and response time). Scorers drawn from TOEFL iBT test raters were able to score responses consistently with over 98% exact or adjacent interrater agreement on a 6-point scale, and scores on the pilot version of the EIT were highly reliable (Cronbach's α = .93 on the 15-item pilot version). Correlations between EIT scores and other measures were generally as expected: Correlations with other speaking tasks were high (.78–.84) and slightly to somewhat lower for other language measures (.73 for writing, .68 for listening, and .57 for reading). Correlation with an independent measure of holistic language proficiency (C-test) was moderately high (.69), as expected. We discuss the study findings in terms of the TOEFL Essentials test validity argument and point out limitations to the current results along with future research needs. Overall, we believe that the findings provide initial support to warrant the use of the EIT as operationalized in the TOEFL Essentials test.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2021 1","pages":"1-30"},"PeriodicalIF":0.0,"publicationDate":"2021-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12338","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47986785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}