Assessing the speaking proficiency of L2 Chinese learners: Review of the Hanyu Shuiping Kouyu Kaoshi
Pub Date: 2023-04-27 | DOI: 10.1177/02655322231163470
Albert W. Li
I have seen a couple of international students that achieved good scores on the HSK level 5— the advanced-level Chinese proficiency test, and yet [they] can barely communicate at all in Chinese, not even daily conversation like “how was your weekend?” (A professor who teaches Chinese at a Confucius Institute in the USA, Interview, February 26, 2022)
{"title":"Assessing the speaking proficiency of L2 Chinese learners: Review of the Hanyu Shuiping Kouyu Kaoshi","authors":"Albert W. Li","doi":"10.1177/02655322231163470","DOIUrl":"https://doi.org/10.1177/02655322231163470","url":null,"abstract":"I have seen a couple of international students that achieved good scores on the HSK level 5— the advanced-level Chinese proficiency test, and yet [they] can barely communicate at all in Chinese, not even daily conversation like “how was your weekend?” (A professor who teaches Chinese at a Confucius Institute in the USA, Interview, February 26, 2022)","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"1007 - 1021"},"PeriodicalIF":4.1,"publicationDate":"2023-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47536879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speaking performances, stakeholder perceptions, and test scores: Extrapolating from the Duolingo English Test to the university
Pub Date: 2023-04-24 | DOI: 10.1177/02655322231165984
Daniel R. Isbell, Dustin Crowther, H. Nishizawa
The extrapolation of test scores to a target domain—that is, the association between test performances and relevant real-world outcomes—is critical to valid score interpretation and use. This study examined the relationship between Duolingo English Test (DET) speaking scores and university stakeholders’ evaluation of DET speaking performances. A total of 190 university stakeholders (45 faculty members, 39 administrative staff, 53 graduate students, 53 undergraduate students) evaluated the comprehensibility (ease of understanding) and academic acceptability of 100 DET test-takers’ speaking performances. Academic acceptability was judged based on speakers’ suitability for communicative roles in the university context, including undergraduate study, group work in courses, graduate study, and teaching. Analyses indicated a large correlation between aggregate measures of comprehensibility and acceptability (r = .98). Acceptability ratings varied according to role: acceptability for teaching was held to a notably higher standard than acceptability for undergraduate study. Stakeholder groups also differed in their ratings, with faculty tending to be more lenient in their ratings of comprehensibility and acceptability than undergraduate students and staff. Finally, both comprehensibility and acceptability measures correlated strongly with speakers’ official DET scores and subscores (r = .74–.89), providing some support for the extrapolation of DET scores to academic contexts.
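The central analysis here is correlational: per-speaker aggregates of comprehensibility and acceptability are related to each other and to official DET scores. A minimal sketch of that step in Python, using simulated ratings rather than the study's data (all variable names and values below are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

# Invented per-speaker aggregates for 100 DET speaking performances.
rng = np.random.default_rng(0)
det_scores = rng.uniform(70, 160, size=100)               # official DET scores (illustrative range)
comprehensibility = det_scores + rng.normal(0, 10, 100)   # mean comprehensibility rating per speaker
acceptability = det_scores + rng.normal(0, 12, 100)       # mean acceptability rating per speaker

r_ca, _ = pearsonr(comprehensibility, acceptability)  # aggregate ratings vs each other
r_cs, _ = pearsonr(comprehensibility, det_scores)     # ratings vs official scores
r_as, _ = pearsonr(acceptability, det_scores)

print(f"comprehensibility vs acceptability: r = {r_ca:.2f}")
print(f"comprehensibility vs DET score:     r = {r_cs:.2f}")
print(f"acceptability vs DET score:         r = {r_as:.2f}")
```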
{"title":"Speaking performances, stakeholder perceptions, and test scores: Extrapolating from the Duolingo English test to the university","authors":"Daniel R. Isbell, Dustin Crowther, H. Nishizawa","doi":"10.1177/02655322231165984","DOIUrl":"https://doi.org/10.1177/02655322231165984","url":null,"abstract":"The extrapolation of test scores to a target domain—that is, association between test performances and relevant real-world outcomes—is critical to valid score interpretation and use. This study examined the relationship between Duolingo English Test (DET) speaking scores and university stakeholders’ evaluation of DET speaking performances. A total of 190 university stakeholders (45 faculty members, 39 administrative staff, 53 graduate students, 53 undergraduate students) evaluated the comprehensibility (ease of understanding) and academic acceptability of 100 DET test-takers’ speaking performances. Academic acceptability was judged based on speakers’ suitability for communicative roles in the university context including undergraduate study, group work in courses, graduate study, and teaching. Analyses indicated a large correlation between aggregate measures of comprehensibility and acceptability ( r = .98). Acceptability ratings varied according to role: acceptability for teaching was held to a notably higher standard than acceptability for undergraduate study. Stakeholder groups also differed in their ratings, with faculty tending to be more lenient in their ratings of comprehensibility and acceptability than undergraduate students and staff. Finally, both comprehensibility and acceptability measures correlated strongly with speakers’ official DET scores and subscores ( r ⩾ .74–.89), providing some support for the extrapolation of DET scores to academic contexts.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"1 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2023-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41653795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Establishing meaning recall and meaning recognition vocabulary knowledge as distinct psychometric constructs in relation to reading proficiency
Pub Date: 2023-04-24 | DOI: 10.1177/02655322231162853
J. Stewart, Henrik Gyllstad, Christopher Nicklin, Stuart Mclean
The purpose of this paper is to (a) establish whether meaning recall and meaning recognition item formats test psychometrically distinct constructs of vocabulary knowledge which measure separate skills, and, if so, (b) determine whether each construct possesses unique properties predictive of L2 reading proficiency. Factor analyses and hierarchical regression were conducted on results derived from the two vocabulary item formats in order to test this hypothesis. The results indicated that although the two-factor model had better fit and meaning recall and meaning recognition can be considered distinct psychometrically, discriminant validity between the two factors is questionable. In hierarchical regression models, meaning recognition knowledge did not make a statistically significant contribution to explaining reading proficiency over meaning recall knowledge. However, when the roles were reversed, meaning recall did make a significant contribution to the model beyond the variance explained by meaning recognition alone. The results suggest that meaning recognition does not tap into unique aspects of vocabulary knowledge and provide empirical support for meaning recall as a superior predictor of reading proficiency for research purposes.
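The hierarchical regression reported here asks whether each vocabulary measure explains reading variance beyond the other, which amounts to a nested-model comparison. A minimal sketch with simulated data and an incremental F-test (the data and effect sizes are invented; the study's actual modeling is more involved):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f as f_dist

# Invented scores: recall and recognition are correlated, reading depends mostly on recall.
rng = np.random.default_rng(1)
n = 400
recall = rng.normal(size=n)
recognition = 0.8 * recall + rng.normal(0, 0.6, n)
reading = 0.6 * recall + 0.1 * recognition + rng.normal(0, 1, n)

def r2(y, X):
    return sm.OLS(y, sm.add_constant(X)).fit().rsquared

# Step 1: recognition only; Step 2: recognition + recall (the reverse order works the same way).
r2_step1 = r2(reading, np.column_stack([recognition]))
r2_step2 = r2(reading, np.column_stack([recognition, recall]))

# Incremental F-test for the single added predictor.
df2 = n - 3  # n minus (intercept + 2 predictors)
F = (r2_step2 - r2_step1) / ((1 - r2_step2) / df2)
p = f_dist.sf(F, 1, df2)
print(f"Delta R^2 = {r2_step2 - r2_step1:.3f}, F(1, {df2}) = {F:.2f}, p = {p:.4g}")
```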
{"title":"Establishing meaning recall and meaning recognition vocabulary knowledge as distinct psychometric constructs in relation to reading proficiency","authors":"J. Stewart, Henrik Gyllstad, Christopher Nicklin, Stuart Mclean","doi":"10.1177/02655322231162853","DOIUrl":"https://doi.org/10.1177/02655322231162853","url":null,"abstract":"The purpose of this paper is to (a) establish whether meaning recall and meaning recognition item formats test psychometrically distinct constructs of vocabulary knowledge which measure separate skills, and, if so, (b) determine whether each construct possesses unique properties predictive of L2 reading proficiency. Factor analyses and hierarchical regression were conducted on results derived from the two vocabulary item formats in order to test this hypothesis. The results indicated that although the two-factor model had better fit and meaning recall and meaning recognition can be considered distinct psychometrically, discriminant validity between the two factors is questionable. In hierarchical regression models, meaning recognition knowledge did not make a statistically significant contribution to explaining reading proficiency over meaning recall knowledge. However, when the roles were reversed, meaning recall did make a significant contribution to the model beyond the variance explained by meaning recognition alone. The results suggest that meaning recognition does not tap into unique aspects of vocabulary knowledge and provide empirical support for meaning recall as a superior predictor of reading proficiency for research purposes.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":" ","pages":""},"PeriodicalIF":4.1,"publicationDate":"2023-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48433203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modeling local item dependence in C-tests with the loglinear Rasch model
Pub Date: 2023-04-15 | DOI: 10.1177/02655322231155109
Purya Baghaei, K. Christensen
C-tests are gap-filling tests mainly used as rough and economical measures of second-language proficiency for placement and research purposes. A C-test usually consists of several short independent passages in which the second half of every other word is deleted. Owing to their interdependent structure, C-test items violate the local independence assumption of IRT models, which poses problems for IRT analysis of C-tests. A few strategies and psychometric models have been suggested and employed in the literature to circumvent the problem. In this research, a new psychometric model, the loglinear Rasch model, is applied to C-tests and the results are compared with those of the dichotomous Rasch model, in which local item dependence is ignored. Findings showed that the loglinear Rasch model fits significantly better than the dichotomous Rasch model. Examination of the locally dependent items revealed no pattern in their content; it did, however, show that 50% of the dependent items were adjacent items. Implications of the study for modeling local dependence in C-tests using different approaches are discussed.
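The construction principle stated above, deleting the second half of every other word, can be made concrete with a small generator. This is a toy sketch only; real C-tests typically leave an initial stretch of text intact and follow additional conventions not modeled here:

```python
import re

def make_c_test(passage: str, start_word: int = 1) -> str:
    """Toy C-test generator: delete the second half of every other word.

    start_word controls where the deletion pattern begins; real C-tests
    usually leave the first sentence untouched, which is not modeled here.
    """
    words = passage.split()
    out = []
    for i, word in enumerate(words):
        # Separate trailing punctuation so it is preserved.
        m = re.match(r"^(\w+)(\W*)$", word)
        if m and i >= start_word and (i - start_word) % 2 == 0:
            core, tail = m.groups()
            keep = (len(core) + 1) // 2  # keep the first half (rounding up for odd lengths)
            out.append(core[:keep] + "_" * (len(core) - keep) + tail)
        else:
            out.append(word)
    return " ".join(out)

print(make_c_test("Language tests are used for placement and research purposes."))
```

Because the gaps within one passage draw on shared context, items built this way are locally dependent, which is exactly what motivates the loglinear Rasch model examined in the study.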
{"title":"Modeling local item dependence in C-tests with the loglinear Rasch model","authors":"Purya Baghaei, K. Christensen","doi":"10.1177/02655322231155109","DOIUrl":"https://doi.org/10.1177/02655322231155109","url":null,"abstract":"C-tests are gap-filling tests mainly used as rough and economical measures of second-language proficiency for placement and research purposes. A C-test usually consists of several short independent passages where the second half of every other word is deleted. Owing to their interdependent structure, C-test items violate the local independence assumption of IRT models. This poses some problems for IRT analysis of C-tests. A few strategies and psychometric models have been suggested and employed in the literature to circumvent the problem. In this research, a new psychometric model, namely, the loglinear Rasch model, is used for C-tests and the results are compared with the dichotomous Rasch model where local item dependence is ignored. Findings showed that the loglinear Rasch model fits significantly better than the dichotomous Rasch model. Examination of the locally dependent items did not reveal anything as regards their contents. However, it did reveal that 50% of the dependent items were adjacent items. Implications of the study for modeling local dependence in C-tests using different approaches are discussed.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"820 - 827"},"PeriodicalIF":4.1,"publicationDate":"2023-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43887673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Examining the predictive validity of the Duolingo English Test: Evidence from a major UK university
Pub Date: 2023-04-03 | DOI: 10.1177/02655322231158550
T. Isaacs, Ruolin Hu, D. Trenkic, J. Varga
The COVID-19 pandemic has changed the university admissions and proficiency testing landscape. One change has been the meteoric rise in use of the fully automated Duolingo English Test (DET) for university entrance purposes, offering test-takers a cheaper, shorter, more accessible alternative. This rapid response study is the first to investigate the predictive value of DET scores in relation to university students’ academic attainment, taking into account students’ degree level, academic discipline, and nationality. We also compared DET test-takers’ academic performance with that of students admitted using traditional proficiency tests. Credit-weighted first-year academic grades of 1881 DET test-takers (1389 postgraduate, 492 undergraduate) enrolled at a large, research-intensive London university in Autumn 2020 were positively associated with DET Overall scores for postgraduate students (adj. r = .195) but not undergraduate students (adj. r = −.112). This result was mirrored in correlational patterns for students admitted through IELTS (n = 2651) and TOEFL iBT (n = 436), contributing to criterion-related validity evidence. Students admitted with the DET showed lower academic success than the IELTS and TOEFL iBT test-takers, although sample characteristics may have shaped this finding. We discuss implications for establishing cut scores and fostering test-takers’ academic language development through pre-sessional and in-sessional support.
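The outcome variable here is a credit-weighted grade average, which is then related to admission test scores. A minimal sketch of that computation with invented records (the study's "adj. r" values involve adjustments not modeled in this sketch):

```python
import numpy as np
from scipy.stats import pearsonr

# Invented records: per-student module grades with credit weights, plus a DET Overall score.
students = [
    {"det_overall": 120, "modules": [(15, 68), (15, 72), (30, 61)]},  # (credits, grade)
    {"det_overall": 105, "modules": [(15, 55), (30, 58), (15, 62)]},
    {"det_overall": 135, "modules": [(30, 74), (15, 70), (15, 66)]},
]

def credit_weighted_mean(modules):
    credits = np.array([c for c, _ in modules], dtype=float)
    grades = np.array([g for _, g in modules], dtype=float)
    return float(np.sum(credits * grades) / np.sum(credits))

gpa = [credit_weighted_mean(s["modules"]) for s in students]
det = [s["det_overall"] for s in students]

r, p = pearsonr(det, gpa)
print(f"Credit-weighted means: {[round(g, 1) for g in gpa]}")
print(f"Correlation with DET Overall: r = {r:.3f} (p = {p:.3f})")
```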
{"title":"Examining the predictive validity of the Duolingo English Test: Evidence from a major UK university","authors":"T. Isaacs, Ruolin Hu, D. Trenkic, J. Varga","doi":"10.1177/02655322231158550","DOIUrl":"https://doi.org/10.1177/02655322231158550","url":null,"abstract":"The COVID-19 pandemic has changed the university admissions and proficiency testing landscape. One change has been the meteoric rise in use of the fully automated Duolingo English Test (DET) for university entrance purposes, offering test-takers a cheaper, shorter, accessible alternative. This rapid response study is the first to investigate the predictive value of DET test scores in relation to university students’ academic attainment, taking into account students’ degree level, academic discipline, and nationality. We also compared DET test-takers’ academic performance with that of students admitted using traditional proficiency tests. Credit-weighted first-year academic grades of 1881 DET test-takers (1389 postgraduate, 492 undergraduate) enrolled at a large, research-intensive London university in Autumn 2020 were positively associated with DET Overall scores for postgraduate students (adj. r = .195) but not undergraduate students (adj. r = −.112). This result was mirrored in correlational patterns for students admitted through IELTS (n = 2651) and TOEFL iBT (n = 436), contributing to criterion-related validity evidence. Students admitted with DET enjoyed lower academic success than the IELTS and TOEFL iBT test-takers, although sample characteristics may have shaped this finding. We discuss implications for establishing cut scores and harnessing test-takers’ academic language development through pre-sessional and in-sessional support.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"748 - 770"},"PeriodicalIF":4.1,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45194036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal fluency and floor/ceiling scoring of intermediate and advanced speech on the ACTFL Spanish Oral Proficiency Interview–computer
Pub Date: 2023-04-01 | DOI: 10.1177/02655322221114614
Troy L. Cox, Alan V. Brown, Gregory L. Thompson
The rating of proficiency tests based on the Interagency Language Roundtable (ILR) and American Council on the Teaching of Foreign Languages (ACTFL) guidelines rests on the claim that each major level reflects hierarchical linguistic functions requiring mastery of multidimensional traits, such that each level subsumes the levels beneath it. These characteristics underlie what is commonly referred to as floor and ceiling scoring. In this binary approach to scoring, which differentiates between sustained performance and linguistic breakdown, raters evaluate many features including vocabulary use, grammatical accuracy, pronunciation, and pragmatics, yet there has been very little empirical validation of the practice of floor/ceiling scoring. This study examined the relationship between temporal oral fluency, prompt type, and proficiency level in a data set of 147 Oral Proficiency Interview–computer (OPIc) exam responses rated from Intermediate Low to Advanced High. As speakers progressed in proficiency, they were more fluent. In terms of floor and ceiling scoring, the prompts that elicited speech one level above the sustained level generally resulted in speech that was slower and showed more breakdown than speech elicited by floor-level prompts, though the differences were slight and not statistically significant. Thus, temporal fluency features alone are insufficient for floor/ceiling scoring but are likely a contributing feature.
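Temporal fluency in such studies is usually operationalized through measures like speech rate, articulation rate, and pausing. A minimal sketch of these common measures for one hypothetical OPIc-style response (the specific measures and pause thresholds used in the study may differ):

```python
# Invented response data: syllable count plus silent-pause annotations (seconds).
syllables = 182
total_time = 75.0                      # total response time in seconds
pauses = [0.9, 1.4, 0.6, 2.1, 0.8]     # silent pauses above a chosen threshold (e.g., 0.25 s)

pause_time = sum(pauses)
speaking_time = total_time - pause_time

speech_rate = syllables / total_time           # syllables per second, pauses included
articulation_rate = syllables / speaking_time  # syllables per second of actual phonation
pause_ratio = pause_time / total_time          # share of the response spent pausing

print(f"speech rate:       {speech_rate:.2f} syll/s")
print(f"articulation rate: {articulation_rate:.2f} syll/s")
print(f"pause ratio:       {pause_ratio:.2%}")
```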
{"title":"Temporal fluency and floor/ceiling scoring of intermediate and advanced speech on the ACTFL Spanish Oral Proficiency Interview–computer","authors":"Troy L. Cox, Alan V. Brown, Gregory L. Thompson","doi":"10.1177/02655322221114614","DOIUrl":"https://doi.org/10.1177/02655322221114614","url":null,"abstract":"The rating of proficiency tests that use the Inter-agency Roundtable (ILR) and American Council on the Teaching of Foreign Languages (ACTFL) guidelines claims that each major level is based on hierarchal linguistic functions that require mastery of multidimensional traits in such a way that each level subsumes the levels beneath it. These characteristics are part of what is commonly referred to as floor and ceiling scoring. In this binary approach to scoring that differentiates between sustained performance and linguistic breakdown, raters evaluate many features including vocabulary use, grammatical accuracy, pronunciation, and pragmatics, yet there has been very little empirical validation on the practice of floor/ceiling scoring. This study examined the relationship between temporal oral fluency, prompt type, and proficiency level based on a data set comprised of 147 Oral Proficiency Interview - computer (OPIc) exam responses whose ratings ranged from Intermediate Low to Advanced High [AH]. As speakers progressed in proficiency, they were more fluent. In terms of floor and ceiling scoring, the prompts that elicited speech a level above the sustained level generally resulted in speech that was slower and had more breakdown than the floor-level prompts, though the differences were slight and not significantly different. Thus, temporal fluency features alone are insufficient in floor/ceiling scoring but are likely a contributing feature.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"325 - 351"},"PeriodicalIF":4.1,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47966349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The distribution of cognates and their impact on response accuracy in the EIKEN tests
Pub Date: 2023-03-26 | DOI: 10.1177/02655322231158551
David Allen, Keita Nakamura
Although there is abundant evidence for the use of first-language (L1) knowledge by bilinguals when using a second language (L2), investigation into the impact of L1 knowledge in large-scale L2 language assessments, and discussion of how such impact may be controlled, have received little attention in the language assessment literature. This study examines these issues by investigating the use of L1-Japanese loanword knowledge in test items targeting L2-English lexical knowledge in the Reading section of the EIKEN grade-level tests, which are primarily taken by Japanese learners of English. First, the proportion of English target words that have loanwords in Japanese was determined through analysis of corpus-derived wordlists, revealing that the distribution of such items is broadly similar to that in language in general. Second, the impact of loanword frequency in Japanese (and cognate status) on response accuracy was demonstrated through statistical analysis of response data for the items. Taken together, the findings highlight the scope and impact of such cognate items in large-scale language assessments. Discussion centers on how test developers can and/or should deal with the inclusion of cognate words in terms of context validity and test fairness.
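The first analytic step described above, determining what proportion of English target words have Japanese loanword counterparts, is essentially a wordlist intersection. A toy sketch with invented wordlists (real analyses work from corpus-derived target lists and loanword databases):

```python
# Invented wordlists: English target words from test items, and English words assumed to
# have established loanword counterparts in Japanese (both sets are illustrative only).
target_words = {"computer", "democracy", "harvest", "table", "justice", "energy", "shelf", "virus"}
loanword_sources = {"computer", "table", "energy", "virus", "camera", "hotel"}

cognate_items = target_words & loanword_sources
proportion = len(cognate_items) / len(target_words)

print(f"{len(cognate_items)} of {len(target_words)} target words have Japanese loanword "
      f"counterparts ({proportion:.0%}): {sorted(cognate_items)}")
```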
{"title":"The distribution of cognates and their impact on response accuracy in the EIKEN tests","authors":"David Allen, Keita Nakamura","doi":"10.1177/02655322231158551","DOIUrl":"https://doi.org/10.1177/02655322231158551","url":null,"abstract":"Although there is abundant evidence for the use of first-language (L1) knowledge by bilinguals when using a second language (L2), investigation into the impact of L1 knowledge in large-scale L2 language assessments and discussion of how such impact may be controlled has received little attention in the language assessment literature. This study examines these issues through investigating the use of L1-Japanese loanword knowledge in test items targeting L2-English lexical knowledge in the Reading section of EIKEN grade-level tests, which are primarily taken by Japanese learners of English. First, the proportion of English target words that have loanwords in Japanese was determined through analysis of corpus-derived wordlists, revealing that the distribution of such items is broadly similar to that in language in general. Second, the impact of loanword frequency in Japanese (and cognate status) was demonstrated through statistical analysis of response data for the items. Taken together, the findings highlight the scope and impact of such cognate items in large-scale language assessments. Discussion centers on how test developers can and/or should deal with the inclusion of cognate words in terms of context validity and test fairness.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"771 - 795"},"PeriodicalIF":4.1,"publicationDate":"2023-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47752709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measuring the development of general language skills in English as a foreign language—Longitudinal invariance of the C-test
Pub Date: 2023-03-25 | DOI: 10.1177/02655322231159829
Birger Schnoor, J. Hartig, Thorsten Klinger, Alexander Naumann, I. Usanova
Research on assessing English as a foreign language (EFL) development has been growing recently. However, empirical evidence from longitudinal analyses based on substantial samples is still needed. In such settings, tests for measuring language development must meet high standards of test quality such as validity, reliability, and objectivity, and must also allow for valid interpretations of change scores, which requires longitudinal measurement invariance. The current study has a methodological focus and aims to examine the measurement invariance of a C-test used to assess EFL development in monolingual and bilingual secondary school students (n = 1956) in Germany. We apply longitudinal confirmatory factor analysis to test invariance hypotheses and obtain proficiency estimates comparable over time. As a result, we achieve residual longitudinal measurement invariance. Furthermore, our analyses support the appropriateness of altering texts in a longitudinal C-test design: anchoring repeated texts between waves establishes comparability of the measurements over time, and the information from the repeated texts is used to estimate change in the test scores. If used in such a design, a C-test provides reliable, valid, and efficient measures of EFL development for bilingual and monolingual secondary school students in Germany.
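Longitudinal measurement invariance is typically evaluated by fitting a sequence of nested CFA models (configural, metric, scalar, residual) and comparing adjacent models, commonly with a chi-square difference test. A minimal sketch of that comparison step, assuming the fit statistics have already been obtained from whatever SEM software is used; the numbers below are invented:

```python
from scipy.stats import chi2

def chi_square_difference(chisq_restricted, df_restricted, chisq_free, df_free):
    """Likelihood-ratio (chi-square difference) test for nested CFA models.

    The more restricted model (e.g., metric invariance) has the larger df.
    With robust (scaled) chi-squares a scaling correction is needed; this is
    the plain, uncorrected version.
    """
    delta_chisq = chisq_restricted - chisq_free
    delta_df = df_restricted - df_free
    p = chi2.sf(delta_chisq, delta_df)
    return delta_chisq, delta_df, p

# Invented fit statistics for a metric-invariance model nested in a configural model.
d_chi, d_df, p = chi_square_difference(chisq_restricted=312.4, df_restricted=168,
                                        chisq_free=298.7, df_free=160)
print(f"Delta chi^2({d_df}) = {d_chi:.1f}, p = {p:.3f}  "
      "(non-significant -> the invariance constraints are tenable)")
```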
{"title":"Measuring the development of general language skills in English as a foreign language—Longitudinal invariance of the C-test","authors":"Birger Schnoor, J. Hartig, Thorsten Klinger, Alexander Naumann, I. Usanova","doi":"10.1177/02655322231159829","DOIUrl":"https://doi.org/10.1177/02655322231159829","url":null,"abstract":"Research on assessing English as a foreign language (EFL) development has been growing recently. However, empirical evidence from longitudinal analyses based on substantial samples is still needed. In such settings, tests for measuring language development must meet high standards of test quality such as validity, reliability, and objectivity, as well as allow for valid interpretations of change scores, requiring longitudinal measurement invariance. The current study has a methodological focus and aims to examine the measurement invariance of a C-test used to assess EFL development in monolingual and bilingual secondary school students (n = 1956) in Germany. We apply longitudinal confirmatory factor analysis to test invariance hypotheses and obtain proficiency estimates comparable over time. As a result, we achieve residual longitudinal measurement invariance. Furthermore, our analyses support the appropriateness of altering texts in a longitudinal C-test design, which allows for the anchoring of texts between waves to establish comparability of the measurements over time using the information of the repeated texts to estimate the change in the test scores. If used in such a design, a C-test provides reliable, valid, and efficient measures for EFL development in secondary education in bilingual and monolingual students in Germany.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"796 - 819"},"PeriodicalIF":4.1,"publicationDate":"2023-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48729791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Operationalizing the reading-into-writing construct in analytic rating scales: Effects of different approaches on rating
Pub Date: 2023-03-20 | DOI: 10.1177/02655322231155561
Santi B. Lestari, Tineke Brunfaut
Assessing integrated reading-into-writing task performances is known to be challenging, and analytic rating scales have been found to better facilitate the scoring of these performances than other common types of rating scales. However, little is known about how specific operationalizations of the reading-into-writing construct in analytic rating scales may affect rating quality, and by extension score inferences and uses. Using two different analytic rating scales as proxies for two approaches to reading-into-writing construct operationalization, this study investigated the extent to which these approaches affect rating reliability and consistency. Twenty raters rated a set of reading-into-writing performances twice, each time using a different analytic rating scale, and completed post-rating questionnaires. The findings resulting from our convergent explanatory mixed-method research design show that both analytic rating scales functioned well, further supporting the use of analytic rating scales for scoring reading-into-writing. Raters reported that either type of analytic rating scale prompted them to attend to the reading-related aspects of reading-into-writing, although rating these aspects remained more challenging than judging writing-related aspects. The two scales differed, however, in the extent to which they led raters to uniform interpretations of performance difficulty levels. This study has implications for reading-into-writing scale design and rater training.
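Rating reliability and consistency across the two analytic scales can be illustrated with a simple agreement computation on one rater's scores from the two rating rounds. A sketch with invented band scores; the study's actual analyses of rating quality are more comprehensive:

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Invented band scores (1-5) that one rater gave the same 12 performances under Scale A and Scale B.
scale_a = [3, 4, 2, 5, 3, 3, 4, 2, 1, 4, 5, 3]
scale_b = [3, 4, 3, 5, 3, 2, 4, 2, 2, 4, 4, 3]

# Quadratic-weighted kappa penalizes large disagreements more heavily than near-misses.
kappa = cohen_kappa_score(scale_a, scale_b, weights="quadratic")
rho, _ = spearmanr(scale_a, scale_b)

print(f"quadratic-weighted kappa = {kappa:.2f}, Spearman rho = {rho:.2f}")
```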
{"title":"Operationalizing the reading-into-writing construct in analytic rating scales: Effects of different approaches on rating","authors":"Santi B. Lestari, Tineke Brunfaut","doi":"10.1177/02655322231155561","DOIUrl":"https://doi.org/10.1177/02655322231155561","url":null,"abstract":"Assessing integrated reading-into-writing task performances is known to be challenging, and analytic rating scales have been found to better facilitate the scoring of these performances than other common types of rating scales. However, little is known about how specific operationalizations of the reading-into-writing construct in analytic rating scales may affect rating quality, and by extension score inferences and uses. Using two different analytic rating scales as proxies for two approaches to reading-into-writing construct operationalization, this study investigated the extent to which these approaches affect rating reliability and consistency. Twenty raters rated a set of reading-into-writing performances twice, each time using a different analytic rating scale, and completed post-rating questionnaires. The findings resulting from our convergent explanatory mixed-method research design show that both analytic rating scales functioned well, further supporting the use of analytic rating scales for scoring reading-into-writing. Raters reported that either type of analytic rating scale prompted them to attend to the reading-related aspects of reading-into-writing, although rating these aspects remained more challenging than judging writing-related aspects. The two scales differed, however, in the extent to which they led raters to uniform interpretations of performance difficulty levels. This study has implications for reading-into-writing scale design and rater training.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"684 - 722"},"PeriodicalIF":4.1,"publicationDate":"2023-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44759278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ukrainian language proficiency test review
Pub Date: 2023-03-14 | DOI: 10.1177/02655322231156819
Daniil M. Ozernyi, Ruslan Suvorov
The Ukrainian Language Proficiency (ULP) test, officially titled Exam of the level of mastery of the official language (Ispyt na riven’ volodinnya derzhavnoyu movoyu), is a new test launched in Summer 2021. The name of the test in Ukrainian, incidentally, does not contain the words “Ukrainian” or “foreign language.” According to the state regulations (Kabinet Ministriv Ukrayiny [KMU], 2021a; Natsional’na Komisiya zi Standartiv Derzhavnoyi Movy [NKSDM], 2021a, 2021b), the levels of mastery of Ukrainian in the test are aligned with the CEFR levels. The test was introduced as a product of the law on the official language of Ukraine, which mandated that civil servants and citizens undergoing naturalization be fully able to use Ukrainian in performing their duties. The ULP test comprises two versions: (a) ULP for acquisition of Ukrainian citizenship (Ispyt na riven’ volodinnya derzhavnoyu movoyu (dlya nabuttya hromadyanstva)), and (b) ULP 2.0 for holding civil office (Ispyt na riven’ volodinnya derzhavnoyu movoyu 2.0 (dlya vykonannya sluzhbovyh obov’yazkiv)). To differentiate between the two versions of the test in this review, we refer to the former as ULP-C and to the latter as ULP 2.0. The purpose of this review is to apply Kunnan’s (2018) fairness and justice framework to evaluate both ULP-C and ULP 2.0, since they are united by (a) the alignment with the CEFR scale, which positions ULP 2.0 as a continuation of ULP-C, (b) the same
{"title":"Ukrainian language proficiency test review","authors":"Daniil M. Ozernyi, Ruslan Suvorov","doi":"10.1177/02655322231156819","DOIUrl":"https://doi.org/10.1177/02655322231156819","url":null,"abstract":"The Ukrainian Language Proficiency (ULP) test, officially titled Exam of the level of mastery of the official language (Ispyt na riven’ volodinnya derzhavnoyu movoyu) is a new test launched in Summer 2021. The name of the test in Ukrainian, incidentally, does not contain the words “Ukrainian” or “foreign language.” According to the state regulations (Kabinet Ministriv Ukrayiny [KMU], 2021a; Natsional’na Komisiya zi Standartiv Derzhavnoyi Movy [NKSDM], 2021a, 2021b), the levels of mastery of Ukrainian in the test are aligned with the CEFR levels.1 The test was introduced as a product of the law about the official language of Ukraine, which mandated that civil servants and citizens who are being naturalized are fully able to use Ukrainian in performing their duties. The ULP test comprises two versions: (a) ULP for acquisition of Ukrainian citizenship (Ispyt na riven’ volodinnya derzhavnoyu movoyu (dlya nabuttya hromadyanstva)), and (b) ULP 2.0 for holding civil office (Ispyt na riven’ volodinnya derzhavnoyu movoyu 2.0 (dlya vykonannya sluzhbovyh obov’yazkiv)). To differentiate between the two versions of the test in this review, we will refer to the former version as ULP-C and to the latter version as ULP 2.0. The purpose of this review is to apply Kunnan’s (2018) fairness and justice framework to evaluate both ULP-C and ULP 2.0 since they are united by (a) the alignment with the CEFR scale which poses ULP 2.0 as a continuation of ULP-C, (b) the same","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"828 - 839"},"PeriodicalIF":4.1,"publicationDate":"2023-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43284530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}