Assessing the speaking proficiency of L2 Chinese learners: Review of the Hanyu Shuiping Kouyu Kaoshi
Pub Date: 2023-04-27 | DOI: 10.1177/02655322231163470
Albert W. Li
I have seen a couple of international students that achieved good scores on the HSK level 5— the advanced-level Chinese proficiency test, and yet [they] can barely communicate at all in Chinese, not even daily conversation like “how was your weekend?” (A professor who teaches Chinese at a Confucius Institute in the USA, Interview, February 26, 2022)
{"title":"Assessing the speaking proficiency of L2 Chinese learners: Review of the Hanyu Shuiping Kouyu Kaoshi","authors":"Albert W. Li","doi":"10.1177/02655322231163470","DOIUrl":"https://doi.org/10.1177/02655322231163470","url":null,"abstract":"I have seen a couple of international students that achieved good scores on the HSK level 5— the advanced-level Chinese proficiency test, and yet [they] can barely communicate at all in Chinese, not even daily conversation like “how was your weekend?” (A professor who teaches Chinese at a Confucius Institute in the USA, Interview, February 26, 2022)","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"1007 - 1021"},"PeriodicalIF":4.1,"publicationDate":"2023-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47536879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speaking performances, stakeholder perceptions, and test scores: Extrapolating from the Duolingo English Test to the university
Pub Date: 2023-04-24 | DOI: 10.1177/02655322231165984
Daniel R. Isbell, Dustin Crowther, H. Nishizawa
The extrapolation of test scores to a target domain—that is, the association between test performances and relevant real-world outcomes—is critical to valid score interpretation and use. This study examined the relationship between Duolingo English Test (DET) speaking scores and university stakeholders’ evaluation of DET speaking performances. A total of 190 university stakeholders (45 faculty members, 39 administrative staff, 53 graduate students, 53 undergraduate students) evaluated the comprehensibility (ease of understanding) and academic acceptability of 100 DET test-takers’ speaking performances. Academic acceptability was judged based on speakers’ suitability for communicative roles in the university context, including undergraduate study, group work in courses, graduate study, and teaching. Analyses indicated a large correlation between aggregate measures of comprehensibility and acceptability (r = .98). Acceptability ratings varied according to role: acceptability for teaching was held to a notably higher standard than acceptability for undergraduate study. Stakeholder groups also differed in their ratings, with faculty tending to be more lenient in their ratings of comprehensibility and acceptability than undergraduate students and staff. Finally, both comprehensibility and acceptability measures correlated strongly with speakers’ official DET scores and subscores (r = .74–.89), providing some support for the extrapolation of DET scores to academic contexts.
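The central analysis here is correlational: per-speaker aggregates of comprehensibility and acceptability are related to each other and to official DET scores. A minimal sketch of that step in Python, using simulated ratings rather than the study's data (all variable names and values below are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

# Invented per-speaker aggregates for 100 DET speaking performances.
rng = np.random.default_rng(0)
det_scores = rng.uniform(70, 160, size=100)               # official DET scores (illustrative range)
comprehensibility = det_scores + rng.normal(0, 10, 100)   # mean comprehensibility rating per speaker
acceptability = det_scores + rng.normal(0, 12, 100)       # mean acceptability rating per speaker

r_ca, _ = pearsonr(comprehensibility, acceptability)  # aggregate ratings vs each other
r_cs, _ = pearsonr(comprehensibility, det_scores)     # ratings vs official scores
r_as, _ = pearsonr(acceptability, det_scores)

print(f"comprehensibility vs acceptability: r = {r_ca:.2f}")
print(f"comprehensibility vs DET score:     r = {r_cs:.2f}")
print(f"acceptability vs DET score:         r = {r_as:.2f}")
```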
{"title":"Speaking performances, stakeholder perceptions, and test scores: Extrapolating from the Duolingo English test to the university","authors":"Daniel R. Isbell, Dustin Crowther, H. Nishizawa","doi":"10.1177/02655322231165984","DOIUrl":"https://doi.org/10.1177/02655322231165984","url":null,"abstract":"The extrapolation of test scores to a target domain—that is, association between test performances and relevant real-world outcomes—is critical to valid score interpretation and use. This study examined the relationship between Duolingo English Test (DET) speaking scores and university stakeholders’ evaluation of DET speaking performances. A total of 190 university stakeholders (45 faculty members, 39 administrative staff, 53 graduate students, 53 undergraduate students) evaluated the comprehensibility (ease of understanding) and academic acceptability of 100 DET test-takers’ speaking performances. Academic acceptability was judged based on speakers’ suitability for communicative roles in the university context including undergraduate study, group work in courses, graduate study, and teaching. Analyses indicated a large correlation between aggregate measures of comprehensibility and acceptability ( r = .98). Acceptability ratings varied according to role: acceptability for teaching was held to a notably higher standard than acceptability for undergraduate study. Stakeholder groups also differed in their ratings, with faculty tending to be more lenient in their ratings of comprehensibility and acceptability than undergraduate students and staff. Finally, both comprehensibility and acceptability measures correlated strongly with speakers’ official DET scores and subscores ( r ⩾ .74–.89), providing some support for the extrapolation of DET scores to academic contexts.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"1 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2023-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41653795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Establishing meaning recall and meaning recognition vocabulary knowledge as distinct psychometric constructs in relation to reading proficiency
Pub Date: 2023-04-24 | DOI: 10.1177/02655322231162853
J. Stewart, Henrik Gyllstad, Christopher Nicklin, Stuart Mclean
The purpose of this paper is to (a) establish whether meaning recall and meaning recognition item formats test psychometrically distinct constructs of vocabulary knowledge which measure separate skills, and, if so, (b) determine whether each construct possesses unique properties predictive of L2 reading proficiency. Factor analyses and hierarchical regression were conducted on results derived from the two vocabulary item formats in order to test this hypothesis. The results indicated that although the two-factor model had better fit and meaning recall and meaning recognition can be considered distinct psychometrically, discriminant validity between the two factors is questionable. In hierarchical regression models, meaning recognition knowledge did not make a statistically significant contribution to explaining reading proficiency over meaning recall knowledge. However, when the roles were reversed, meaning recall did make a significant contribution to the model beyond the variance explained by meaning recognition alone. The results suggest that meaning recognition does not tap into unique aspects of vocabulary knowledge and provide empirical support for meaning recall as a superior predictor of reading proficiency for research purposes.
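The hierarchical regression reported here asks whether each vocabulary measure explains reading variance beyond the other, which amounts to a nested-model comparison. A minimal sketch with simulated data and an incremental F-test (the data and effect sizes are invented; the study's actual modeling is more involved):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f as f_dist

# Invented scores: recall and recognition are correlated, reading depends mostly on recall.
rng = np.random.default_rng(1)
n = 400
recall = rng.normal(size=n)
recognition = 0.8 * recall + rng.normal(0, 0.6, n)
reading = 0.6 * recall + 0.1 * recognition + rng.normal(0, 1, n)

def r2(y, X):
    return sm.OLS(y, sm.add_constant(X)).fit().rsquared

# Step 1: recognition only; Step 2: recognition + recall (the reverse order works the same way).
r2_step1 = r2(reading, np.column_stack([recognition]))
r2_step2 = r2(reading, np.column_stack([recognition, recall]))

# Incremental F-test for the single added predictor.
df2 = n - 3  # n minus (intercept + 2 predictors)
F = (r2_step2 - r2_step1) / ((1 - r2_step2) / df2)
p = f_dist.sf(F, 1, df2)
print(f"Delta R^2 = {r2_step2 - r2_step1:.3f}, F(1, {df2}) = {F:.2f}, p = {p:.4g}")
```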
{"title":"Establishing meaning recall and meaning recognition vocabulary knowledge as distinct psychometric constructs in relation to reading proficiency","authors":"J. Stewart, Henrik Gyllstad, Christopher Nicklin, Stuart Mclean","doi":"10.1177/02655322231162853","DOIUrl":"https://doi.org/10.1177/02655322231162853","url":null,"abstract":"The purpose of this paper is to (a) establish whether meaning recall and meaning recognition item formats test psychometrically distinct constructs of vocabulary knowledge which measure separate skills, and, if so, (b) determine whether each construct possesses unique properties predictive of L2 reading proficiency. Factor analyses and hierarchical regression were conducted on results derived from the two vocabulary item formats in order to test this hypothesis. The results indicated that although the two-factor model had better fit and meaning recall and meaning recognition can be considered distinct psychometrically, discriminant validity between the two factors is questionable. In hierarchical regression models, meaning recognition knowledge did not make a statistically significant contribution to explaining reading proficiency over meaning recall knowledge. However, when the roles were reversed, meaning recall did make a significant contribution to the model beyond the variance explained by meaning recognition alone. The results suggest that meaning recognition does not tap into unique aspects of vocabulary knowledge and provide empirical support for meaning recall as a superior predictor of reading proficiency for research purposes.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":" ","pages":""},"PeriodicalIF":4.1,"publicationDate":"2023-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48433203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modeling local item dependence in C-tests with the loglinear Rasch model
Pub Date: 2023-04-15 | DOI: 10.1177/02655322231155109
Purya Baghaei, K. Christensen
C-tests are gap-filling tests mainly used as rough and economical measures of second-language proficiency for placement and research purposes. A C-test usually consists of several short independent passages in which the second half of every other word is deleted. Owing to their interdependent structure, C-test items violate the local independence assumption of IRT models, which poses problems for IRT analysis of C-tests. A few strategies and psychometric models have been suggested and employed in the literature to circumvent the problem. In this research, a new psychometric model, the loglinear Rasch model, is applied to C-tests and the results are compared with those of the dichotomous Rasch model, in which local item dependence is ignored. Findings showed that the loglinear Rasch model fits significantly better than the dichotomous Rasch model. Examination of the locally dependent items revealed no pattern in their content; it did, however, show that 50% of the dependent items were adjacent items. Implications of the study for modeling local dependence in C-tests using different approaches are discussed.
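The construction principle stated above, deleting the second half of every other word, can be made concrete with a small generator. This is a toy sketch only; real C-tests typically leave an initial stretch of text intact and follow additional conventions not modeled here:

```python
import re

def make_c_test(passage: str, start_word: int = 1) -> str:
    """Toy C-test generator: delete the second half of every other word.

    start_word controls where the deletion pattern begins; real C-tests
    usually leave the first sentence untouched, which is not modeled here.
    """
    words = passage.split()
    out = []
    for i, word in enumerate(words):
        # Separate trailing punctuation so it is preserved.
        m = re.match(r"^(\w+)(\W*)$", word)
        if m and i >= start_word and (i - start_word) % 2 == 0:
            core, tail = m.groups()
            keep = (len(core) + 1) // 2  # keep the first half (rounding up for odd lengths)
            out.append(core[:keep] + "_" * (len(core) - keep) + tail)
        else:
            out.append(word)
    return " ".join(out)

print(make_c_test("Language tests are used for placement and research purposes."))
```

Because the gaps within one passage draw on shared context, items built this way are locally dependent, which is exactly what motivates the loglinear Rasch model examined in the study.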
{"title":"Modeling local item dependence in C-tests with the loglinear Rasch model","authors":"Purya Baghaei, K. Christensen","doi":"10.1177/02655322231155109","DOIUrl":"https://doi.org/10.1177/02655322231155109","url":null,"abstract":"C-tests are gap-filling tests mainly used as rough and economical measures of second-language proficiency for placement and research purposes. A C-test usually consists of several short independent passages where the second half of every other word is deleted. Owing to their interdependent structure, C-test items violate the local independence assumption of IRT models. This poses some problems for IRT analysis of C-tests. A few strategies and psychometric models have been suggested and employed in the literature to circumvent the problem. In this research, a new psychometric model, namely, the loglinear Rasch model, is used for C-tests and the results are compared with the dichotomous Rasch model where local item dependence is ignored. Findings showed that the loglinear Rasch model fits significantly better than the dichotomous Rasch model. Examination of the locally dependent items did not reveal anything as regards their contents. However, it did reveal that 50% of the dependent items were adjacent items. Implications of the study for modeling local dependence in C-tests using different approaches are discussed.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"820 - 827"},"PeriodicalIF":4.1,"publicationDate":"2023-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43887673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Examining the predictive validity of the Duolingo English Test: Evidence from a major UK university
Pub Date: 2023-04-03 | DOI: 10.1177/02655322231158550
T. Isaacs, Ruolin Hu, D. Trenkic, J. Varga
The COVID-19 pandemic has changed the university admissions and proficiency testing landscape. One change has been the meteoric rise in use of the fully automated Duolingo English Test (DET) for university entrance purposes, offering test-takers a cheaper, shorter, more accessible alternative. This rapid response study is the first to investigate the predictive value of DET scores in relation to university students’ academic attainment, taking into account students’ degree level, academic discipline, and nationality. We also compared DET test-takers’ academic performance with that of students admitted using traditional proficiency tests. Credit-weighted first-year academic grades of 1881 DET test-takers (1389 postgraduate, 492 undergraduate) enrolled at a large, research-intensive London university in Autumn 2020 were positively associated with DET Overall scores for postgraduate students (adj. r = .195) but not undergraduate students (adj. r = −.112). This result was mirrored in correlational patterns for students admitted through IELTS (n = 2651) and TOEFL iBT (n = 436), contributing to criterion-related validity evidence. Students admitted with the DET showed lower academic success than the IELTS and TOEFL iBT test-takers, although sample characteristics may have shaped this finding. We discuss implications for establishing cut scores and fostering test-takers’ academic language development through pre-sessional and in-sessional support.
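The outcome variable here is a credit-weighted grade average, which is then related to admission test scores. A minimal sketch of that computation with invented records (the study's "adj. r" values involve adjustments not modeled in this sketch):

```python
import numpy as np
from scipy.stats import pearsonr

# Invented records: per-student module grades with credit weights, plus a DET Overall score.
students = [
    {"det_overall": 120, "modules": [(15, 68), (15, 72), (30, 61)]},  # (credits, grade)
    {"det_overall": 105, "modules": [(15, 55), (30, 58), (15, 62)]},
    {"det_overall": 135, "modules": [(30, 74), (15, 70), (15, 66)]},
]

def credit_weighted_mean(modules):
    credits = np.array([c for c, _ in modules], dtype=float)
    grades = np.array([g for _, g in modules], dtype=float)
    return float(np.sum(credits * grades) / np.sum(credits))

gpa = [credit_weighted_mean(s["modules"]) for s in students]
det = [s["det_overall"] for s in students]

r, p = pearsonr(det, gpa)
print(f"Credit-weighted means: {[round(g, 1) for g in gpa]}")
print(f"Correlation with DET Overall: r = {r:.3f} (p = {p:.3f})")
```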
{"title":"Examining the predictive validity of the Duolingo English Test: Evidence from a major UK university","authors":"T. Isaacs, Ruolin Hu, D. Trenkic, J. Varga","doi":"10.1177/02655322231158550","DOIUrl":"https://doi.org/10.1177/02655322231158550","url":null,"abstract":"The COVID-19 pandemic has changed the university admissions and proficiency testing landscape. One change has been the meteoric rise in use of the fully automated Duolingo English Test (DET) for university entrance purposes, offering test-takers a cheaper, shorter, accessible alternative. This rapid response study is the first to investigate the predictive value of DET test scores in relation to university students’ academic attainment, taking into account students’ degree level, academic discipline, and nationality. We also compared DET test-takers’ academic performance with that of students admitted using traditional proficiency tests. Credit-weighted first-year academic grades of 1881 DET test-takers (1389 postgraduate, 492 undergraduate) enrolled at a large, research-intensive London university in Autumn 2020 were positively associated with DET Overall scores for postgraduate students (adj. r = .195) but not undergraduate students (adj. r = −.112). This result was mirrored in correlational patterns for students admitted through IELTS (n = 2651) and TOEFL iBT (n = 436), contributing to criterion-related validity evidence. Students admitted with DET enjoyed lower academic success than the IELTS and TOEFL iBT test-takers, although sample characteristics may have shaped this finding. We discuss implications for establishing cut scores and harnessing test-takers’ academic language development through pre-sessional and in-sessional support.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"748 - 770"},"PeriodicalIF":4.1,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45194036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal fluency and floor/ceiling scoring of intermediate and advanced speech on the ACTFL Spanish Oral Proficiency Interview–computer
Pub Date: 2023-04-01 | DOI: 10.1177/02655322221114614
Troy L. Cox, Alan V. Brown, Gregory L. Thompson
The rating of proficiency tests based on the Interagency Language Roundtable (ILR) and American Council on the Teaching of Foreign Languages (ACTFL) guidelines rests on the claim that each major level reflects hierarchical linguistic functions requiring mastery of multidimensional traits, such that each level subsumes the levels beneath it. These characteristics underlie what is commonly referred to as floor and ceiling scoring. In this binary approach to scoring, which differentiates between sustained performance and linguistic breakdown, raters evaluate many features including vocabulary use, grammatical accuracy, pronunciation, and pragmatics, yet there has been very little empirical validation of the practice of floor/ceiling scoring. This study examined the relationship between temporal oral fluency, prompt type, and proficiency level in a data set of 147 Oral Proficiency Interview–computer (OPIc) exam responses rated from Intermediate Low to Advanced High. As speakers progressed in proficiency, they were more fluent. In terms of floor and ceiling scoring, the prompts that elicited speech one level above the sustained level generally resulted in speech that was slower and showed more breakdown than speech elicited by floor-level prompts, though the differences were slight and not statistically significant. Thus, temporal fluency features alone are insufficient for floor/ceiling scoring but are likely a contributing feature.
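Temporal fluency in such studies is usually operationalized through measures like speech rate, articulation rate, and pausing. A minimal sketch of these common measures for one hypothetical OPIc-style response (the specific measures and pause thresholds used in the study may differ):

```python
# Invented response data: syllable count plus silent-pause annotations (seconds).
syllables = 182
total_time = 75.0                      # total response time in seconds
pauses = [0.9, 1.4, 0.6, 2.1, 0.8]     # silent pauses above a chosen threshold (e.g., 0.25 s)

pause_time = sum(pauses)
speaking_time = total_time - pause_time

speech_rate = syllables / total_time           # syllables per second, pauses included
articulation_rate = syllables / speaking_time  # syllables per second of actual phonation
pause_ratio = pause_time / total_time          # share of the response spent pausing

print(f"speech rate:       {speech_rate:.2f} syll/s")
print(f"articulation rate: {articulation_rate:.2f} syll/s")
print(f"pause ratio:       {pause_ratio:.2%}")
```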
{"title":"Temporal fluency and floor/ceiling scoring of intermediate and advanced speech on the ACTFL Spanish Oral Proficiency Interview–computer","authors":"Troy L. Cox, Alan V. Brown, Gregory L. Thompson","doi":"10.1177/02655322221114614","DOIUrl":"https://doi.org/10.1177/02655322221114614","url":null,"abstract":"The rating of proficiency tests that use the Inter-agency Roundtable (ILR) and American Council on the Teaching of Foreign Languages (ACTFL) guidelines claims that each major level is based on hierarchal linguistic functions that require mastery of multidimensional traits in such a way that each level subsumes the levels beneath it. These characteristics are part of what is commonly referred to as floor and ceiling scoring. In this binary approach to scoring that differentiates between sustained performance and linguistic breakdown, raters evaluate many features including vocabulary use, grammatical accuracy, pronunciation, and pragmatics, yet there has been very little empirical validation on the practice of floor/ceiling scoring. This study examined the relationship between temporal oral fluency, prompt type, and proficiency level based on a data set comprised of 147 Oral Proficiency Interview - computer (OPIc) exam responses whose ratings ranged from Intermediate Low to Advanced High [AH]. As speakers progressed in proficiency, they were more fluent. In terms of floor and ceiling scoring, the prompts that elicited speech a level above the sustained level generally resulted in speech that was slower and had more breakdown than the floor-level prompts, though the differences were slight and not significantly different. Thus, temporal fluency features alone are insufficient in floor/ceiling scoring but are likely a contributing feature.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"325 - 351"},"PeriodicalIF":4.1,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47966349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The distribution of cognates and their impact on response accuracy in the EIKEN tests
Pub Date: 2023-03-26 | DOI: 10.1177/02655322231158551
David Allen, Keita Nakamura
Although there is abundant evidence for the use of first-language (L1) knowledge by bilinguals when using a second language (L2), investigation into the impact of L1 knowledge in large-scale L2 language assessments, and discussion of how such impact may be controlled, have received little attention in the language assessment literature. This study examines these issues by investigating the use of L1-Japanese loanword knowledge in test items targeting L2-English lexical knowledge in the Reading section of the EIKEN grade-level tests, which are primarily taken by Japanese learners of English. First, the proportion of English target words that have loanwords in Japanese was determined through analysis of corpus-derived wordlists, revealing that the distribution of such items is broadly similar to that in language in general. Second, the impact of loanword frequency in Japanese (and cognate status) on response accuracy was demonstrated through statistical analysis of response data for the items. Taken together, the findings highlight the scope and impact of such cognate items in large-scale language assessments. Discussion centers on how test developers can and/or should deal with the inclusion of cognate words in terms of context validity and test fairness.
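The first analytic step described above, determining what proportion of English target words have Japanese loanword counterparts, is essentially a wordlist intersection. A toy sketch with invented wordlists (real analyses work from corpus-derived target lists and loanword databases):

```python
# Invented wordlists: English target words from test items, and English words assumed to
# have established loanword counterparts in Japanese (both sets are illustrative only).
target_words = {"computer", "democracy", "harvest", "table", "justice", "energy", "shelf", "virus"}
loanword_sources = {"computer", "table", "energy", "virus", "camera", "hotel"}

cognate_items = target_words & loanword_sources
proportion = len(cognate_items) / len(target_words)

print(f"{len(cognate_items)} of {len(target_words)} target words have Japanese loanword "
      f"counterparts ({proportion:.0%}): {sorted(cognate_items)}")
```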
{"title":"The distribution of cognates and their impact on response accuracy in the EIKEN tests","authors":"David Allen, Keita Nakamura","doi":"10.1177/02655322231158551","DOIUrl":"https://doi.org/10.1177/02655322231158551","url":null,"abstract":"Although there is abundant evidence for the use of first-language (L1) knowledge by bilinguals when using a second language (L2), investigation into the impact of L1 knowledge in large-scale L2 language assessments and discussion of how such impact may be controlled has received little attention in the language assessment literature. This study examines these issues through investigating the use of L1-Japanese loanword knowledge in test items targeting L2-English lexical knowledge in the Reading section of EIKEN grade-level tests, which are primarily taken by Japanese learners of English. First, the proportion of English target words that have loanwords in Japanese was determined through analysis of corpus-derived wordlists, revealing that the distribution of such items is broadly similar to that in language in general. Second, the impact of loanword frequency in Japanese (and cognate status) was demonstrated through statistical analysis of response data for the items. Taken together, the findings highlight the scope and impact of such cognate items in large-scale language assessments. Discussion centers on how test developers can and/or should deal with the inclusion of cognate words in terms of context validity and test fairness.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"771 - 795"},"PeriodicalIF":4.1,"publicationDate":"2023-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47752709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measuring the development of general language skills in English as a foreign language—Longitudinal invariance of the C-test
Pub Date: 2023-03-25 | DOI: 10.1177/02655322231159829
Birger Schnoor, J. Hartig, Thorsten Klinger, Alexander Naumann, I. Usanova
Research on assessing English as a foreign language (EFL) development has been growing recently. However, empirical evidence from longitudinal analyses based on substantial samples is still needed. In such settings, tests for measuring language development must meet high standards of test quality such as validity, reliability, and objectivity, and must also allow for valid interpretations of change scores, which requires longitudinal measurement invariance. The current study has a methodological focus and aims to examine the measurement invariance of a C-test used to assess EFL development in monolingual and bilingual secondary school students (n = 1956) in Germany. We apply longitudinal confirmatory factor analysis to test invariance hypotheses and obtain proficiency estimates comparable over time. As a result, we achieve residual longitudinal measurement invariance. Furthermore, our analyses support the appropriateness of altering texts in a longitudinal C-test design: anchoring repeated texts between waves establishes comparability of the measurements over time, and the information from the repeated texts is used to estimate change in the test scores. If used in such a design, a C-test provides reliable, valid, and efficient measures of EFL development for bilingual and monolingual secondary school students in Germany.
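Longitudinal measurement invariance is typically evaluated by fitting a sequence of nested CFA models (configural, metric, scalar, residual) and comparing adjacent models, commonly with a chi-square difference test. A minimal sketch of that comparison step, assuming the fit statistics have already been obtained from whatever SEM software is used; the numbers below are invented:

```python
from scipy.stats import chi2

def chi_square_difference(chisq_restricted, df_restricted, chisq_free, df_free):
    """Likelihood-ratio (chi-square difference) test for nested CFA models.

    The more restricted model (e.g., metric invariance) has the larger df.
    With robust (scaled) chi-squares a scaling correction is needed; this is
    the plain, uncorrected version.
    """
    delta_chisq = chisq_restricted - chisq_free
    delta_df = df_restricted - df_free
    p = chi2.sf(delta_chisq, delta_df)
    return delta_chisq, delta_df, p

# Invented fit statistics for a metric-invariance model nested in a configural model.
d_chi, d_df, p = chi_square_difference(chisq_restricted=312.4, df_restricted=168,
                                        chisq_free=298.7, df_free=160)
print(f"Delta chi^2({d_df}) = {d_chi:.1f}, p = {p:.3f}  "
      "(non-significant -> the invariance constraints are tenable)")
```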
{"title":"Measuring the development of general language skills in English as a foreign language—Longitudinal invariance of the C-test","authors":"Birger Schnoor, J. Hartig, Thorsten Klinger, Alexander Naumann, I. Usanova","doi":"10.1177/02655322231159829","DOIUrl":"https://doi.org/10.1177/02655322231159829","url":null,"abstract":"Research on assessing English as a foreign language (EFL) development has been growing recently. However, empirical evidence from longitudinal analyses based on substantial samples is still needed. In such settings, tests for measuring language development must meet high standards of test quality such as validity, reliability, and objectivity, as well as allow for valid interpretations of change scores, requiring longitudinal measurement invariance. The current study has a methodological focus and aims to examine the measurement invariance of a C-test used to assess EFL development in monolingual and bilingual secondary school students (n = 1956) in Germany. We apply longitudinal confirmatory factor analysis to test invariance hypotheses and obtain proficiency estimates comparable over time. As a result, we achieve residual longitudinal measurement invariance. Furthermore, our analyses support the appropriateness of altering texts in a longitudinal C-test design, which allows for the anchoring of texts between waves to establish comparability of the measurements over time using the information of the repeated texts to estimate the change in the test scores. If used in such a design, a C-test provides reliable, valid, and efficient measures for EFL development in secondary education in bilingual and monolingual students in Germany.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"796 - 819"},"PeriodicalIF":4.1,"publicationDate":"2023-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48729791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Operationalizing the reading-into-writing construct in analytic rating scales: Effects of different approaches on rating
Pub Date: 2023-03-20 | DOI: 10.1177/02655322231155561
Santi B. Lestari, Tineke Brunfaut
Assessing integrated reading-into-writing task performances is known to be challenging, and analytic rating scales have been found to better facilitate the scoring of these performances than other common types of rating scales. However, little is known about how specific operationalizations of the reading-into-writing construct in analytic rating scales may affect rating quality, and by extension score inferences and uses. Using two different analytic rating scales as proxies for two approaches to reading-into-writing construct operationalization, this study investigated the extent to which these approaches affect rating reliability and consistency. Twenty raters rated a set of reading-into-writing performances twice, each time using a different analytic rating scale, and completed post-rating questionnaires. The findings resulting from our convergent explanatory mixed-method research design show that both analytic rating scales functioned well, further supporting the use of analytic rating scales for scoring reading-into-writing. Raters reported that either type of analytic rating scale prompted them to attend to the reading-related aspects of reading-into-writing, although rating these aspects remained more challenging than judging writing-related aspects. The two scales differed, however, in the extent to which they led raters to uniform interpretations of performance difficulty levels. This study has implications for reading-into-writing scale design and rater training.
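Rating reliability and consistency across the two analytic scales can be illustrated with a simple agreement computation on one rater's scores from the two rating rounds. A sketch with invented band scores; the study's actual analyses of rating quality are more comprehensive:

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Invented band scores (1-5) that one rater gave the same 12 performances under Scale A and Scale B.
scale_a = [3, 4, 2, 5, 3, 3, 4, 2, 1, 4, 5, 3]
scale_b = [3, 4, 3, 5, 3, 2, 4, 2, 2, 4, 4, 3]

# Quadratic-weighted kappa penalizes large disagreements more heavily than near-misses.
kappa = cohen_kappa_score(scale_a, scale_b, weights="quadratic")
rho, _ = spearmanr(scale_a, scale_b)

print(f"quadratic-weighted kappa = {kappa:.2f}, Spearman rho = {rho:.2f}")
```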
{"title":"Operationalizing the reading-into-writing construct in analytic rating scales: Effects of different approaches on rating","authors":"Santi B. Lestari, Tineke Brunfaut","doi":"10.1177/02655322231155561","DOIUrl":"https://doi.org/10.1177/02655322231155561","url":null,"abstract":"Assessing integrated reading-into-writing task performances is known to be challenging, and analytic rating scales have been found to better facilitate the scoring of these performances than other common types of rating scales. However, little is known about how specific operationalizations of the reading-into-writing construct in analytic rating scales may affect rating quality, and by extension score inferences and uses. Using two different analytic rating scales as proxies for two approaches to reading-into-writing construct operationalization, this study investigated the extent to which these approaches affect rating reliability and consistency. Twenty raters rated a set of reading-into-writing performances twice, each time using a different analytic rating scale, and completed post-rating questionnaires. The findings resulting from our convergent explanatory mixed-method research design show that both analytic rating scales functioned well, further supporting the use of analytic rating scales for scoring reading-into-writing. Raters reported that either type of analytic rating scale prompted them to attend to the reading-related aspects of reading-into-writing, although rating these aspects remained more challenging than judging writing-related aspects. The two scales differed, however, in the extent to which they led raters to uniform interpretations of performance difficulty levels. This study has implications for reading-into-writing scale design and rater training.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"684 - 722"},"PeriodicalIF":4.1,"publicationDate":"2023-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44759278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ukrainian language proficiency test review
Pub Date: 2023-03-14 | DOI: 10.1177/02655322231156819
Daniil M. Ozernyi, Ruslan Suvorov
The Ukrainian Language Proficiency (ULP) test, officially titled Exam of the level of mastery of the official language (Ispyt na riven’ volodinnya derzhavnoyu movoyu), is a new test launched in Summer 2021. The name of the test in Ukrainian, incidentally, does not contain the words “Ukrainian” or “foreign language.” According to the state regulations (Kabinet Ministriv Ukrayiny [KMU], 2021a; Natsional’na Komisiya zi Standartiv Derzhavnoyi Movy [NKSDM], 2021a, 2021b), the levels of mastery of Ukrainian in the test are aligned with the CEFR levels. The test was introduced as a product of the law on the official language of Ukraine, which mandated that civil servants and citizens undergoing naturalization be fully able to use Ukrainian in performing their duties. The ULP test comprises two versions: (a) ULP for acquisition of Ukrainian citizenship (Ispyt na riven’ volodinnya derzhavnoyu movoyu (dlya nabuttya hromadyanstva)), and (b) ULP 2.0 for holding civil office (Ispyt na riven’ volodinnya derzhavnoyu movoyu 2.0 (dlya vykonannya sluzhbovyh obov’yazkiv)). To differentiate between the two versions of the test in this review, we refer to the former as ULP-C and to the latter as ULP 2.0. The purpose of this review is to apply Kunnan’s (2018) fairness and justice framework to evaluate both ULP-C and ULP 2.0, since they are united by (a) the alignment with the CEFR scale, which positions ULP 2.0 as a continuation of ULP-C, (b) the same
{"title":"Ukrainian language proficiency test review","authors":"Daniil M. Ozernyi, Ruslan Suvorov","doi":"10.1177/02655322231156819","DOIUrl":"https://doi.org/10.1177/02655322231156819","url":null,"abstract":"The Ukrainian Language Proficiency (ULP) test, officially titled Exam of the level of mastery of the official language (Ispyt na riven’ volodinnya derzhavnoyu movoyu) is a new test launched in Summer 2021. The name of the test in Ukrainian, incidentally, does not contain the words “Ukrainian” or “foreign language.” According to the state regulations (Kabinet Ministriv Ukrayiny [KMU], 2021a; Natsional’na Komisiya zi Standartiv Derzhavnoyi Movy [NKSDM], 2021a, 2021b), the levels of mastery of Ukrainian in the test are aligned with the CEFR levels.1 The test was introduced as a product of the law about the official language of Ukraine, which mandated that civil servants and citizens who are being naturalized are fully able to use Ukrainian in performing their duties. The ULP test comprises two versions: (a) ULP for acquisition of Ukrainian citizenship (Ispyt na riven’ volodinnya derzhavnoyu movoyu (dlya nabuttya hromadyanstva)), and (b) ULP 2.0 for holding civil office (Ispyt na riven’ volodinnya derzhavnoyu movoyu 2.0 (dlya vykonannya sluzhbovyh obov’yazkiv)). To differentiate between the two versions of the test in this review, we will refer to the former version as ULP-C and to the latter version as ULP 2.0. The purpose of this review is to apply Kunnan’s (2018) fairness and justice framework to evaluate both ULP-C and ULP 2.0 since they are united by (a) the alignment with the CEFR scale which poses ULP 2.0 as a continuation of ULP-C, (b) the same","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"40 1","pages":"828 - 839"},"PeriodicalIF":4.1,"publicationDate":"2023-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43284530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}