A scoping review of research on second language test preparation
Shanshan He, Anne-Marie Sénécal, Laura Stansfield, Ruslan Suvorov
Pub Date: 2024-05-31 | DOI: 10.1177/02655322241249754
Test preparation has garnered considerable attention in second language (L2) education due to the significant implications that successful performance on a language test may have for academic advancement, future career opportunities, and immigration prospects. Meanwhile, an overemphasis on test preparation has been criticized for encouraging the cultivation of construct-irrelevant test-taking strategies at the expense of developing general language proficiency. To systematically explore how test preparation has been investigated in the literature, we conducted a scoping review of 66 studies on L2 test preparation. Specifically, this study examined the key characteristics of publications on test preparation, the main themes explored, the study and participant characteristics, as well as the essential aspects of their research methodologies. The results of this review revealed various trends in the literature on L2 test preparation, such as the exclusive focus on English as the target language, the lack of diversity in stakeholders as participants, the dominance of international language tests, and the paucity of experimental studies that utilize advanced statistical techniques. In addition to interpreting the results of our analysis, we discuss the implications of this scoping review and outline several directions for future research on test preparation.
The moderating role of L2 proficiency in the predictive power of L1 fluency on L2 utterance fluency
Shungo Suzuki, Judit Kormos
Pub Date: 2024-04-17 | DOI: 10.1177/02655322241241851
The current study examined the extent to which first language (L1) utterance fluency measures can predict second language (L2) fluency and how L2 proficiency moderates the relationship between L1 and L2 fluency. A total of 104 Japanese-speaking learners of English completed different argumentative speech tasks in their L1 and L2. Their speaking performance was analysed using measures of speed, breakdown, and repair fluency. L2 proficiency was operationalised as cognitive fluency. Two factor scores of cognitive fluency—linguistic resources and processing speed—were computed based on performance in a set of linguistic knowledge tests capturing vocabulary knowledge, morphosyntactic processing, and articulatory skills. A series of generalised linear mixed-effects models revealed small-to-moderate effect sizes for the predictive power of L1 utterance fluency measures on their L2 counterparts. Moderator effects of L2 proficiency were found only in speed fluency measures. The relationship between L1 and L2 speed fluency was weaker for L2 learners with wider L2 linguistic resources. Conversely, for those with faster L2 processing speed, the L1-L2 link tended to be stronger. These findings indicate that the L1-L2 fluency link is subject to the complex interplay of phonological differences between learners’ L1 and L2 and their L2 proficiency, offering implications for diagnostic speaking assessment.
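The moderation analysis described in this abstract can be sketched with a linear mixed-effects model in which an interaction term carries the moderating role of proficiency. This is an illustrative reconstruction on simulated data, not the authors' code; all variable names (`l1_speed`, `proficiency`, `l2_speed`) and effect sizes are invented.

```python
# Hypothetical sketch: does proficiency moderate the L1 -> L2 fluency link?
# Random intercept per participant; the interaction term is the moderation test.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_participants, n_tasks = 104, 3
pid = np.repeat(np.arange(n_participants), n_tasks)
l1_speed = rng.normal(3.5, 0.6, size=pid.size)               # L1 syllables/sec
proficiency = np.repeat(rng.normal(0, 1, n_participants), n_tasks)

# Simulated moderation: the L1 slope shrinks as proficiency rises
l2_speed = (1.0 + 0.5 * l1_speed - 0.15 * l1_speed * proficiency
            + rng.normal(0, 0.3, size=pid.size))

df = pd.DataFrame({"pid": pid, "l1_speed": l1_speed,
                   "proficiency": proficiency, "l2_speed": l2_speed})

model = smf.mixedlm("l2_speed ~ l1_speed * proficiency", df, groups=df["pid"])
result = model.fit()
# A negative interaction coefficient means a weaker L1-L2 link
# at higher proficiency, mirroring the linguistic-resources finding.
print(result.params["l1_speed:proficiency"])
```

The sign and size of the `l1_speed:proficiency` coefficient is what distinguishes the two patterns the abstract reports (weaker link with wider linguistic resources, stronger link with faster processing speed).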
The effect of viewing visual cues in a listening comprehension test on second language learners’ test-taking process and performance: An eye-tracking study
Suh Keong Kwon, Guoxing Yu
Pub Date: 2024-04-17 | DOI: 10.1177/02655322241239356
In this study, we examined the effect of visual cues in a second language listening test on test takers’ viewing behaviours and their test performance. Fifty-seven learners of English in Korea took a video-based listening test, with their eye movements recorded, and 23 of them were interviewed individually after the test. The participants viewed the visual cues longer than the items in the multiple-choice questions. Looking at the correct answer choice was related to a higher test score, whereas looking at the speaker(s) in the video and at the distractors of the test items was related to a lower test score. Viewing the PowerPoint slides showed mixed effects on test performance, depending on different eye-movement measures. Stimulated-recall interviews shed further light on the possible reasons for the different patterns of the participants’ eye movements. Overall, the participants held the positive view that the visual cues aided them in comprehending the aural input and in completing the listening tasks more successfully. We discuss these findings in relation to the authenticity of tasks and the construct relevance of video-based listening tests.
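The eye-movement measures behind findings like these are typically aggregates of fixations over areas of interest (AOIs). A minimal sketch of one such measure, total dwell time and its proportion per AOI; the AOI names and fixation durations here are invented, not the study's data.

```python
# Hypothetical fixation log: (AOI, fixation duration in ms).
# AOIs loosely follow the abstract: speakers, slides, question stem, options.
fixations = [
    ("speaker", 220), ("slides", 410), ("question_stem", 300),
    ("correct_option", 180), ("distractor_b", 150), ("slides", 260),
    ("correct_option", 240),
]

# Total dwell time per AOI = sum of fixation durations landing on it
dwell = {}
for aoi, dur in fixations:
    dwell[aoi] = dwell.get(aoi, 0) + dur

# Proportional dwell time normalises for trial length
total = sum(dwell.values())
proportions = {aoi: d / total for aoi, d in dwell.items()}
print(dwell["slides"], round(proportions["correct_option"], 3))
```

Measures such as these (alongside fixation counts and time to first fixation) are what would then be correlated with item-level scores.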
Book review: From assessment to feedback by Inez De Florio
Salomé Villa Larenas
Pub Date: 2024-04-17 | DOI: 10.1177/02655322241246574
Developing internet-based Tests of Aptitude for Language Learning (TALL): An open research endeavour
Junlan Pan, Emma Marsden
Pub Date: 2024-04-16 | DOI: 10.1177/02655322241241849
Tests of Aptitude for Language Learning (TALL) is an openly accessible internet-based battery to measure the multifaceted construct of foreign language aptitude, using language domain–specific instruments and L1-sensitive instructions and stimuli. This brief report introduces the components of this theory-informed battery and methodological considerations for developing it into an open research instrument. It also presents the preliminary results from the initial validation of TALL carried out on data collected from Chinese L1 participants (n = 165) from a university setting who took two rounds of tests (with counterbalanced test items) with a minimum 30-day interval. The results of data analyses at subtest, item, and battery levels suggest that, in general, TALL has satisfactory reliability and can be used to measure aptitude conceptualized in the theoretical frameworks on which it has been developed. This report also highlights the value of TALL as a convenient data collection tool openly accessible to any researcher for free, its potential for facilitating an open data pool for high-quality syntheses of aptitude-related research findings, and its implications for Open Research practices in testing language-related constructs.
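One standard internal-consistency check behind a claim of "satisfactory reliability" is Cronbach's alpha over item scores. The sketch below computes it on simulated dichotomous data; it is a generic illustration, not TALL's actual analysis or data.

```python
# Cronbach's alpha on a simulated 165 x 20 matrix of dichotomous item scores.
# Items load on a common latent ability, so alpha should be high.
import numpy as np

rng = np.random.default_rng(0)
n_testtakers, n_items = 165, 20
ability = rng.normal(0, 1, (n_testtakers, 1))
items = (ability + rng.normal(0, 1, (n_testtakers, n_items)) > 0).astype(float)

# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
k = n_items
item_var_sum = items.var(axis=0, ddof=1).sum()
total_var = items.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_var_sum / total_var)
print(round(alpha, 2))
```

Analyses "at subtest, item, and battery levels", as the abstract puts it, would repeat this kind of computation per subtest and complement it with item-level statistics such as discrimination indices.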
All types of experience are equal, but some are more equal: The effect of different types of experience on rater severity and rater consistency
Reeta Neittaanmäki, Iasonas Lamprianou
Pub Date: 2024-04-10 | DOI: 10.1177/02655322241239362
This article focuses on rater severity and consistency and their relation to different types of rater experience over a long period of time. The article is based on longitudinal data collected from 2009 to 2019 from the second language Finnish speaking subtest in the National Certificates of Language Proficiency in Finland. The study investigated whether rater severity and consistency are affected differently by different types of rater experience and by skipping rating sessions. The data consisted of 45 rating sessions with 104 raters and 59,899 examinees and were analyzed using the Many-Facets Rasch model and generalized linear mixed models. The results showed that when the raters gained more rating experience, they became slightly more lenient, but different types of experience had quantitatively different magnitudes of impact. In addition, skipping rating sessions, and in that way disconnecting from the rater community, increased the likelihood that a rater would be inconsistent. Finally, we provide methodological recommendations for future research and consider implications for practice.
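The Many-Facets Rasch model used here extends the Rasch model with a rater-severity facet. In its common rating-scale form (a standard formulation, not reproduced from this article), the model is:

```latex
% Many-facets Rasch model, rating-scale form:
% log-odds that examinee n, rated on item i by rater j,
% receives category k rather than category k-1
\log\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
```

where \(B_n\) is the examinee's ability, \(D_i\) the item's difficulty, \(C_j\) the rater's severity, and \(F_k\) the step difficulty of category \(k\). Estimating \(C_j\) per session is what lets the authors track severity drift across the 2009-2019 data.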
Communal factors in rater severity and consistency over time in high-stakes oral assessment
Reeta Neittaanmäki, Iasonas Lamprianou
Pub Date: 2024-04-10 | DOI: 10.1177/02655322241239363
This article focuses on rater severity and consistency and their relation to major changes in the rating system in a high-stakes testing context. The study is based on longitudinal data collected from 2009 to 2019 from the second language (L2) Finnish-speaking subtest in the National Certificates of Language Proficiency in Finland. We investigated whether rater severity and consistency changed over that period and whether the changes could be explained by major changes in the rating system, such as the change of lead examiner, the modus of rating and training (on-site or remote), and the composition of the rater group. The data consisted of 45 rating sessions with 104 raters and 59,899 examinees and were analysed using the Many-Facets Rasch model and generalized linear mixed models. The analyses indicated that raters as a group became somewhat more lenient over time. In addition, the results showed that the rater community and its practices, the lead examiners, and the modus of rating and training can influence the rating behaviour. Finally, we elaborate on implications for both research and practice.
Test score comparison tables: How well are they serving test users?
U. Knoch, Jason Fan
Pub Date: 2024-03-26 | DOI: 10.1177/02655322241239348
While several test concordance tables have been published, the research underpinning such tables has rarely been examined in detail. This study aimed to survey the publicly available studies or documentation underpinning the test concordance tables of the providers of four major international language tests, all accepted by the Australian Department of Home Affairs for Australian visa purposes. To evaluate the concordance studies, we first identified the good practice principles in concordance research through a review of both the relevant literature and leading professional standards in the field of educational measurement and language assessment. Next, we reviewed the concordance studies against the identified good practice principles. Our findings revealed that the information supplied by test providers varied, with some making the full research papers available while others provided little information about their underpinning research. None of the concordance studies fulfilled all the good practice principles. Based on the findings of this study, we offer recommendations for future concordance research in the field of language testing as well as suggestions for practice.
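A common statistical basis for the concordance tables discussed above is equipercentile linking: a score on test A is mapped to the test-B score with the same percentile rank. The sketch below illustrates the idea on invented score distributions; real concordance studies use matched or randomly equivalent samples and smoothing, which this toy version omits.

```python
# Equipercentile linking sketch on simulated score samples from two tests.
# Distributions, sample sizes, and scales are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
scores_a = rng.normal(60, 10, 5000).clip(0, 100)  # test A: 0-100 scale
scores_b = rng.normal(6.5, 0.8, 5000).clip(1, 9)  # test B: 1-9 band scale

def concord(score_a):
    # Percentile rank of score_a in test A's distribution...
    pr = (scores_a < score_a).mean()
    # ...mapped to the test-B score at the same percentile rank
    return float(np.quantile(scores_b, pr))

print(round(concord(60.0), 2))  # near the middle of test B's scale
```

A full table is just this mapping evaluated at every reportable test-A score, which is why the quality of the underlying samples and design, the focus of the review, matters so much.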
Evaluating methodological enhancements to the Yes/No Angoff standard-setting method in language proficiency assessment
Tia M. Fechter, Heeyeon Yoon
Pub Date: 2024-02-12 | DOI: 10.1177/02655322231222600
This study evaluated the efficacy of two proposed methods in an operational standard-setting study conducted for a high-stakes language proficiency test of the U.S. government. The goal was to seek low-cost modifications to the existing Yes/No Angoff method to increase the validity and reliability of the recommended cut scores using a convergent mixed-methods study design. The study used the Yes/No ratings as the baseline method in two rounds of ratings, while differentiating the two methods by incorporating item maps and an Ordered Item Booklet, each of which is an integral tool of the Mapmark and the Bookmark methods. The results showed that the internal validity evidence is similar across both methods, especially after Round 2 ratings. When procedural validity evidence was considered, however, a preference emerged for the method where panelists conducted the initial ratings without access to the empirical item difficulty information, and then such information was provided on an item map as part of the Round 1 feedback. The findings highlight the importance of evaluating both internal and procedural validity evidence when considering standard-setting methods.
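The arithmetic behind the Yes/No Angoff baseline method is straightforward: each panelist judges, item by item, whether a minimally competent candidate would answer correctly; a panelist's implied cut score is their count of "yes" judgments, and the panel-recommended cut score is typically the mean across panelists. The panel data below are invented for illustration.

```python
# Yes/No Angoff cut-score computation on a hypothetical 10-item test.
# 1 = "yes, a minimally competent candidate would answer this item correctly".
ratings = {
    "P1": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "P2": [1, 0, 0, 1, 1, 1, 0, 0, 1, 0],
    "P3": [1, 1, 1, 1, 0, 1, 1, 0, 1, 0],
}

# Each panelist's cut score is their number of "yes" ratings
panelist_cuts = {p: sum(r) for p, r in ratings.items()}

# The panel cut score is the mean across panelists
panel_cut = sum(panelist_cuts.values()) / len(panelist_cuts)
print(panelist_cuts, panel_cut)
```

The study's two enhancements change not this computation but the information given to panelists between rounds (an item map ordered by empirical difficulty, or an Ordered Item Booklet), which is why the procedural evidence, how panelists reached their ratings, differentiated the methods.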