Making each point count: Revising a local adaptation of the Jacobs et al. (1981) ESL COMPOSITION PROFILE rubric
Pub Date: 2023-12-30 DOI: 10.1177/02655322231217979
Yu-Tzu Chang, Ann Tai Choe, Daniel Holden, Daniel R. Isbell
In this Brief Report, we describe an evaluation of and revisions to a rubric adapted from the Jacobs et al. (1981) ESL COMPOSITION PROFILE, with four rubric categories and 20-point rating scales, in the context of an intensive English program writing placement test. Analysis of 4 years of rating data (2016–2021, including 434 essays) using many-facet Rasch measurement demonstrated that the 20-point rating scales of the Jacobs et al. rubric functioned poorly due to (a) questionably small distinctions in writing quality between successive score categories and (b) the presence of several disordered categories. We reanalyzed the score data after collapsing the 20-point scales into 4-point scales to simulate a revision to the rubric. This reanalysis appeared promising, with well-ordered and distinct score categories, and only a trivial decrease in person separation reliability. After implementing this revision to the rubric, we examined data from recent administrations (2022–2023, including 93 essays) to evaluate scale functioning. As in the simulation, scale categories were well-ordered and distinct in operational rating. Moreover, no raters demonstrated exceedingly poor fit using the revised rubric. Findings hold implications for other programs adopting/adapting the PROFILE or a similar rubric.
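For readers interested in trying a similar reanalysis, the sketch below shows one way to recode 20-point analytic ratings into 4-point bands before re-estimating a many-facet Rasch model. It is a minimal Python illustration only: the file name, the rubric category names, and the even five-points-per-band split are assumptions made here, not the collapsing rules used in the study.

```python
# Minimal sketch: collapse 20-point analytic ratings into 4-point bands
# before re-running a many-facet Rasch analysis. Band boundaries are an
# illustrative even split, not the study's actual recoding scheme.
import pandas as pd

def collapse_to_4(score_20: int) -> int:
    """Map a 1-20 rating onto a 1-4 band (five original points per band)."""
    return min((score_20 - 1) // 5 + 1, 4)

ratings = pd.read_csv("placement_ratings.csv")  # hypothetical file of analytic scores
for category in ["content", "organization", "language", "mechanics"]:  # assumed category names
    ratings[f"{category}_4pt"] = ratings[category].apply(collapse_to_4)

# The recoded columns can then be exported for re-estimation in an MFRM
# program such as Facets, or analyzed with an R/Python Rasch package.
ratings.to_csv("placement_ratings_4pt.csv", index=False)
```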
{"title":"Making each point count: Revising a local adaptation of the Jacobs et al. (1981) ESL COMPOSITION PROFILE rubric","authors":"Yu-Tzu Chang, Ann Tai Choe, Daniel Holden, Daniel R. Isbell","doi":"10.1177/02655322231217979","DOIUrl":"https://doi.org/10.1177/02655322231217979","url":null,"abstract":"In this Brief Report, we describe an evaluation of and revisions to a rubric adapted from the Jacobs et al. (1981) ESL COMPOSITION PROFILE, with four rubric categories and 20-point rating scales, in the context of an intensive English program writing placement test. Analysis of 4 years of rating data (2016–2021, including 434 essays) using many-facet Rasch measurement demonstrated that the 20-point rating scales of the Jacobs et al. rubric functioned poorly due to (a) questionably small distinctions in writing quality between successive score categories and (b) the presence of several disordered categories. We reanalyzed the score data after collapsing the 20-point scales into 4-point scales to simulate a revision to the rubric. This reanalysis appeared promising, with well-ordered and distinct score categories, and only a trivial decrease in person separation reliability. After implementing this revision to the rubric, we examined data from recent administrations (2022–2023, including 93 essays) to evaluate scale functioning. As in the simulation, scale categories were well-ordered and distinct in operational rating. Moreover, no raters demonstrated exceedingly poor fit using the revised rubric. Findings hold implications for other programs adopting/adapting the PROFILE or a similar rubric.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":" 2","pages":""},"PeriodicalIF":4.1,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139140765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparing two formats of data-driven rating scales for classroom assessment of pragmatic performance with roleplays
Pub Date: 2023-11-29 DOI: 10.1177/02655322231210217
Yunwen Su, Sun-Young Shin
Rating scales that language testers design should be tailored to the specific test purpose and score use, and should reflect the target construct. Researchers have long argued for the value of data-driven scales in classroom performance assessment because such scales are specific to pedagogical tasks and objectives, carry rich descriptors that offer useful diagnostic information, and exhibit robust content representativeness and stable measurement properties. This sequential mixed-methods study compares two multi-criteria, data-driven rating scales for pragmatic performance that differ in format. Both were developed from roleplays performed by 43 second-language learners of Mandarin: the hierarchical-binary (HB) scale was built through close analysis of the performance data, and the multi-trait (MT) scale was derived from the HB scale, retaining the same criteria but in an analytic format. Results revealed an influence of format, albeit a limited one: the MT scale showed a marginal advantage over the HB scale in overall reliability, practicality, and discriminatory power, though the measurement properties of the two scales were largely comparable. All raters were positive about the pedagogical value of both scales. Rater perceptions of the ease of use and effectiveness of the two scales provide further insight into scale functioning.
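As a rough illustration of how the overall reliability of two scale formats might be compared, the sketch below computes Cronbach's alpha for each format from a persons-by-raters (or persons-by-criteria) score matrix. The file names and array layout are assumptions; it does not reproduce the study's own analyses (e.g., the MFRM-based comparisons).

```python
# Minimal sketch, under assumed data shapes: compare the internal-consistency
# reliability of two rating-scale formats applied to the same examinees.
# Rows = examinees, columns = raters (or criteria); no header row assumed.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a 2-D score array (rows = persons, cols = items)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

hb_scores = np.loadtxt("hb_ratings.csv", delimiter=",")  # hierarchical-binary format (hypothetical file)
mt_scores = np.loadtxt("mt_ratings.csv", delimiter=",")  # multi-trait format (hypothetical file)

print(f"HB alpha: {cronbach_alpha(hb_scores):.3f}")
print(f"MT alpha: {cronbach_alpha(mt_scores):.3f}")
```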
{"title":"Comparing two formats of data-driven rating scales for classroom assessment of pragmatic performance with roleplays","authors":"Yunwen Su, Sun-Young Shin","doi":"10.1177/02655322231210217","DOIUrl":"https://doi.org/10.1177/02655322231210217","url":null,"abstract":"Rating scales that language testers design should be tailored to the specific test purpose and score use as well as reflect the target construct. Researchers have long argued for the value of data-driven scales for classroom performance assessment, because they are specific to pedagogical tasks and objectives, have rich descriptors to offer useful diagnostic information, and exhibit robust content representativeness and stable measurement properties. This sequential mixed methods study compares two data-driven rating scales with multiple criteria that use different formats for pragmatic performance. They were developed using roleplays performed by 43 second-language learners of Mandarin—the hierarchical-binary (HB) scale, developed through close analysis of performance data, and the multi-trait (MT) scale derived from the HB, which has the same criteria but takes the format of an analytic scale. Results revealed the influence of format, albeit to a limited extent: MT showed a marginal advantage over HB in terms of overall reliability, practicality, and discriminatory power, though measurement properties of the two scales were largely comparable. All raters were positive about the pedagogical value of both scales. This study reveals that rater perceptions of the ease of use and effectiveness of both scales provide further insights into scale functioning.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"52 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139210387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Triangulating NLP-based analysis of rater comments and MFRM: An innovative approach to investigating raters’ application of rating scales in writing assessment
Pub Date: 2023-11-29 DOI: 10.1177/02655322231210231
Huiying Cai, Xun Yan
Rater comments tend to be qualitatively analyzed to indicate raters’ application of rating scales. This study applied natural language processing (NLP) techniques to quantify meaningful, behavioral information from a corpus of rater comments and triangulated that information with a many-facet Rasch measurement (MFRM) analysis of rater scores. The data consisted of ratings on 987 essays by 36 raters (a total of 3948 analytic scores and 1974 rater comments) on a post-admission English Placement Test (EPT) at a large US university. We computed a set of comment-based features based on the analytic components and evaluative language the raters used to infer whether raters were aligned to the scale. For data triangulation, we performed correlation analyses between the MFRM measures of rater performance and the comment-based measures. Although the EPT raters showed overall satisfactory performance, we found meaningful associations between rater comments and performance features. In particular, raters with higher precision and fit to what the Rasch model predicts used more analytic components and used evaluative language more similar to the scale descriptors. These findings suggest that NLP techniques have the potential to help language testers analyze rater comments and understand rater behavior.
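The sketch below illustrates, under assumed inputs, the general triangulation idea: derive simple comment-based features with off-the-shelf NLP tools and correlate them with rater measures from an MFRM analysis. The file names, column names, component terms, and feature definitions are hypothetical and are not the study's actual feature set.

```python
# Minimal sketch: quantify rater comments (similarity to scale descriptors,
# number of analytic components mentioned) and correlate the features with
# Rasch-based rater fit. All file/column names are illustrative assumptions.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

comments = pd.read_csv("rater_comments.csv")        # columns: rater_id, comment
rasch = pd.read_csv("facets_rater_measures.csv")    # columns: rater_id, infit_msq
scale_descriptor = open("scale_descriptors.txt").read()  # rubric band descriptors

# Feature 1: how similar is each rater's pooled comment text to the scale wording?
pooled = comments.groupby("rater_id")["comment"].apply(" ".join).reset_index()
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(pooled["comment"].tolist() + [scale_descriptor])
pooled["descriptor_similarity"] = cosine_similarity(tfidf[:-1], tfidf[-1]).ravel()

# Feature 2: how many analytic components does each rater mention, on average?
components = ["content", "organization", "vocabulary", "grammar"]  # assumed component terms
comments["n_components"] = comments["comment"].str.lower().apply(
    lambda text: sum(term in text for term in components)
)
pooled = pooled.merge(
    comments.groupby("rater_id")["n_components"].mean().reset_index(), on="rater_id"
)

# Triangulate: correlate comment-based features with Rasch-based rater fit.
merged = pooled.merge(rasch, on="rater_id")
for feature in ["descriptor_similarity", "n_components"]:
    r, p = pearsonr(merged[feature], merged["infit_msq"])
    print(f"{feature}: r = {r:.2f}, p = {p:.3f}")
```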
{"title":"Triangulating NLP-based analysis of rater comments and MFRM: An innovative approach to investigating raters’ application of rating scales in writing assessment","authors":"Huiying Cai, Xun Yan","doi":"10.1177/02655322231210231","DOIUrl":"https://doi.org/10.1177/02655322231210231","url":null,"abstract":"Rater comments tend to be qualitatively analyzed to indicate raters’ application of rating scales. This study applied natural language processing (NLP) techniques to quantify meaningful, behavioral information from a corpus of rater comments and triangulated that information with a many-facet Rasch measurement (MFRM) analysis of rater scores. The data consisted of ratings on 987 essays by 36 raters (a total of 3948 analytic scores and 1974 rater comments) on a post-admission English Placement Test (EPT) at a large US university. We computed a set of comment-based features based on the analytic components and evaluative language the raters used to infer whether raters were aligned to the scale. For data triangulation, we performed correlation analyses between the MFRM measures of rater performance and the comment-based measures. Although the EPT raters showed overall satisfactory performance, we found meaningful associations between rater comments and performance features. In particular, raters with higher precision and fit to what the Rasch model predicts used more analytic components and used evaluative language more similar to the scale descriptors. These findings suggest that NLP techniques have the potential to help language testers analyze rater comments and understand rater behavior.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"2 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139212101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Argument-based validation of Academic Collocation Tests
Pub Date: 2023-10-21 DOI: 10.1177/02655322231198499
Thi My Hang Nguyen, Peter Gu, Averil Coxhead
Despite extensive research on assessing collocational knowledge, valid measures of academic collocations remain elusive. With the present study, we begin an argument-based approach to validate two Academic Collocation Tests (ACTs) that assess the ability to recognize and produce academic collocations (i.e., two-word units such as key element and well established) in written contexts. A total of 343 tertiary students completed a background questionnaire (including demographic information, IELTS scores, and learning experience), the ACTs, and the Vocabulary Size Test. Forty-four participants also took part in post-test interviews to share reflections on the tests and retook the ACTs verbally. The findings showed that the scoring inference based on analyses of test item characteristics, testing conditions, and scoring procedures was partially supported. The generalization inference, based on the consistency of item measures and testing occasions, was justified. The extrapolation inference, drawn from correlations with other measures and factors such as collocation frequency and learning experience, received partial support. Suggestions for increasing the degree of support for the inferences are discussed. The present study reinforces the value of validation research and generates the momentum for test developers to continue this practice with other vocabulary tests.
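As one concrete example of the item-characteristic evidence that can feed a scoring inference, the sketch below computes item facility and corrected item-total (point-biserial) discrimination from a dichotomously scored response matrix. The file name and 0/1 data layout are assumptions for illustration, not the authors' materials or procedure.

```python
# Minimal sketch: classical item analysis on an assumed 0/1 response matrix
# (rows = test takers, columns = items), the kind of evidence often cited
# when evaluating a scoring inference.
import numpy as np
import pandas as pd

responses = pd.read_csv("act_item_responses.csv")  # hypothetical dichotomous item data

facility = responses.mean()                         # proportion correct per item
total = responses.sum(axis=1)
discrimination = {
    item: np.corrcoef(responses[item], total - responses[item])[0, 1]  # corrected item-total r
    for item in responses.columns
}

item_stats = pd.DataFrame({"facility": facility, "discrimination": pd.Series(discrimination)})
print(item_stats.round(2))
```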
{"title":"Argument-based validation of Academic Collocation Tests","authors":"Thi My Hang Nguyen, Peter Gu, Averil Coxhead","doi":"10.1177/02655322231198499","DOIUrl":"https://doi.org/10.1177/02655322231198499","url":null,"abstract":"Despite extensive research on assessing collocational knowledge, valid measures of academic collocations remain elusive. With the present study, we begin an argument-based approach to validate two Academic Collocation Tests (ACTs) that assess the ability to recognize and produce academic collocations (i.e., two-word units such as key element and well established) in written contexts. A total of 343 tertiary students completed a background questionnaire (including demographic information, IELTS scores, and learning experience), the ACTs, and the Vocabulary Size Test. Forty-four participants also took part in post-test interviews to share reflections on the tests and retook the ACTs verbally. The findings showed that the scoring inference based on analyses of test item characteristics, testing conditions, and scoring procedures was partially supported. The generalization inference, based on the consistency of item measures and testing occasions, was justified. The extrapolation inference, drawn from correlations with other measures and factors such as collocation frequency and learning experience, received partial support. Suggestions for increasing the degree of support for the inferences are discussed. The present study reinforces the value of validation research and generates the momentum for test developers to continue this practice with other vocabulary tests.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135512996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting raters’ accent familiarity in speaking tests: Evidence that presentation mode interacts with accent familiarity to variably affect comprehensibility ratings
Pub Date: 2023-10-14 DOI: 10.1177/02655322231200808
Michael D. Carey, Stefan Szocs
This controlled experimental study investigated the interaction of variables associated with rating the pronunciation component of high-stakes English-language-speaking tests such as IELTS and TOEFL iBT. One hundred experienced raters who were all either familiar or unfamiliar with Brazilian-accented English or Papua New Guinean Tok Pisin-accented English, respectively, were presented with speech samples in audio-only or audio-visual mode. Two-way ordinal regression with post hoc pairwise comparisons found that the presentation mode interacted significantly with accent familiarity to increase comprehensibility ratings (χ² = 88.005, df = 3, p < .0001), with presentation mode having a stronger effect in the interaction than accent familiarity (χ² = 59.328, df = 1, p < .0001). Based on odds ratios, raters were significantly more likely to score comprehensibility higher when the presentation mode was audio-visual (compared to audio-only) for both the unfamiliar (91% more likely) and familiar speakers (92.3% more likely). The results suggest that semi-direct speaking tests using audio-only or audio-visual modes of presentation should be evaluated through research to ascertain how accent familiarity and presentation mode interact to variably affect comprehensibility ratings. Such research may be beneficial to investigate the virtual modes of speaking test delivery that have emerged post-COVID-19.
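The sketch below shows, under assumed data and variable names, how a proportional-odds (ordinal logistic) model with a presentation-mode by accent-familiarity interaction can be fit and its coefficients expressed as odds ratios. It is not the authors' exact two-way ordinal regression or post hoc pairwise procedure.

```python
# Minimal sketch: ordinal logistic regression of comprehensibility ratings on
# presentation mode, accent familiarity, and their interaction, with odds
# ratios. Column names and the input file are hypothetical placeholders.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

ratings = pd.read_csv("ratings_long.csv")  # hypothetical long-format file: one row per rating

# Dummy-code the two binary factors and their interaction. No intercept column:
# OrderedModel estimates category thresholds instead of a constant.
X = pd.DataFrame({
    "audio_visual": (ratings["mode"] == "audio_visual").astype(int),
    "familiar": (ratings["familiar"] == "yes").astype(int),
})
X["av_x_familiar"] = X["audio_visual"] * X["familiar"]

model = OrderedModel(ratings["comprehensibility"], X, distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())

# Odds ratios for the predictors (excluding the estimated threshold parameters).
odds_ratios = np.exp(result.params[: X.shape[1]])
print(odds_ratios)
```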
{"title":"Revisiting raters’ accent familiarity in speaking tests: Evidence that presentation mode interacts with accent familiarity to variably affect comprehensibility ratings","authors":"Michael D. Carey, Stefan Szocs","doi":"10.1177/02655322231200808","DOIUrl":"https://doi.org/10.1177/02655322231200808","url":null,"abstract":"This controlled experimental study investigated the interaction of variables associated with rating the pronunciation component of high-stakes English-language-speaking tests such as IELTS and TOEFL iBT. One hundred experienced raters who were all either familiar or unfamiliar with Brazilian-accented English or Papua New Guinean Tok Pisin-accented English, respectively, were presented with speech samples in audio-only or audio-visual mode. Two-way ordinal regression with post hoc pairwise comparisons found that the presentation mode interacted significantly with accent familiarity to increase comprehensibility ratings (χ² = 88.005, df = 3, p < .0001), with presentation mode having a stronger effect in the interaction than accent familiarity (χ² = 59.328, df = 1, p < .0001). Based on odds ratios, raters were significantly more likely to score comprehensibility higher when the presentation mode was audio-visual (compared to audio-only) for both the unfamiliar (91% more likely) and familiar speakers (92.3% more likely). The results suggest that semi-direct speaking tests using audio-only or audio-visual modes of presentation should be evaluated through research to ascertain how accent familiarity and presentation mode interact to variably affect comprehensibility ratings. Such research may be beneficial to investigate the virtual modes of speaking test delivery that have emerged post-COVID-19.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135804141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Our validity looks like justice. Does yours?
Pub Date: 2023-10-07 DOI: 10.1177/02655322231202947
Jennifer Randall, Mya Poe, David Slomp, Maria Elena Oliveri
Educational assessments, from kindergarten to 12th grade (K-12) to licensure, have a long, well-documented history of oppression and marginalization. In this paper, we (the authors) ask the field of educational assessment/measurement to actively disrupt the White supremacist and racist logics that fuel this marginalization and re-orient itself toward assessment justice. We describe how a justice-oriented, antiracist validity (JAV) approach to validation processes can support assessment justice efforts, specifically with respect to language assessment. Relying on antiracist principles and critical quantitative methodologies, a JAV approach proposes a set of critical questions to consider when gathering validity evidence, with potential utility for language testers.
{"title":"Our validity looks like justice. Does yours?","authors":"Jennifer Randall, Mya Poe, David Slomp, Maria Elena Oliveri","doi":"10.1177/02655322231202947","DOIUrl":"https://doi.org/10.1177/02655322231202947","url":null,"abstract":"Educational assessments, from kindergarden to 12th grade (K-12) to licensure, have a long, well-documented history of oppression and marginalization. In this paper, we (the authors) ask the field of educational assessment/measurement to actively disrupt the White supremacist and racist logics that fuel this marginalization and re-orient itself toward assessment justice. We describe how a justice-oriented, antiracist validity (JAV) approach to validation processes can support assessment justice efforts, specifically with respect to language assessment. Relying on antiracist principles and critical quantitative methodologies, a JAV approach proposes a set of critical questions to consider when gathering validity evidence, with potential utility for language testers.","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"298 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135254616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Language assessment accommodations: Issues and challenges for the future
Pub Date: 2023-10-01 DOI: 10.1177/02655322231186222
Lynda Taylor, Jayanti Banerjee
Several papers reference the concept of universal design as the preferred theoretical foundation for language test design and development (Christensen et al., 2023; Kim et al., 2023; Guzman-Orth et al., 2023). This approach, originally derived from the field of architecture in the United States (Case, 2008), proposes a set of principles whereby assessments are intentionally and proactively designed, from the earliest stage of construction, to be maximally accessible to all users, regardless of any special needs they may have (cf. the planning and design of public buildings to be disabled-friendly). In the context of language test design and construction, this typically entails giving all test takers access to a broad range of universal but optional tools (e.g., magnifier, colour overlay) to enhance test accessibility. Recent technological developments for text …
{"title":"Language assessment accommodations: Issues and challenges for the future","authors":"Lynda Taylor, Jayanti Banerjee","doi":"10.1177/02655322231186222","DOIUrl":"https://doi.org/10.1177/02655322231186222","url":null,"abstract":"Several papers reference the concept of universal design as the preferred theoretical foundation for language test design and development (Christensen et al., 2023; Kim et al., 2023; Guzman-Orth et al., 2023). This approach, originally derived from the field of architecture in the United States (Case, 2008), proposes a set of principles whereby assessments are intentionally and proactively designed from the earliest stage of construction to be maximally accessible to all users, regardless of any special needs they may have (cf., the planning and design of public buildings as disabled-friendly). In the context of language test design and construction, this typically entails giving all test takers access to a broad range of universal but optional tools (e.g., magnifier, colour overlay) to enhance test accessibility. Recent technological developments for text","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135605480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accommodations in language testing and assessment: Safeguarding equity, access, and inclusion
Pub Date: 2023-10-01 DOI: 10.1177/02655322231186221
Lynda Taylor, Jayanti Banerjee
… the development and implementation of a special accommodations policy associated with a large-scale, localized computer-based language test designed to assess the English skills needed in the Singaporean workplace context. They analyzed both operational test data and interview data to investigate three main lines of enquiry: different stakeholders’ perceptions of the appropriateness and effectiveness of the accommodations; the impact of the accommodations on test-takers’ future opportunities; and stakeholder perceptions of key factors that play a role in accommodations. Their findings prompted recommendations on improving special accommodations policy development, dissemination …
{"title":"Accommodations in language testing and assessment: Safeguarding equity, access, and inclusion","authors":"Lynda Taylor, Jayanti Banerjee","doi":"10.1177/02655322231186221","DOIUrl":"https://doi.org/10.1177/02655322231186221","url":null,"abstract":"the development and implementation of a special accommodations policy associated with a large-scale, localized computer-based language test designed to assess the English skills needed in the Singaporean workplace context. They analyzed both operational test data and interview data to investigate three main lines of enquiry: different stakeholders’ perceptions of the appropriateness and effectiveness of the accommodations; the impact of the accommodations on test-takers’ future opportunities; and stakeholder perceptions of key factors that play a role in accommodations. Their findings prompted recommendations on improving special accommodations policy development, dissemination","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135605482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Book review: J. Fox and N. Artemeva, Reconsidering Context in Language Assessment: Transdisciplinary Perspectives, Social Theories, and Validity
Pub Date: 2023-09-25 DOI: 10.1177/02655322231199501
Susy Macqueen
{"title":"Book review: J. Fox and N. Artemeva. <i>Reconsidering Context in Language Assessment: Transdisciplinary Perspectives, Social Theories, and Validity</i>","authors":"Susy Macqueen","doi":"10.1177/02655322231199501","DOIUrl":"https://doi.org/10.1177/02655322231199501","url":null,"abstract":"","PeriodicalId":17928,"journal":{"name":"Language Testing","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135815696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}