Evaluating rater judgments on ETIC Advanced writing tasks: An application of generalizability theory and Many-Facet Rasch Model
Jiayu Wang, Kaizhou Luo
Studies in Language Assessment, 2019. https://doi.org/10.58379/vmak1620
Developed by China Language Assessment (CLA), the English Test for International Communication Advanced (ETIC Advanced) assesses the ability to perform English language tasks in international workplace contexts. ETIC Advanced consists solely of writing and speaking tasks presented in an authentic constructed-response format. Because the extended responses elicited from candidates must be judged by human raters, rating quality is a critical issue. This study evaluated rater judgements on the writing tasks of ETIC Advanced. The data comprised scores from 186 candidates who completed all three writing tasks: Letter Writing, Report Writing, and Proposal Writing (n = 3,348 ratings). Six certified raters scored the performances using a six-point, three-category analytical rating scale. Generalizability theory (GT) and the Many-Facet Rasch Model (MFRM) were applied to analyse the scores from complementary perspectives. The GT results indicated that rater inconsistency and rater interactions with other facets accounted for a relatively low proportion of the overall score variance, and that the ratings were sufficiently dependable for generalization. The MFRM analysis revealed that the six raters differed significantly in severity yet remained internally consistent in their own judgements. Bias analyses indicated that the raters tended to assign more biased scores to low-proficiency candidates and to the Content category of the rating scale. The study demonstrates the use of both GT and MFRM to evaluate rater judgements on language performance tests, and its findings have implications for ETIC rater training.
{"title":"T. McNamara, U. Knoch & J. Fan. Fairness, Justice, and Language Assessment","authors":"Troy L. Cox","doi":"10.58379/nrax8588","DOIUrl":"https://doi.org/10.58379/nrax8588","url":null,"abstract":"<jats:p>n/a</jats:p>","PeriodicalId":29650,"journal":{"name":"Studies in Language Assessment","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73175295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Examination of CEFR-J spoken interaction tasks using many-facet Rasch measurement and generalizability theory
Rie Koizumi, Emiko Kaneko, E. Setoguchi, Yo In’nami
Studies in Language Assessment, 2019. https://doi.org/10.58379/bswy7332
Attempts are underway to develop prototype tasks based on the CEFR-J (Negishi, Takada, & Tono, 2013), a Japanese adaptation of the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001). As part of this larger project, the current paper reports on the creation of spoken interaction tasks for five levels (Pre-A1, A1.1, A1.2, A1.3, and A2.1). The tasks were undertaken by 66 Japanese university students. Two raters evaluated their interactions using a three-level holistic rating scale, and 20% of the performances were double rated. The spoken ratings were analysed using many-facet Rasch measurement (MFRM) and generalizability theory (G-theory). MFRM showed that all the tasks fit the Rasch model well, that the scale functioned satisfactorily, and that the difficulty of the tasks generally concurred with the intended CEFR-J levels. G-theory results from a p × t design, with tasks as a facet, showed the proportion of variance accounted for by tasks as well as the number of tasks that would be required to ensure sufficiently high reliability. The MFRM and G-theory results effectively revealed areas for improving the spoken interaction tasks and demonstrated the usefulness of combining the two methods for task development and revision.
An investigation of factors involved in Japanese students’ English learning behavior during test preparation
Takanori Sato
Studies in Language Assessment, 2019. https://doi.org/10.58379/fsbq6351
Japan has recently been promoting university entrance examination reform with the goal of positively influencing students’ English learning, but the extent to which entrance examinations themselves affect English learning is not known. The promotion of better learning requires changing the factors that affect learning behavior, rather than merely modifying existing examinations or introducing new ones. This study investigated the factors determining Japanese students’ English learning while they prepared for high-stakes university entrance examinations, aiming to construct a model that explicates how test-related and test-independent factors are intertwined. Semi-structured interviews were conducted with 14 first-year university students, asking how they had prepared for their examinations and why they had chosen particular preparation methods. After thematic analysis, four main factors in student learning behavior were identified (examination, student views, school, and examination-independent factors) and their relationships explored. The study findings provide useful insights for policymakers in English as a foreign language (EFL) educational contexts, where English tests are used as part of language education policies. Furthermore, the proposed model is theoretically important as it explains the complex washback mechanism and deepens our understanding of why intended washback effects on learning are not necessarily achieved.
Benchmarking video presentations for CEFR usage in Cuba
Geisa Dávila Pérez, Frank van Splunder, L. Baten, Jan Van Maele, Yoennis Díaz Moreno
Studies in Language Assessment, 2019. https://doi.org/10.58379/tvkg6591
This paper discusses language assessment by means of video recordings, particularly their use for benchmarking language proficiency in a Cuban academic context. It is based on videotaped oral presentation assignments produced by Cuban PhD students for peer and teacher assessment. To avoid bias and support the validity of the results, the videotaped presentations were rated by language testing experts from three Flemish universities that belong to the Interuniversity Testing Consortium (IUTC). A selection of these assignments will be transferred to the university Moodle platform, and this compilation may serve as the start of a Cuban corpus of internationally rated academic English presentations. The results will thus provide language teachers with a growing database of video recordings to facilitate benchmarking activities and promote standardized assessment in the Cuban academic context.
An investigation into rater performance with a holistic scale and a binary, analytic scale on an ESL writing placement test
Hyunji Hayley Park, Xun Yan
Studies in Language Assessment, 2019. https://doi.org/10.58379/nkdc1529
This two-phase, sequential mixed-methods study investigates how raters are influenced by different rating scales on a college-level English as a second language (ESL) writing placement test. In Phase I, nine certified raters rated 152 essays using a holistic, profile-based scale; in Phase II, they rated 200 essays using a binary, analytic scale developed from the holistic scale, plus 100 essays using both rating scales. Ratings were examined quantitatively through Rasch modeling and qualitatively via think-aloud protocols and semi-structured interviews. Findings from Phase I revealed that, despite satisfactory internal consistency, the raters showed relatively low agreement and individual differences in their use of the holistic scale. Findings from Phase II showed that the binary, analytic scale led to considerable improvement in rater consensus and rater consistency, and further suggested that it helped the raters deconstruct the holistic scale, reducing their cognitive burden. The study represents a creative use of a binary, analytic scale to guide raters through a holistic rating scale. Implications regarding how a rating scale affects rating behavior and performance are discussed.
Noun phrase complexity in integrated writing produced by advanced Chinese EFL learners
Lirong Xu
Studies in Language Assessment, 2019. https://doi.org/10.58379/lawy6296
This study investigates the relationship between the noun phrase complexity of advanced Chinese EFL learners’ integrated writing and the scores assigned by expert raters. The learners’ written performance was also compared with that of university-level native English speakers (NS), with particular reference to the use of noun phrases. One hundred and twenty integrated writing samples were collected from an English writing test administered in a southeastern province of China. Results showed a moderately positive correlation between the use of complex nominals in test-takers’ writing and the corresponding scores. Moreover, the non-native speaker (NNS) and NS groups differed significantly on the majority of noun phrase complexity measures. Implications are discussed for noun phrase complexity as a more reliable measure of syntactic complexity in integrated writing tests.
Fairness in language assessment: What can the Rasch model offer?
Jason Fan, U. Knoch
Studies in Language Assessment, 2019. https://doi.org/10.58379/jrwg5233
Drawing upon discussions of fairness in the field of language assessment, this systematic review study explores how the Rasch model has been used to investigate and enhance fairness in language assessment. To that end, we collected and systematically reviewed the empirical studies that used the Rasch model, published in four leading journals in the field from 2000 to 2018. A total of 139 articles were collected and subsequently coded in NVivo 11, using the open coding method. In addition, matrix coding analysis was implemented to explore the relationship between the topics that were identified and the language constructs that constituted the focus of the collected articles. Five broad themes were extracted from the coding process: 1) rater effects; 2) language test design and evaluation; 3) differential group performance; 4) evaluation of rating criteria; and 5) standard setting. Representative studies under each category were used to illustrate how the Rasch model was utilised to investigate test fairness. Findings of this study have important implications for language assessment development and evaluation. In addition, the findings identified a few avenues in the application of the Rasch model that language assessment researchers should explore in future studies.
Rater variability across examinees and rating criteria in paired speaking assessment
S. Youn
Studies in Language Assessment, 2018. https://doi.org/10.58379/yvwq3768
This study investigates rater variability with regard to examinees’ levels and rating criteria in paired speaking assessment. Twelve raters completed rater training and scored 102 examinees’ paired speaking performances using analytical rating criteria reflecting various features of paired speaking performance. The raters were fairly consistent in their overall ratings but differed in severity. Bias analyses using many-facet Rasch measurement revealed more rater bias interactions for the rating criteria than for the examinees’ levels or the pairing type, which reflects the level difference between the two paired examinees. In particular, the most challenging rating category, Language Use, attracted significant bias interactions. However, the raters did not display more frequent bias interactions with the interaction-specific rating categories, such as Engaging with Interaction and Turn Organization. Furthermore, the raters tended to reverse their severity patterns across the rating categories. In the rater-by-examinee bias interactions, the raters tended to show more frequent bias toward the low-level examinees. However, no significant rater bias was found for pairs consisting of a high-level and a low-level examinee. These findings have implications for rater training in paired speaking assessment.
Evaluating the relative effectiveness of online and face-to-face training for new writing raters
U. Knoch, J. Fairbairn, C. Myford, A. Huisman
Studies in Language Assessment, 2018. https://doi.org/10.58379/zvmm4117
Rater training for large-scale writing tests is commonly conducted face-to-face, but bringing raters together for training is difficult and expensive. For this reason, more and more testing agencies are exploring technological advances with the aim of providing training online, and a number of studies have examined whether online rater training is a feasible alternative to face-to-face training. This mixed-methods study compared two groups of new raters, one trained on an online training platform and the other trained using conventional face-to-face rater training procedures. Raters who passed accreditation were also compared on the reliability of their subsequent operational ratings. No significant differences in rating behaviour between the two groups were identified on the writing test. The qualitative data showed that, in general, the raters enjoyed both modes of training and felt sufficiently trained, although some specific problems were encountered. Operational ratings in the first five months after training likewise showed no significant differences between the two training groups. The paper concludes with implications for training raters in online environments and sets out a possible programme for further research.