Evaluating rater judgments on ETIC Advanced writing tasks: An application of generalizability theory and Many-Facet Rasch Model
Jiayu Wang, Kaizhou Luo
Studies in Language Assessment, 2019. https://doi.org/10.58379/vmak1620
Developed by China Language Assessment (CLA), the English Test for International Communication Advanced (ETIC Advanced) assesses the ability to perform English language tasks in international workplace contexts. ETIC Advanced consists solely of writing and speaking tasks presented in an authentic constructed-response format. Because the extended responses elicited from candidates must be judged by human raters, rating quality is a critical issue. This study evaluated rater judgements on the writing tasks of ETIC Advanced. The data comprised scores from 186 candidates who completed all three writing tasks: Letter Writing, Report Writing, and Proposal Writing (n = 3,348 ratings). Six certified raters scored the performances using a six-point, three-category analytical rating scale. Generalizability theory (GT) and the Many-Facet Rasch Model (MFRM) were applied to analyse the scores from complementary perspectives. The GT results indicated that rater inconsistency and rater interactions with other facets accounted for a relatively low proportion of the overall score variance, and that the ratings were sufficiently dependable for generalization. The MFRM analysis revealed that the six raters differed significantly in severity yet remained internally consistent in their own judgements. Bias analyses indicated that the raters tended to assign more biased scores to low-proficiency candidates and to the Content category of the rating scale. The study demonstrates the use of both GT and MFRM to evaluate rater judgements on language performance tests, and its findings have implications for ETIC rater training.
{"title":"T. McNamara, U. Knoch & J. Fan. Fairness, Justice, and Language Assessment","authors":"Troy L. Cox","doi":"10.58379/nrax8588","DOIUrl":"https://doi.org/10.58379/nrax8588","url":null,"abstract":"<jats:p>n/a</jats:p>","PeriodicalId":29650,"journal":{"name":"Studies in Language Assessment","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73175295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Examination of CEFR-J spoken interaction tasks using many-facet Rasch measurement and generalizability theory
Rie Koizumi, Emiko Kaneko, E. Setoguchi, Yo In’nami
Studies in Language Assessment, 2019. https://doi.org/10.58379/bswy7332
Attempts are underway to develop prototype tasks based on the CEFR-J (Negishi, Takada, & Tono, 2013), a Japanese adaptation of the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001). As part of this larger project, the current paper reports on the creation of spoken interaction tasks for five levels (Pre-A1, A1.1, A1.2, A1.3, and A2.1). The tasks were undertaken by 66 Japanese university students. Two raters evaluated their interactions using a three-level holistic rating scale, and 20% of the performances were double rated. The spoken ratings were analysed using many-facet Rasch measurement (MFRM) and generalizability theory (G-theory). MFRM showed that all the tasks fit the Rasch model well, that the scale functioned satisfactorily, and that the difficulty of the tasks generally concurred with the intended CEFR-J levels. G-theory results from a p × t design, with tasks as a facet, showed the proportion of variance accounted for by tasks as well as the number of tasks that would be required to ensure sufficiently high reliability. The MFRM and G-theory results effectively revealed areas for improving the spoken interaction tasks and demonstrated the usefulness of combining the two methods for task development and revision.
An investigation of factors involved in Japanese students’ English learning behavior during test preparation
Takanori Sato
Studies in Language Assessment, 2019. https://doi.org/10.58379/fsbq6351
Japan has recently been promoting university entrance examination reform with the goal of positively influencing students’ English learning, but the extent to which entrance examinations themselves affect English learning is not known. The promotion of better learning requires changing the factors that affect learning behavior, rather than merely modifying existing examinations or introducing new ones. This study investigated the factors determining Japanese students’ English learning while they prepared for high-stakes university entrance examinations, aiming to construct a model that explicates how test-related and test-independent factors are intertwined. Semi-structured interviews were conducted with 14 first-year university students, asking how they had prepared for their examinations and why they had chosen particular preparation methods. After thematic analysis, four main factors in student learning behavior were identified (examination, student views, school, and examination-independent factors) and their relationships explored. The study findings provide useful insights for policymakers in English as a foreign language (EFL) educational contexts, where English tests are used as part of language education policies. Furthermore, the proposed model is theoretically important as it explains the complex washback mechanism and deepens our understanding of why intended washback effects on learning are not necessarily achieved.
Benchmarking video presentations for CEFR usage in Cuba
Geisa Dávila Pérez, Frank van Splunder, L. Baten, Jan Van Maele, Yoennis Díaz Moreno
Studies in Language Assessment, 2019. https://doi.org/10.58379/tvkg6591
This paper discusses language assessment by means of video recordings, particularly their use for benchmarking language proficiency in a Cuban academic context. It is based on videotaped oral presentation assignments produced by Cuban PhD students for peer and teacher assessment. To avoid bias and support the validity of the results, the videotaped presentations were rated by language testing experts from three Flemish universities that belong to the Interuniversity Testing Consortium (IUTC). A selection of these assignments will be transferred to the university Moodle platform, and this compilation may serve as the start of a Cuban corpus of internationally rated academic English presentations. The results will thus provide language teachers with a growing database of video recordings to facilitate benchmarking activities and promote standardized assessment in the Cuban academic context.
An investigation into rater performance with a holistic scale and a binary, analytic scale on an ESL writing placement test
Hyunji Hayley Park, Xun Yan
Studies in Language Assessment, 2019. https://doi.org/10.58379/nkdc1529
This two-phase, sequential mixed-methods study investigates how raters are influenced by different rating scales on a college-level English as a second language (ESL) writing placement test. In Phase I, nine certified raters rated 152 essays using a holistic, profile-based scale; in Phase II, they rated 200 essays using a binary, analytic scale developed from the holistic scale, plus 100 essays using both rating scales. Ratings were examined quantitatively through Rasch modeling and qualitatively via think-aloud protocols and semi-structured interviews. Findings from Phase I revealed that, despite satisfactory internal consistency, the raters showed relatively low agreement and individual differences in their use of the holistic scale. Findings from Phase II showed that the binary, analytic scale led to considerable improvement in rater consensus and rater consistency, and further suggested that it helped the raters deconstruct the holistic scale, reducing their cognitive burden. The study represents a creative use of a binary, analytic scale to guide raters through a holistic rating scale. Implications regarding how a rating scale affects rating behavior and performance are discussed.
Noun phrase complexity in integrated writing produced by advanced Chinese EFL learners
Lirong Xu
Studies in Language Assessment, 2019. https://doi.org/10.58379/lawy6296
This study investigates the relationship between the noun phrase complexity of advanced Chinese EFL learners’ integrated writing and the scores assigned by expert raters. The learners’ written performance was also compared with that of university-level native English speakers (NS), with particular reference to the use of noun phrases. One hundred and twenty integrated writing samples were collected from an English writing test administered in a southeastern province of China. Results showed a moderately positive correlation between the use of complex nominals in test-takers’ writing and the corresponding scores. Moreover, the non-native speaker (NNS) and NS groups differed significantly on the majority of noun phrase complexity measures. Implications are discussed for noun phrase complexity as a more reliable measure of syntactic complexity in integrated writing tests.
Fairness in language assessment: What can the Rasch model offer?
Jason Fan, U. Knoch
Studies in Language Assessment, 2019. https://doi.org/10.58379/jrwg5233
Drawing upon discussions of fairness in the field of language assessment, this systematic review study explores how the Rasch model has been used to investigate and enhance fairness in language assessment. To that end, we collected and systematically reviewed the empirical studies that used the Rasch model, published in four leading journals in the field from 2000 to 2018. A total of 139 articles were collected and subsequently coded in NVivo 11, using the open coding method. In addition, matrix coding analysis was implemented to explore the relationship between the topics that were identified and the language constructs that constituted the focus of the collected articles. Five broad themes were extracted from the coding process: 1) rater effects; 2) language test design and evaluation; 3) differential group performance; 4) evaluation of rating criteria; and 5) standard setting. Representative studies under each category were used to illustrate how the Rasch model was utilised to investigate test fairness. Findings of this study have important implications for language assessment development and evaluation. In addition, the findings identified a few avenues in the application of the Rasch model that language assessment researchers should explore in future studies.
Rater variability across examinees and rating criteria in paired speaking assessment
S. Youn
Studies in Language Assessment, 2018. https://doi.org/10.58379/yvwq3768
This study investigates rater variability with regard to examinees’ levels and rating criteria in paired speaking assessment. Twelve raters completed rater training and scored 102 examinees’ paired speaking performances using analytical rating criteria reflecting various features of paired speaking performance. The raters were fairly consistent in their overall ratings but differed in severity. Bias analyses using many-facet Rasch measurement revealed more rater bias interactions for the rating criteria than for the examinees’ levels or the pairing type, which reflects the level difference between the two paired examinees. In particular, the most challenging rating category, Language Use, attracted significant bias interactions. However, the raters did not display more frequent bias interactions with the interaction-specific rating categories, such as Engaging with Interaction and Turn Organization. Furthermore, the raters tended to reverse their severity patterns across the rating categories. In the rater-by-examinee bias interactions, the raters tended to show more frequent bias toward the low-level examinees. However, no significant rater bias was found for pairs consisting of a high-level and a low-level examinee. These findings have implications for rater training in paired speaking assessment.
Evaluating the relative effectiveness of online and face-to-face training for new writing raters
U. Knoch, J. Fairbairn, C. Myford, A. Huisman
Studies in Language Assessment, 2018. https://doi.org/10.58379/zvmm4117
Rater training for large-scale writing tests is commonly conducted face-to-face, but bringing raters together for training is difficult and expensive. For this reason, more and more testing agencies are exploring technological advances with the aim of providing training online, and a number of studies have examined whether online rater training is a feasible alternative to face-to-face training. This mixed-methods study compared two groups of new raters, one trained on an online training platform and the other trained using conventional face-to-face rater training procedures. Raters who passed accreditation were also compared on the reliability of their subsequent operational ratings. No significant differences in rating behaviour between the two groups were identified on the writing test. The qualitative data showed that, in general, the raters enjoyed both modes of training and felt sufficiently trained, although some specific problems were encountered. Operational ratings in the first five months after training likewise showed no significant differences between the two training groups. The paper concludes with implications for training raters in online environments and sets out a possible programme for further research.