Corpus-based Approaches to Spoken L2 Production

IF 1.1 · LANGUAGE & LINGUISTICS · International Journal of Learner Corpus Research · Pub Date: 2019-09-24 · DOI: 10.1075/ijlcr.00008.int
V. Brezina, Dana Gablasova, Tony McEnery
{"title":"Corpus-based Approaches to Spoken L2 Production","authors":"V. Brezina, Dana Gablasova, Tony McEnery","doi":"10.1075/ijlcr.00008.int","DOIUrl":null,"url":null,"abstract":"From the perspective of the compilers, a corpus is a journey. This particular journey – the process of the design and compilation of the Trinity Lancaster Corpus (TLC), the largest spoken learner corpus of (interactive) English to date – took over five years. It involved more than 3,500 hours of transcription time1 with many more hours spent on quality checking and post-processing of the data. This simple statistic shows why learner corpora of spoken language are still relatively rare, despite the fact that they provide a unique insight into spontaneous language production (McEnery, Brezina, Gablasova & Banerjee 2019). While the advances in computational technology allow better data processing and more efficient analysis, the starting point of a spoken (learner) corpus is still the recording of speech and its manual transcription. This method is considerably more reliable in capturing the details of spoken language than any existing voice recognition system. This is true for spoken L1 (McEnery 2018) as well as spoken L2 data (Gilquin 2015). The difference between the performance of an experienced transcriber and a state-ofthe-art automated system is immediately obvious from the comparison shown in Table 1. For meaningful linguistic analysis, only the sample transcript shown on the left (from the TLC) is suitable as it represents an accurate account of the spoken production. Building a spoken learner corpus is thus a resource-intensive project. The compilation of the TLC was made possible by research collaboration between Lancaster University and Trinity College London, a major international testing board. The project was supported by the Economic and Social Research Council (ESRC) and Trinity College London.2","PeriodicalId":29715,"journal":{"name":"International Journal of Learner Corpus Research","volume":null,"pages":null},"PeriodicalIF":1.1000,"publicationDate":"2019-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Learner Corpus Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1075/ijlcr.00008.int","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"0","JCRName":"LANGUAGE & LINGUISTICS","Score":null,"Total":0}
Citations: 6

Abstract

From the perspective of the compilers, a corpus is a journey. This particular journey – the process of the design and compilation of the Trinity Lancaster Corpus (TLC), the largest spoken learner corpus of (interactive) English to date – took over five years. It involved more than 3,500 hours of transcription time1 with many more hours spent on quality checking and post-processing of the data. This simple statistic shows why learner corpora of spoken language are still relatively rare, despite the fact that they provide a unique insight into spontaneous language production (McEnery, Brezina, Gablasova & Banerjee 2019). While advances in computational technology allow better data processing and more efficient analysis, the starting point of a spoken (learner) corpus is still the recording of speech and its manual transcription. This method is considerably more reliable in capturing the details of spoken language than any existing voice recognition system. This is true for spoken L1 (McEnery 2018) as well as spoken L2 data (Gilquin 2015). The difference between the performance of an experienced transcriber and a state-of-the-art automated system is immediately obvious from the comparison shown in Table 1. For meaningful linguistic analysis, only the sample transcript shown on the left (from the TLC) is suitable, as it represents an accurate account of the spoken production. Building a spoken learner corpus is thus a resource-intensive project. The compilation of the TLC was made possible by research collaboration between Lancaster University and Trinity College London, a major international testing board. The project was supported by the Economic and Social Research Council (ESRC) and Trinity College London.2
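The comparison in Table 1 is not reproduced here, but the gap between a manual reference transcript and automated speech recognition output is commonly quantified as word error rate (WER). The sketch below is a minimal, hypothetical illustration and is not taken from the article: the example utterances and the word_error_rate helper are invented for demonstration only.

```python
# A minimal sketch (not from the article) of quantifying the gap between a
# manual reference transcript and an ASR hypothesis using word error rate.
# The utterances below are invented; TLC transcripts are not reproduced here.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

if __name__ == "__main__":
    # Hesitations and repairs typical of spontaneous L2 speech are exactly
    # what automated systems tend to drop or normalise away.
    manual = "er i think er it is very important to to practise speaking"
    asr = "i think it is very important to practice speaking"
    print(f"WER: {word_error_rate(manual, asr):.2f}")
```

On this invented pair, dropped hesitations and a single normalised spelling already push the WER above 30%, illustrating the kind of detail loss that, as the authors note, makes automated transcripts unsuitable for fine-grained linguistic analysis.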