Early Modern Multiloquent Authors (EMMA): Designing a large-scale corpus of individuals’ languages

ICAME journal : computers in English linguistics | Pub Date: 2019-03-01 | DOI: 10.2478/icame-2019-0004

P. Petré, Lynn Anthonissen, Sara Budts, Enrique Manjavacas, Emma-Louise Silva, William H. Standing, Odile A. O. Strik

Citations: 21

Abstract

The present article provides a detailed description of the corpus of Early Modern Multiloquent Authors (EMMA), as well as two small case studies that illustrate its benefits. As a large-scale specialized corpus, EMMA tries to strike the right balance between big data and sociolinguistic coverage. It comprises the writings of 50 carefully selected authors across five generations, mostly taken from 17th-century London society. EMMA enables the study of language as both a social and cognitive phenomenon and allows us to explore the interaction between the individual and aggregate levels. The first part of the article is a detailed description of EMMA’s first release as well as the sociolinguistic and methodological principles that underlie its design and compilation. We cover the conceptual decisions and practical implementations at various stages of the compilation process: from text markup, encoding and data preprocessing to metadata enrichment and verification. In the second part, we present two small case studies to illustrate how rich contextualization can guide the interpretation of quantitative corpus-linguistic findings. The first case study compares the past tense formation of strong verbs in writers without access to higher education to that of writers with extensive training in Latin. The second case study relates s/th-variation in the language of a single writer, Margaret Cavendish, to major shifts in her personal life.
