Studying language evolution in the age of big data

Journal of Language Evolution (Language & Linguistics; IF 2.1); Pub Date: 2018-06-08; DOI: 10.1093/JOLE/LZY004

Tanmoy Bhattacharya, Nancy Retzlaff, Damián E. Blasi, W. Bruce Croft, Michael Cysouw, D. Hruschka, I. Maddieson, Lydia Müller, E. Smith, P. Stadler, George Starostin, Hyejin Youn


Citations: 13

Abstract

The increasing availability of large digital corpora of cross-linguistic data is revolutionizing many branches of linguistics. Overall, it has triggered a shift of attention from detailed questions about individual features to more global patterns amenable to rigorous, but statistical, analyses. This engenders an approach based on successive approximations where models with simplified assumptions result in frameworks that can then be systematically refined, always keeping explicit the methodological commitments and the assumed prior knowledge. Therefore, they can resolve disputes between competing frameworks quantitatively by separating the support provided by the data from the underlying assumptions. These methods, though, often appear as a ‘black box’ to traditional practitioners. In fact, the switch to a statistical view complicates comparison of the results from these newer methods with traditional understanding, sometimes leading to misinterpretation and overly broad claims. We describe here this evolving methodological shift, attributed to the advent of big, but often incomplete and poorly curated data, emphasizing the underlying similarity of the newer quantitative to the traditional comparative methods and discussing when and to what extent the former have advantages over the latter. In this review, we cover briefly both randomization tests for detecting patterns in a largely model-independent fashion and phylolinguistic methods for a more model-based analysis of these patterns. We foresee a fruitful division of labor between the ability to computationally process large volumes of data and the trained linguistic insight identifying worthy prior commitments and interesting hypotheses in need of comparison.
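
The abstract names randomization tests but does not spell one out. As a minimal sketch only, and not the authors' procedure, the Python below shows one common form: a permutation test asking whether two binary features co-occur across languages more often than shuffled data would suggest. The function name `permutation_test`, its parameters, and the toy feature vectors are all assumptions made for illustration.

```python
import random


def permutation_test(xs, ys, n_permutations=10_000, seed=0):
    """Permutation test for co-occurrence of two binary features.

    xs, ys: equal-length lists of 0/1 values, one pair per language
    (hypothetical data layout, assumed for this sketch).
    Returns (observed_count, p_value), where p_value estimates how often
    randomly re-paired data co-occur at least as often as observed.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    rng = random.Random(seed)

    # Test statistic: number of languages that have both features.
    observed = sum(x * y for x, y in zip(xs, ys))

    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling ys breaks any link between the two features
        # while keeping each feature's overall frequency fixed.
        rng.shuffle(shuffled)
        count = sum(x * y for x, y in zip(xs, shuffled))
        if count >= observed:
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy data, invented for illustration: 20 languages, two features.
    feature_a = [1] * 10 + [0] * 10
    feature_b = [1] * 10 + [0] * 10
    print(permutation_test(feature_a, feature_b, n_permutations=2000))
```

Shuffling treats the languages as exchangeable, which ignores shared ancestry and contact between them; more model-based approaches, such as the phylolinguistic methods the abstract mentions, are one way to account for that relatedness.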




Source journal

Journal of Language Evolution Social Sciences-Linguistics and Language

CiteScore

4.50

Self-citation rate

7.70%

Articles published