Title: Towards robust complexity indices in linguistic typology
Authors: Y. Oh, F. Pellegrino
Journal: Studies in Language (JCR category: Language & Linguistics; impact factor: 0.5)
Publication date: 2022-12-20
DOI: 10.1075/sl.22034.oh (https://doi.org/10.1075/sl.22034.oh)
Open access: no
Citations: 0
Abstract
Corpus-based approaches to language complexity are widely expected to contribute to explaining linguistic diversity. Several complexity indices have consequently been proposed to compare languages along different dimensions, especially in phonology and morphology. However, their robustness against changes in corpus size and content has not been systematically assessed, which impedes comparability between studies. Here, we systematically test the robustness of four complexity indices estimated from raw texts, two of them routinely used in cross-linguistic studies (Type-Token Ratio and word-level Entropy) and two more recently proposed (Word Information Density and Lexical Diversity). Our results on 47 languages strongly suggest that the traditional indices are more prone to fluctuation than the newer ones. Additionally, we confirm with Word Information Density the existence of a cross-linguistic trade-off between word-internal and across-word distributions of information. Finally, we implement a proof of concept suggesting that modern deep-learning language models can improve comparability across languages with non-parallel datasets.
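As an illustration of the two traditional indices named in the abstract, the sketch below computes a Type-Token Ratio and a word-level (unigram) Shannon entropy on growing prefixes of a token list. It uses the common textbook definitions and a plug-in entropy estimate, which may differ from the estimation procedure used in the paper. The function names, the toy corpus, and the prefix sizes are hypothetical and chosen only for illustration; Word Information Density and Lexical Diversity are not sketched because the abstract does not define them.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Distinct word forms (types) divided by total word count (tokens)."""
    if not tokens:
        raise ValueError("token list is empty")
    return len(set(tokens)) / len(tokens)


def word_entropy(tokens):
    """Plug-in Shannon entropy, in bits, of the unigram word distribution."""
    if not tokens:
        raise ValueError("token list is empty")
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def indices_by_prefix(tokens, sizes):
    """Compute both indices on the first `size` tokens for each requested size."""
    return {
        size: (type_token_ratio(tokens[:size]), word_entropy(tokens[:size]))
        for size in sizes
        if size <= len(tokens)
    }


# Hypothetical toy corpus: one 13-token sentence repeated 50 times.
toy_tokens = ("the cat sat on the mat and the dog sat on the rug " * 50).split()
results = indices_by_prefix(toy_tokens, [13, 130, 650])

for size, (ttr, entropy) in results.items():
    print(f"{size:>4} tokens  TTR={ttr:.4f}  H={entropy:.4f} bits")
```

On this deliberately repetitive toy input the Type-Token Ratio shrinks as the prefix grows, because the vocabulary stops growing while the token count keeps increasing. Real corpora behave less extremely, but the same dependence on corpus size is the kind of fluctuation the abstract refers to.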
About the journal:
Studies in Language provides a forum for the discussion of issues in contemporary linguistics from discourse-pragmatic, functional, and typological perspectives. Areas of central concern are: discourse grammar; syntactic, morphological and semantic universals; pragmatics; grammaticalization and grammaticalization theory; and the description of problems in individual languages from a discourse-pragmatic, functional, and typological perspective. Special emphasis is placed on works which contribute to the development of discourse-pragmatic, functional, and typological theory and which explore the application of empirical methodology to the analysis of grammar.