Lexical diversity as a lens into the classification of Slavic languages: A quantitative typology perspective
Chenliang Zhou, Haitao Liu
Digital Scholarship in the Humanities, published 2023-06-09
DOI: https://doi.org/10.1093/llc/fqad042
This study proposes a language classification method grounded in quantitative typology. It draws on a large-scale multilingual parallel corpus to obtain valid classification results by excluding the influence of covariates such as text genre and semantic content in cross-language comparison. To this end, we model the type–token relationship of each Slavic parallel text and compute lexical diversity as an approximation of the language's morphological complexity. We then cluster the languages automatically on the basis of these lexical diversity metrics. Our findings show that (1) the lexical diversity metrics capture well where a language lies on the ‘analytism-synthetism’ continuum; (2) the automatic clustering based on these metrics effectively reflects the genealogical classification of Slavic languages; and (3) the geographical distribution of lexical diversity across the region where Slavic languages are spoken increases monotonically from southwest to northeast, consistent with the pattern that previous authors have found on a global scale. The approach is data-driven, which makes it independent of theoretical assumptions and easy to process computationally. It can offer better insight into corpus-based typology and may shed light on language as a human-driven complex adaptive system.
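The pipeline described in the abstract (type–token relationship, then a lexical diversity score, then clustering) can be illustrated with a minimal sketch. The abstract does not name the exact metrics or clustering algorithm used, so everything here is an assumption: the plain type-token ratio stands in for the lexical diversity metrics, `cluster_1d` is a toy agglomerative procedure on a single score, and the labels `lang_A`, `lang_B`, `lang_C` with their token lists are hypothetical placeholders, not data from the study.

```python
def type_token_curve(tokens):
    """Return (n_tokens, n_types) pairs as the text is read left to right."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((i, len(seen)))
    return curve


def ttr(tokens):
    """Plain type-token ratio: distinct word forms / total word forms."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def cluster_1d(scores, k):
    """Toy agglomerative clustering of labels on one numeric score.

    Starts with one cluster per label (sorted by score) and repeatedly
    merges the two neighbouring clusters whose means are closest,
    until k clusters remain. Assumes 1 <= k <= len(scores).
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    clusters = [[name] for name, _ in ordered]
    means = [value for _, value in ordered]
    while len(clusters) > k:
        gaps = [means[i + 1] - means[i] for i in range(len(means) - 1)]
        i = gaps.index(min(gaps))
        merged = clusters[i] + clusters[i + 1]
        clusters[i:i + 2] = [merged]
        means[i:i + 2] = [sum(scores[n] for n in merged) / len(merged)]
    return clusters


# Hypothetical parallel "texts": placeholder tokens only, not real corpus data.
texts = {
    "lang_A": "w1 w2 w1 w3 w1 w2 w1 w3".split(),
    "lang_B": "w1 w2 w1 w3 w4 w2 w1 w3".split(),
    "lang_C": "w1 w2 w3 w4 w5 w6 w7 w8".split(),
}
scores = {name: ttr(tokens) for name, tokens in texts.items()}
groups = cluster_1d(scores, k=2)
```

Because the texts are parallel, the same content underlies every token list, so a higher score here would point to more distinct word forms for the same meaning, which is the intuition behind using lexical diversity as a proxy for morphological complexity. Plain TTR is sensitive to text length, so a real analysis would likely need length-controlled measures.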
About the journal:
Digital Scholarship in the Humanities (DSH) is an international, peer-reviewed journal that publishes original contributions on all aspects of digital scholarship in the Humanities, including, but not limited to, the field currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research and include results of research projects; descriptions and evaluations of tools, techniques, and methodologies; and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.