Title: Modelling crosslinguistic n‑gram correspondence in typologically different languages
Authors: Jiří Milička, V. Cvrček, L. Lukešová
Journal: Languages in Contrast (Language & Linguistics)
Published: 2021-01-12
DOI: 10.1075/lic.19018.mil (https://doi.org/10.1075/lic.19018.mil)
Citation count: 1
Abstract
N‑gram analysis (popularized e.g. by Biber et al., 1999) has become a common method for identifying recurrent language patterns. Although extracting n‑grams from a corpus may seem straightforward, it proves very challenging when applied cross-linguistically (cf. e.g. Ebeling and Ebeling, 2013; Granger and Lefer, 2013; Cermakova and Chlumska, 2017). The major issue is that the quantities of n‑grams of a given length do not correspond across typologically different languages. Consequently, n‑grams of a given length may function differently across languages, rendering a direct comparison inadequate. Our paper introduces a function capable of modelling the relation between the quantities of n‑grams in typologically distant languages, using the example of Czech and English (and some other language pairs). Based on our model, we can suggest which n‑gram lengths should be contrasted to better reflect the size of the n‑gram inventories in each language. The correspondence may not be intuitive (e.g. a Czech 2-gram may best correspond to an English 2.5-gram), but it still gives researchers a general guide as to what might be useful to include in their analysis (in this case, 2-grams in Czech and both 2- and 3-grams in English).
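The idea of a fractional correspondence such as "2.5-gram" can be illustrated with a minimal Python sketch. This is not the function introduced in the paper, which the abstract does not specify. It assumes, purely for illustration, that inventory size is measured as the number of distinct n-grams and that sizes between integer lengths are interpolated linearly on a log scale. The names `ngram_type_counts` and `corresponding_length` and the toy counts are hypothetical.

```python
import math


def ngram_type_counts(tokens, max_n):
    """Return {n: number of distinct n-grams} for n = 1..max_n."""
    return {
        n: len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
        for n in range(1, max_n + 1)
    }


def corresponding_length(source_counts, target_counts, n):
    """Estimate the (possibly fractional) target-language n-gram length whose
    inventory size matches that of source-language n-grams of length n.

    Assumption: log-linear interpolation between neighbouring integer
    lengths; this is an illustrative stand-in, not the paper's model.
    Returns None if the source size falls outside the target range.
    """
    goal = math.log(source_counts[n])
    lengths = sorted(target_counts)
    for lo, hi in zip(lengths, lengths[1:]):
        a = math.log(target_counts[lo])
        b = math.log(target_counts[hi])
        if a != b and a <= goal <= b:
            return lo + (goal - a) / (b - a)
    return None


# Made-up inventory sizes, not data from the paper.
source_counts = {1: 10, 2: 100}          # hypothetical "source" language
target_counts = {1: 10, 2: 50, 3: 200}   # hypothetical "target" language
estimate = corresponding_length(source_counts, target_counts, 2)
print(estimate)  # approximately 2.5 for these made-up counts
```

With a fractional estimate like this, a researcher would include both neighbouring integer lengths in the target language, which mirrors the abstract's suggestion of contrasting Czech 2-grams with English 2- and 3-grams.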
About the journal:
Languages in Contrast aims to publish contrastive studies of two or more languages. Any aspect of language may be covered, including vocabulary, phonology, morphology, syntax, semantics, pragmatics, text and discourse, stylistics, sociolinguistics and psycholinguistics. Languages in Contrast welcomes interdisciplinary studies, particularly those that make links between contrastive linguistics and translation, lexicography, computational linguistics, language teaching, literary and linguistic computing, literary studies and cultural studies.