Modelling crosslinguistic n‑gram correspondence in typologically different languages

Languages in Contrast | LANGUAGE & LINGUISTICS | IF 0.5 | Pub Date: 2021-01-12 | DOI: 10.1075/lic.19018.mil
Jiří Milička, V. Cvrček, L. Lukešová
Citations: 1

Abstract

N‑gram analysis (popularized e.g. by Biber et al., 1999) has become a popular method for the identification of recurrent language patterns. Although the extraction of n‑grams from a corpus may seem straightforward, it proves to be very challenging when applied cross-linguistically (cf. e.g. Ebeling and Ebeling, 2013; Granger and Lefer, 2013; Cermakova and Chlumska, 2017). The major issue is that the quantities of n‑grams of a certain length in typologically different languages do not correspond. Consequently, n‑grams of a given length may function differently across languages, rendering a direct comparison inadequate. Our paper introduces a function capable of modelling the relation between the quantities of n‑grams in typologically distant languages, using the example of Czech and English (and some other language pairs). Based on our model, we can suggest what n‑gram lengths should be contrasted to better reflect the size of n‑gram inventories in each language. The correspondence may not be intuitive (e.g. a Czech 2-gram may best correspond to an English 2.5-gram), but it still provides researchers with a general guide as to what might be useful to include in their analysis (e.g. in this case 2-grams in Czech and 2- and 3-grams in English).
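The idea of a fractional n‑gram length can be illustrated with a small sketch. The abstract does not give the authors' function, so the code below is not their model: it is a stand-in that assumes inventory size can be interpolated linearly in log space between neighbouring integer lengths. The helper names `ngram_types` and `matching_length` are hypothetical. Given the number of distinct n‑grams of one length in language A, it looks for the (possibly fractional) length in language B whose interpolated inventory has the same size.

```python
import math


def ngram_types(tokens, n):
    """Return the number of distinct n-grams (types) in a token sequence."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})


def matching_length(target_size, sizes):
    """Find a fractional n whose interpolated inventory size equals target_size.

    `sizes` maps integer n -> number of n-gram types in the other language.
    ASSUMPTION (not from the paper): log(size) is interpolated linearly
    between neighbouring integer lengths.
    Returns None if target_size is not bracketed by two observed sizes.
    """
    if target_size <= 0:
        return None
    lengths = sorted(sizes)
    for lo, hi in zip(lengths, lengths[1:]):
        a, b = sizes[lo], sizes[hi]
        if a <= 0 or b <= 0:
            continue  # log is undefined for empty inventories
        if min(a, b) <= target_size <= max(a, b):
            if a == b:
                return float(lo)
            frac = (math.log(target_size) - math.log(a)) / (math.log(b) - math.log(a))
            return lo + frac * (hi - lo)
    return None


# Made-up inventory sizes for illustration only (not results from the paper).
toy_sizes_b = {1: 100, 2: 1000, 3: 10000}
toy_target_a = 3000  # hypothetical number of 2-gram types in language A
print(matching_length(toy_target_a, toy_sizes_b))
```

A fractional result between 2 and 3 would suggest, in the spirit of the abstract, examining both 2-grams and 3-grams in language B. Counts of n‑gram types grow with corpus size, so a comparison along these lines is presumably only meaningful on corpora of comparable size, such as aligned parallel texts.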
Source journal: Languages in Contrast (LANGUAGE & LINGUISTICS)
CiteScore: 1.50
Self-citation rate: 40.00%
Articles published: 12
About the journal: Languages in Contrast aims to publish contrastive studies of two or more languages. Any aspect of language may be covered, including vocabulary, phonology, morphology, syntax, semantics, pragmatics, text and discourse, stylistics, sociolinguistics and psycholinguistics. Languages in Contrast welcomes interdisciplinary studies, particularly those that make links between contrastive linguistics and translation, lexicography, computational linguistics, language teaching, literary and linguistic computing, literary studies and cultural studies.