Investigating Language Relationships in Multilingual Sentence Encoders Through the Lens of Linguistic Typology

IF 3.7 | CAS Tier 2 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Computational Linguistics | Pub Date: 2022-04-13 | DOI: 10.1162/coli_a_00444
Rochelle Choenni, Ekaterina Shutova
{"title":"Investigating Language Relationships in Multilingual Sentence Encoders Through the Lens of Linguistic Typology","authors":"Rochelle Choenni, Ekaterina Shutova","doi":"10.1162/coli_a_00444","DOIUrl":null,"url":null,"abstract":"Abstract Multilingual sentence encoders have seen much success in cross-lingual model transfer for downstream NLP tasks. The success of this transfer is, however, dependent on the model’s ability to encode the patterns of cross-lingual similarity and variation. Yet, we know relatively little about the properties of individual languages or the general patterns of linguistic variation that the models encode. In this article, we investigate these questions by leveraging knowledge from the field of linguistic typology, which studies and documents structural and semantic variation across languages. We propose methods for separating language-specific subspaces within state-of-the-art multilingual sentence encoders (LASER, M-BERT, XLM, and XLM-R) with respect to a range of typological properties pertaining to lexical, morphological, and syntactic structure. Moreover, we investigate how typological information about languages is distributed across all layers of the models. Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies. In addition, we propose a simple method to study how shared typological properties of languages are encoded in two state-of-the-art multilingual models—M-BERT and XLM-R. The results provide insight into their information-sharing mechanisms and suggest that these linguistic properties are encoded jointly across typologically similar languages in these models.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":"48 1","pages":"635-672"},"PeriodicalIF":3.7000,"publicationDate":"2022-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00444","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 12

Abstract

Multilingual sentence encoders have seen much success in cross-lingual model transfer for downstream NLP tasks. The success of this transfer is, however, dependent on the model’s ability to encode the patterns of cross-lingual similarity and variation. Yet, we know relatively little about the properties of individual languages or the general patterns of linguistic variation that the models encode. In this article, we investigate these questions by leveraging knowledge from the field of linguistic typology, which studies and documents structural and semantic variation across languages. We propose methods for separating language-specific subspaces within state-of-the-art multilingual sentence encoders (LASER, M-BERT, XLM, and XLM-R) with respect to a range of typological properties pertaining to lexical, morphological, and syntactic structure. Moreover, we investigate how typological information about languages is distributed across all layers of the models. Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies. In addition, we propose a simple method to study how shared typological properties of languages are encoded in two state-of-the-art multilingual models, M-BERT and XLM-R. The results provide insight into their information-sharing mechanisms and suggest that these linguistic properties are encoded jointly across typologically similar languages in these models.
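The abstract describes probing the layers of multilingual encoders such as M-BERT for typological properties of languages. As a rough, hypothetical illustration of that kind of layer-wise probing (not the authors' implementation; the model choice, example sentences, and WALS-style labels below are placeholders), a minimal sketch using HuggingFace transformers and scikit-learn might look like this:

```python
# Hedged sketch of layer-wise typological probing of M-BERT.
# Assumes the `transformers`, `torch`, and `scikit-learn` packages;
# sentences and labels are illustrative placeholders only.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

# Placeholder probing data: one sentence per language, each tagged with a
# hypothetical binary typological label (e.g., a WALS-style word-order feature).
sentences = ["The cat sleeps.", "Die Katze schläft.", "Кошка спит.", "猫が眠る。"]
labels = [0, 0, 1, 1]  # illustrative labels only, not real WALS values

def layer_representations(texts):
    """Mean-pooled sentence vectors for every layer: list of (n, dim) arrays."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    mask = enc["attention_mask"].unsqueeze(-1)   # mask out padding tokens
    reps = []
    for hidden in out.hidden_states:             # embedding layer + 12 layers
        pooled = (hidden * mask).sum(1) / mask.sum(1)
        reps.append(pooled.numpy())
    return reps

# One linear probe per layer: higher accuracy suggests the typological
# property is more linearly recoverable from that layer's representations.
for layer_idx, feats in enumerate(layer_representations(sentences)):
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, feats, labels, cv=2).mean()
    print(f"layer {layer_idx:2d}: probe accuracy {acc:.2f}")
```

In practice such a probe would be trained on many sentences per language, with splits controlled so that accuracy reflects the typological property rather than language identity.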
Source journal: Computational Linguistics (Engineering & Technology – Computer Science: Interdisciplinary Applications)
CiteScore: 15.80
Self-citation rate: 0.00%
Articles published per year: 45
Review time: >12 weeks
Journal description: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.
Latest articles in this journal:
Generation and Polynomial Parsing of Graph Languages with Non-Structural Reentrancies
Languages through the Looking Glass of BPE Compression
Capturing Fine-Grained Regional Differences in Language Use through Voting Precinct Embeddings
Machine Learning for Ancient Languages: A Survey
Statistical Methods for Annotation Analysis by Silviu Paun, Ron Artstein, and Massimo Poesio