Language Embeddings Sometimes Contain Typological Generalizations

IF 5.3, Zone 2 (Computer Science), Computational Linguistics, Pub Date: 2023-09-29, DOI: 10.1162/coli_a_00491

Robert Östling, Murathan Kurfalı

Abstract

To what extent can neural network models learn generalizations about language structure, and how do we find out what they have learned? We explore these questions by training neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1,295 languages. The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features obtained through annotation projection. We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most of our models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations. Careful attention to details in the evaluation turns out to be essential to avoid false positives. Furthermore, to encourage continued work in this field, we release several resources covering most or all of the languages in our data: (1) multiple sets of language representations, (2) multilingual word embeddings, (3) projected and predicted syntactic and morphological features, (4) software to provide linguistically sound evaluations of language representations.

Source journal

Computational Linguistics Computer Science-Artificial Intelligence

Self-citation rate

0.00%

Journal description: Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.