Analyzing Indo-European Language Similarities Using Document Vectors

Informatics | IF 2.8 | JCR Q2 (Computer Science, Interdisciplinary Applications) | Pub Date: 2023-09-26 | DOI: 10.3390/informatics10040076

Samuel R. Schrader, Eren Gultepe


Citations: 0

Abstract



The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without the use of language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing 4 language families: Indo-Iranian, Slavic, Germanic, and Romance. This text corpus consists of a set of 532,092 Bible verses, with 24,186 identical verses translated into each language. The methods are (A) hierarchical clustering using distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods using a ground-truth tree and language families derived from said tree. All three achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). This shows that document vectors can be used to capture and compare linguistic features of multilingual texts, and thus could help extend language similarity and other translation studies research.
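As a rough illustration of method (A), the sketch below averages document vectors into one centroid per language, clusters the centroids hierarchically, and scores each family. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not name a distance metric, linkage rule, or exact F-score definition, so cosine distance, average linkage, and a best-matching-cluster F-score are assumptions here, and the language names and vectors are synthetic stand-ins for the Bible-verse document vectors.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def language_centroids(vectors, labels):
    """Average all document vectors of each language into one centroid."""
    labels = np.asarray(labels)
    languages = sorted(set(labels.tolist()))
    centroids = np.vstack([vectors[labels == lang].mean(axis=0) for lang in languages])
    return languages, centroids


def cluster_centroids(centroids, n_clusters):
    """Build a tree over the centroids and cut it into n_clusters groups."""
    condensed = pdist(centroids, metric="cosine")  # assumed metric
    tree = linkage(condensed, method="average")    # assumed linkage rule
    assignments = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, assignments


def family_f_score(languages, assignments, families, family):
    """Assumed scoring: best F-score of any predicted cluster against one family."""
    members = {lang for lang in languages if families[lang] == family}
    best = 0.0
    for cluster_id in set(assignments.tolist()):
        cluster = {lang for lang, c in zip(languages, assignments) if c == cluster_id}
        overlap = len(cluster & members)
        if overlap == 0:
            continue
        precision = overlap / len(cluster)
        recall = overlap / len(members)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Synthetic stand-in data: two hypothetical families with two languages each.
rng = np.random.default_rng(0)
families = {
    "lang_a1": "family_a", "lang_a2": "family_a",
    "lang_b1": "family_b", "lang_b2": "family_b",
}
family_direction = {
    "family_a": np.array([1.0, 0.0, 0.0]),
    "family_b": np.array([0.0, 1.0, 0.0]),
}
blocks, labels = [], []
for lang, fam in families.items():
    language_mean = family_direction[fam] + rng.normal(scale=0.05, size=3)
    blocks.append(language_mean + rng.normal(scale=0.1, size=(50, 3)))
    labels += [lang] * 50
vectors = np.vstack(blocks)

languages, centroids = language_centroids(vectors, labels)
tree, assignments = cluster_centroids(centroids, n_clusters=2)
scores = {fam: family_f_score(languages, assignments, families, fam)
          for fam in family_direction}
print(dict(zip(languages, assignments.tolist())), scores)
```

On the real data the same steps would run over 22 centroids cut into 4 clusters, and the linkage matrix `tree` could be passed to `scipy.cluster.hierarchy.dendrogram` to draw the phylogenetic tree. The corpus figures in the abstract are consistent with this layout: 24,186 verses in each of 22 languages gives 532,092 document vectors.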


Source journal: Informatics (Social Sciences-Communication)
CiteScore: 6.60
Self-citation rate: 6.50%
Review time: 6 weeks