Fine-grained Multi-lingual Disentangled Autoencoder for Language-agnostic Representation Learning

Zetian Wu, Zhongkai Sun, Zhengyang Zhao, Sixing Lu, Chengyuan Ma, Chenlei Guo
DOI: 10.18653/v1/2022.mmnlu-1.2
Published in: Proceedings of the Massively Multilingual Natural Language Understanding Workshop (MMNLU-22)
Citations: 0

Abstract

Encoding both language-specific and language-agnostic information into a single high-dimensional space is a common practice of pre-trained Multi-lingual Language Models (pMLM). Such encoding has been shown to perform effectively on natural language tasks requiring semantics of the whole sentence (e.g., translation). However, its effectiveness appears to be limited on tasks requiring partial information of the utterance (e.g., multi-lingual entity retrieval, template retrieval, and semantic alignment). In this work, a novel Fine-grained Multilingual Disentangled Autoencoder (FMDA) is proposed to disentangle fine-grained semantic information from language-specific information in a multi-lingual setting. FMDA is capable of successfully extracting the disentangled template semantic and residual semantic representations. Experiments conducted on the MASSIVE dataset demonstrate that the disentangled encodings can boost each other during training, thus consistently outperforming the original pMLM and the strong language-disentanglement baseline on monolingual template retrieval and cross-lingual semantic retrieval tasks across multiple languages.
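The abstract does not include implementation details, but the core retrieval idea it describes — splitting a sentence embedding into a language-specific part and a language-agnostic (template-semantic) part, and matching on the latter only — can be illustrated with a minimal toy sketch. Everything below (the half/half embedding layout, the example sentences, and the vectors) is hypothetical and not taken from the paper:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def template_part(vec):
    # Hypothetical layout: the second half of the vector holds the
    # language-agnostic template semantics, the first half the
    # language-specific signal.
    return vec[len(vec) // 2:]

# Toy "disentangled" embeddings: the same template in two languages,
# plus a same-language distractor with different semantics.
emb = {
    "play music (en)":   [1.0, 0.0, 0.9, 0.1],
    "play música (es)":  [0.0, 1.0, 0.9, 0.1],
    "set an alarm (en)": [1.0, 0.0, 0.1, 0.9],
}

query = "play music (en)"
others = [k for k in emb if k != query]

# Matching on the full embedding is pulled toward the same-language
# distractor; matching on the template part recovers the cross-lingual pair.
best_full = max(others, key=lambda k: cosine(emb[query], emb[k]))
best_tmpl = max(others, key=lambda k: cosine(template_part(emb[query]),
                                             template_part(emb[k])))
```

With these toy vectors, `best_full` is the same-language distractor while `best_tmpl` is the Spanish sentence sharing the template — the behavior the paper's cross-lingual semantic retrieval experiments aim for, here only as a schematic illustration.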