基于多任务学习的印度语言浅层解析

IF 1.8 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2024-05-11 DOI:10.1145/3664620

Pruthwik Mishra, Vandan Mujadia

{"title":"基于多任务学习的印度语言浅层解析","authors":"Pruthwik Mishra, Vandan Mujadia","doi":"10.1145/3664620","DOIUrl":null,"url":null,"abstract":"Shallow Parsing is an important step for many Natural Language Processing tasks. Although shallow parsing has a rich history for resource rich languages, it is not the case for most Indian languages. Shallow Parsing consists of POS Tagging and Chunking. Our study focuses on developing shallow parsers for Indian languages. As part of shallow parsing we included morph analysis as well. For the study, we first consolidated available shallow parsing corpora for 7 Indian Languages (Hindi, Kannada, Bangla, Malayalam, Marathi, Urdu, Telugu) for which treebanks are publicly available. We then trained models to achieve state of the art performance for shallow parsing in these languages for multiple domains. Since analyzing the performance of model predictions at sentence level is more realistic, we report the performance of these shallow parsers not only at the token level, but also at the sentence level. We also present machine learning techniques for multitask shallow parsing. Our experiments show that fine-tuned contextual embedding with multi-task learning improves the performance of multiple as well as individual shallow parsing tasks across different domains. We show the transfer learning capability of these models by creating shallow parsers (only with POS and Chunk) for Gujarati, Odia, and Punjabi for which no treebanks are available. As a part of this work, we will be releasing the Indian Languages Shallow Linguistic (ILSL) benchmarks for 10 Indian languages including both the major language families Indo-Aryan and Dravidian as common building blocks that can be used to evaluate and understand various linguistic phenomena found in Indian languages and how well newer approaches can tackle them.","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"155 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi Task Learning Based Shallow Parsing for Indian Languages\",\"authors\":\"Pruthwik Mishra, Vandan Mujadia\",\"doi\":\"10.1145/3664620\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Shallow Parsing is an important step for many Natural Language Processing tasks. Although shallow parsing has a rich history for resource rich languages, it is not the case for most Indian languages. Shallow Parsing consists of POS Tagging and Chunking. Our study focuses on developing shallow parsers for Indian languages. As part of shallow parsing we included morph analysis as well. For the study, we first consolidated available shallow parsing corpora for 7 Indian Languages (Hindi, Kannada, Bangla, Malayalam, Marathi, Urdu, Telugu) for which treebanks are publicly available. We then trained models to achieve state of the art performance for shallow parsing in these languages for multiple domains. Since analyzing the performance of model predictions at sentence level is more realistic, we report the performance of these shallow parsers not only at the token level, but also at the sentence level. We also present machine learning techniques for multitask shallow parsing. Our experiments show that fine-tuned contextual embedding with multi-task learning improves the performance of multiple as well as individual shallow parsing tasks across different domains. We show the transfer learning capability of these models by creating shallow parsers (only with POS and Chunk) for Gujarati, Odia, and Punjabi for which no treebanks are available. As a part of this work, we will be releasing the Indian Languages Shallow Linguistic (ILSL) benchmarks for 10 Indian languages including both the major language families Indo-Aryan and Dravidian as common building blocks that can be used to evaluate and understand various linguistic phenomena found in Indian languages and how well newer approaches can tackle them.\",\"PeriodicalId\":54312,\"journal\":{\"name\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"volume\":\"155 1\",\"pages\":\"\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-05-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3664620\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3664620","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

浅层解析是许多自然语言处理任务的重要步骤。虽然浅层解析在资源丰富的语言中有着悠久的历史，但在大多数印度语言中却并非如此。浅层解析包括 POS 标记和分块。我们的研究重点是为印度语言开发浅层解析器。作为浅层解析的一部分，我们还包括形态分析。在研究中，我们首先整合了 7 种印度语言（印地语、卡纳达语、孟加拉语、马拉雅拉姆语、马拉地语、乌尔都语和泰卢固语）的现有浅层解析语料库，这些语料库都是公开的树库。然后，我们对模型进行了训练，使这些语言在多个领域的浅层解析方面达到了最先进的性能。由于在句子层面分析模型预测的性能更为现实，我们不仅报告了这些浅层解析器在标记层面的性能，还报告了它们在句子层面的性能。我们还介绍了多任务浅层解析的机器学习技术。我们的实验表明，通过多任务学习对上下文嵌入进行微调，可以提高不同领域中多个以及单个浅层解析任务的性能。我们通过为古吉拉特语、奥迪亚语和旁遮普语创建浅层解析器（仅使用 POS 和 Chunk），展示了这些模型的迁移学习能力，因为这些语种没有树库可用。作为这项工作的一部分，我们将发布 10 种印度语言的印度语言浅层语言学（ILSL）基准，其中包括印度-雅利安语系和德拉威语系这两个主要语系，作为共同的构建模块，可用于评估和理解印度语言中发现的各种语言现象，以及新方法如何很好地解决这些问题。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Multi Task Learning Based Shallow Parsing for Indian Languages

Shallow Parsing is an important step for many Natural Language Processing tasks. Although shallow parsing has a rich history for resource rich languages, it is not the case for most Indian languages. Shallow Parsing consists of POS Tagging and Chunking. Our study focuses on developing shallow parsers for Indian languages. As part of shallow parsing we included morph analysis as well.

For the study, we first consolidated available shallow parsing corpora for 7 Indian Languages (Hindi, Kannada, Bangla, Malayalam, Marathi, Urdu, Telugu) for which treebanks are publicly available. We then trained models to achieve state of the art performance for shallow parsing in these languages for multiple domains. Since analyzing the performance of model predictions at sentence level is more realistic, we report the performance of these shallow parsers not only at the token level, but also at the sentence level. We also present machine learning techniques for multitask shallow parsing. Our experiments show that fine-tuned contextual embedding with multi-task learning improves the performance of multiple as well as individual shallow parsing tasks across different domains. We show the transfer learning capability of these models by creating shallow parsers (only with POS and Chunk) for Gujarati, Odia, and Punjabi for which no treebanks are available.

As a part of this work, we will be releasing the Indian Languages Shallow Linguistic (ILSL) benchmarks for 10 Indian languages including both the major language families Indo-Aryan and Dravidian as common building blocks that can be used to evaluate and understand various linguistic phenomena found in Indian languages and how well newer approaches can tackle them.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Asian and Low-Resource Language Information Processing Computer Science-General Computer Science

CiteScore

3.60

自引率

15.00%

发文量

241

期刊介绍： The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to: -Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc. -Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc. -Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition. -Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc. -Machine Translation involving Asian or low-resource languages. -Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc. -Information Extraction and Filtering: including automatic abstraction, user profiling, etc. -Speech processing: including text-to-speech synthesis and automatic speech recognition. -Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc. -Cross-lingual information processing involving Asian or low-resource languages. -Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.