上下文中的轻量级语素标记:使用结构化语言表示来支持语言文档上下文的语言分析

Special Interest Group on Computational Morphology and Phonology Workshop Pub Date : 1900-01-01 DOI:10.18653/v1/2023.sigmorphon-1.9

Bhargav Shandilya, Alexis Palmer

{"title":"上下文中的轻量级语素标记:使用结构化语言表示来支持语言文档上下文的语言分析","authors":"Bhargav Shandilya, Alexis Palmer","doi":"10.18653/v1/2023.sigmorphon-1.9","DOIUrl":null,"url":null,"abstract":"Linguistic analysis is a core task in the process of documenting, analyzing, and describing endangered and less-studied languages. In addition to providing insight into the properties of the language being studied, having tools to automatically label words in a language for grammatical category and morphological features can support a range of applications useful for language pedagogy and revitalization. At the same time, most modern NLP methods for these tasks require both large amounts of data in the language and compute costs well beyond the capacity of most research groups and language communities. In this paper, we present a gloss-to-gloss (g2g) model for linguistic analysis (specifically, morphological analysis and part-of-speech tagging) that is lightweight in terms of both data requirements and computational expense. The model is designed for the interlinear glossed text (IGT) format, in which we expect the source text of a sentence in a low-resource language, a translation of that sentence into a language of wider communication, and a detailed glossing of the morphological properties of each word in the sentence. We first produce silver standard parallel glossed data by automatically labeling the high-resource translation. The model then learns to transform source language morphological labels into output labels for the target language, mediated by a structured linguistic representation layer. We test the model on both low-resource and high-resource languages, and find that our simple CNN-based model achieves comparable performance to a state-of-the-art transformer-based model, at a fraction of the computational cost.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"161 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Lightweight morpheme labeling in context: Using structured linguistic representations to support linguistic analysis for the language documentation context\",\"authors\":\"Bhargav Shandilya, Alexis Palmer\",\"doi\":\"10.18653/v1/2023.sigmorphon-1.9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Linguistic analysis is a core task in the process of documenting, analyzing, and describing endangered and less-studied languages. In addition to providing insight into the properties of the language being studied, having tools to automatically label words in a language for grammatical category and morphological features can support a range of applications useful for language pedagogy and revitalization. At the same time, most modern NLP methods for these tasks require both large amounts of data in the language and compute costs well beyond the capacity of most research groups and language communities. In this paper, we present a gloss-to-gloss (g2g) model for linguistic analysis (specifically, morphological analysis and part-of-speech tagging) that is lightweight in terms of both data requirements and computational expense. The model is designed for the interlinear glossed text (IGT) format, in which we expect the source text of a sentence in a low-resource language, a translation of that sentence into a language of wider communication, and a detailed glossing of the morphological properties of each word in the sentence. We first produce silver standard parallel glossed data by automatically labeling the high-resource translation. The model then learns to transform source language morphological labels into output labels for the target language, mediated by a structured linguistic representation layer. We test the model on both low-resource and high-resource languages, and find that our simple CNN-based model achieves comparable performance to a state-of-the-art transformer-based model, at a fraction of the computational cost.\",\"PeriodicalId\":186158,\"journal\":{\"name\":\"Special Interest Group on Computational Morphology and Phonology Workshop\",\"volume\":\"161 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Special Interest Group on Computational Morphology and Phonology Workshop\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2023.sigmorphon-1.9\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Special Interest Group on Computational Morphology and Phonology Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2023.sigmorphon-1.9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

语言分析是记录、分析和描述濒危语言和研究较少的语言过程中的核心任务。除了提供对所研究语言属性的深入了解之外，拥有自动标记语言中语法类别和形态特征的工具可以支持一系列对语言教学和振兴有用的应用。同时，这些任务的大多数现代NLP方法都需要大量的语言数据和计算成本，远远超出了大多数研究小组和语言社区的能力。在本文中，我们提出了一种用于语言分析(特别是形态学分析和词性标注)的gloss-to-gloss (g2g)模型，该模型在数据需求和计算费用方面都很轻量级。该模型是为行间注释文本(IGT)格式设计的，在这种格式中，我们期望使用低资源语言的句子的源文本，将该句子翻译成更广泛交流的语言，并对句子中每个单词的形态学属性进行详细的注释。我们首先通过自动标注高资源翻译生成银标准并行擦亮数据。然后，该模型学习将源语言形态标签转换为目标语言的输出标签，并通过结构化的语言表示层进行中介。我们在低资源语言和高资源语言上测试了模型，发现我们简单的基于cnn的模型达到了与最先进的基于变压器的模型相当的性能，而计算成本只是一小部分。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Lightweight morpheme labeling in context: Using structured linguistic representations to support linguistic analysis for the language documentation context

Linguistic analysis is a core task in the process of documenting, analyzing, and describing endangered and less-studied languages. In addition to providing insight into the properties of the language being studied, having tools to automatically label words in a language for grammatical category and morphological features can support a range of applications useful for language pedagogy and revitalization. At the same time, most modern NLP methods for these tasks require both large amounts of data in the language and compute costs well beyond the capacity of most research groups and language communities. In this paper, we present a gloss-to-gloss (g2g) model for linguistic analysis (specifically, morphological analysis and part-of-speech tagging) that is lightweight in terms of both data requirements and computational expense. The model is designed for the interlinear glossed text (IGT) format, in which we expect the source text of a sentence in a low-resource language, a translation of that sentence into a language of wider communication, and a detailed glossing of the morphological properties of each word in the sentence. We first produce silver standard parallel glossed data by automatically labeling the high-resource translation. The model then learns to transform source language morphological labels into output labels for the target language, mediated by a structured linguistic representation layer. We test the model on both low-resource and high-resource languages, and find that our simple CNN-based model achieves comparable performance to a state-of-the-art transformer-based model, at a fraction of the computational cost.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Special Interest Group on Computational Morphology and Phonology Workshop

自引率

0.00%

发文量