
Special Interest Group on Computational Morphology and Phonology Workshop: Latest Publications

SigMoreFun Submission to the SIGMORPHON Shared Task on Interlinear Glossing
DOI: 10.18653/v1/2023.sigmorphon-1.22
Taiqi He, Lindia Tjuatja, Nathaniel R. Robinson, Shinji Watanabe, David R. Mortensen, Graham Neubig, L. Levin
In our submission to the SIGMORPHON 2023 Shared Task on interlinear glossing (IGT), we explore approaches to data augmentation and modeling across seven low-resource languages. For data augmentation, we explore two approaches: creating artificial data from the provided training data and utilizing existing IGT resources in other languages. On the modeling side, we test an enhanced version of the provided token classification baseline as well as a pretrained multilingual seq2seq model. Additionally, we apply post-correction using a dictionary for Gitksan, the language with the smallest amount of data. We find that our token classification models are the best performing, with the highest word-level accuracy for Arapaho and highest morpheme-level accuracy for Gitksan out of all submissions. We also show that data augmentation is an effective strategy, though applying artificial data pretraining has very different effects across both models tested.
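A minimal sketch of the dictionary-based post-correction idea mentioned for Gitksan: low-confidence gloss predictions are replaced by dictionary entries when the morpheme is listed. The morphemes, glosses, and dictionary here are invented toy data, not the authors' actual system.

```python
def post_correct(glosses, morphemes, dictionary):
    """Replace unknown gloss predictions with dictionary entries when available."""
    corrected = []
    for gloss, morpheme in zip(glosses, morphemes):
        if gloss == "???" and morpheme in dictionary:
            corrected.append(dictionary[morpheme])  # dictionary overrides the model
        else:
            corrected.append(gloss)
    return corrected

# Hypothetical model output with one unknown gloss, and a toy dictionary.
predicted = ["dog", "???", "PL"]
morphemes = ["gyat", "hlgu", "-xw"]
dictionary = {"hlgu": "small"}
print(post_correct(predicted, morphemes, dictionary))  # ['dog', 'small', 'PL']
```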
Citations: 2
A future for universal grapheme-phoneme transduction modeling with neuralized finite-state transducers
DOI: 10.18653/v1/2023.sigmorphon-1.30
Chu-Cheng Lin
We propose a universal grapheme-phoneme transduction model using neuralized finite-state transducers. Many computational models of grapheme-phoneme transduction nowadays are based on the (autoregressive) sequence-to-sequence string transduction paradigm. While such models have achieved state-of-the-art performance, they suffer from theoretical limitations of autoregressive models. On the other hand, neuralized finite-state transducers (NFSTs) have shown promising results on various string transduction tasks. NFSTs can be seen as a generalization of weighted finite-state transducers (WFSTs), and can be seen as pairs of a featurized finite-state machine (‘marked finite-state transducer’ or MFST in NFST terminology), and a string scoring function. Instead of taking a product of local contextual feature weights on FST arcs, NFSTs can employ arbitrary scoring functions to weight global contextual features of a string transduction, and therefore break the Markov property. Furthermore, NFSTs can be formally shown to be more expressive than (autoregressive) seq2seq models. Empirically, joint grapheme-phoneme transduction NFSTs have consistently outperformed vanilla seq2seq models on grapheme-to-phoneme and phoneme-to-grapheme transduction tasks for English. Furthermore, they provide interpretable aligned string transductions, thanks to their finite-state machine component. In this talk, we propose a multilingual extension of the joint grapheme-phoneme NFST. We achieve this goal by modeling typological and phylogenetic features of languages and scripts as optional latent variables using a finite-state machine. The result is a versatile grapheme-phoneme transduction model: in addition to standard monolingual and multilingual transduction, the proposed multilingual NFST can also be used in various controlled generation scenarios, such as phoneme-to-grapheme transduction of an unseen language-script pair. We also plan to release an NFST software package.
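A toy numerical contrast between the two scoring regimes the abstract describes: a WFST's log-score is a sum of independent local arc weights, while an NFST may add an arbitrary function of the whole marked path, breaking the Markov property. Arc weights and the global feature below are invented for illustration.

```python
def wfst_score(arc_weights):
    # WFST: log-score decomposes as a sum of local, context-independent arc weights.
    return sum(arc_weights)

def nfst_score(arc_weights, mark_string):
    # NFST: an arbitrary scoring function of the whole mark string may add
    # non-Markovian global features, e.g. rewarding a doubled mark symbol.
    global_feature = 1.0 if "aa" in mark_string else 0.0
    return sum(arc_weights) + global_feature

arcs = [-0.5, -1.25, -0.25]
print(wfst_score(arcs))           # -2.0
print(nfst_score(arcs, "aba"))    # -2.0 (global feature does not fire)
print(nfst_score(arcs, "aab"))    # -1.0 (global feature fires)
```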
Citations: 0
Glossy Bytes: Neural Glossing using Subword Encoding
DOI: 10.18653/v1/2023.sigmorphon-1.24
Ziggy Cross, Michelle Yun, Ananya Apparaju, Jata MacCabe, Garrett Nicolai, Miikka Silfverberg
This paper presents several neural subword-modelling-based approaches to interlinear glossing for seven under-resourced languages, as part of the 2023 SIGMORPHON shared task on interlinear glossing. We experiment with various augmentation and tokenization strategies for both the open and closed tracks of data. We find that while byte-level models may perform well with greater amounts of data, character-based approaches remain competitive in lower-resource settings.
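The byte-level vs. character-level distinction can be made concrete with a small example: a single non-ASCII character becomes multiple byte symbols under UTF-8, so the two tokenizations yield different sequence lengths and vocabularies. The example word is arbitrary and makes no claim about the shared-task languages.

```python
def char_tokens(word):
    # Character-level segmentation: one symbol per Unicode code point.
    return list(word)

def byte_tokens(word):
    # Byte-level segmentation: one symbol per UTF-8 byte, shown in hex.
    return [f"{b:02x}" for b in word.encode("utf-8")]

word = "ʔiit"  # the glottal stop U+0294 encodes as two UTF-8 bytes
print(char_tokens(word))  # ['ʔ', 'i', 'i', 't'] — 4 symbols
print(byte_tokens(word))  # ['ca', '94', '69', '69', '74'] — 5 symbols
```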
Citations: 1
Investigating Phoneme Similarity with Artificially Accented Speech
DOI: 10.18653/v1/2023.sigmorphon-1.6
Margot Masson, Julie Carson-Berndsen
While the deep learning revolution has led to significant performance improvements in speech recognition, accented speech remains a challenge. Current approaches to this challenge typically do not seek to understand and provide explanations for the variations of accented speech, whether they stem from native regional variation or non-native error patterns. This paper seeks to address non-native speaker variations from both a knowledge-based and a data-driven perspective. We propose to approximate non-native accented-speech pronunciation patterns by the means of two approaches: based on phonetic and phonological knowledge on the one hand and inferred from a text-to-speech system on the other. Artificial speech is then generated with a range of variants which have been captured in confusion matrices representing phoneme similarities. We then show that non-native accent confusions actually propagate to the transcription from the ASR, thus suggesting that the inference of accent specific phoneme confusions is achievable from artificial speech.
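A minimal sketch of the confusion-matrix idea the abstract relies on: counting phoneme substitutions from aligned reference/hypothesis sequences. The alignments here are toy, equal-length pairs (real data would first need edit-distance alignment), and the substitutions are invented examples of non-native error patterns.

```python
from collections import Counter

def confusion_counts(aligned_pairs):
    """Count (reference, hypothesis) phoneme pairs from pre-aligned sequences."""
    counts = Counter()
    for ref_seq, hyp_seq in aligned_pairs:
        for r, h in zip(ref_seq, hyp_seq):
            counts[(r, h)] += 1
    return counts

pairs = [
    (["θ", "ɪ", "ŋ"], ["t", "ɪ", "ŋ"]),    # "thing" with a th -> t substitution
    (["θ", "r", "iː"], ["s", "r", "iː"]),  # "three" with a th -> s substitution
]
counts = confusion_counts(pairs)
print(counts[("θ", "t")], counts[("θ", "s")])  # 1 1
```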
Citations: 0
A Multilinear Approach to the Unsupervised Learning of Morphology
DOI: 10.18653/v1/W16-2020
A. Meyer, Markus Dickinson
We present a novel approach to the unsupervised learning of morphology. In particular, we use a Multiple Cause Mixture Model (MCMM), a type of autoencoder network consisting of two node layers—hidden and surface—and a matrix of weights connecting hidden nodes to surface nodes. We show that an MCMM shares crucial graphical properties with autosegmental morphology. We argue on the basis of this graphical similarity that our approach is theoretically sound. Experimental results on Hebrew data show that this theoretical soundness is borne out in practice.
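The "multiple cause" structure can be sketched numerically: each surface node is activated if any active hidden cause predicts it. The noisy-OR combination below is one common soft-disjunction choice, used here purely for illustration; the paper's exact mixing function and weights may differ.

```python
def surface_activation(hidden, weights):
    """Noisy-OR activation: surface node j fires unless no cause predicts it.

    hidden:  list of hidden-node activations in [0, 1]
    weights: weights[i][j] connects hidden node i to surface node j
    """
    n_surface = len(weights[0])
    out = []
    for j in range(n_surface):
        prob_off = 1.0
        for i, h in enumerate(hidden):
            prob_off *= 1.0 - h * weights[i][j]
        out.append(1.0 - prob_off)
    return out

# Two hidden causes (e.g. a root and a vocalic pattern, in the autosegmental
# spirit) jointly explain three surface nodes; values are toy data.
hidden = [1.0, 1.0]
weights = [[1.0, 0.0, 1.0],   # cause 0 predicts surface nodes 0 and 2
           [0.0, 1.0, 0.0]]   # cause 1 predicts surface node 1
print(surface_activation(hidden, weights))  # [1.0, 1.0, 1.0]
```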
Citations: 0
Letter Sequence Labeling for Compound Splitting
DOI: 10.18653/v1/W16-2012
Jianqiang Ma, Verena Henrich, E. Hinrichs
For languages such as German, where compounds occur frequently and are written as single tokens, a wide variety of NLP applications benefit from recognizing and splitting compounds. As the traditional word-frequency-based approach to compound splitting has several drawbacks, this paper introduces a letter sequence labeling approach, which can utilize rich word-form features to build discriminative learning models that are optimized for splitting. Experiments show that the proposed method significantly outperforms state-of-the-art compound splitters.
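Casting splitting as letter sequence labeling can be illustrated with a BIO-style scheme: label "B" marks a letter that begins a new component. The labels below are hand-written for illustration, standing in for a trained model's output; the exact label inventory in the paper may differ.

```python
def split_by_labels(word, labels):
    """Recover compound components from per-letter labels ('B' starts a component)."""
    parts, current = [], ""
    for ch, lab in zip(word, labels):
        if lab == "B" and current:
            parts.append(current)
            current = ""
        current += ch
    parts.append(current)
    return parts

word = "Eisenbahn"  # German: Eisen ("iron") + Bahn ("track")
labels = ["B", "I", "I", "I", "I", "B", "I", "I", "I"]
print(split_by_labels(word, labels))  # ['Eisen', 'bahn']
```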
Citations: 13
Joint Learning Model for Low-Resource Agglutinative Language Morphological Tagging
DOI: 10.18653/v1/2023.sigmorphon-1.4
Gulinigeer Abudouwaili, Kahaerjiang Abiderexiti, Nian Yi, Aishan Wumaier
Due to the lack of data resources, rule-based methods or transfer learning are mainly used in the morphological tagging of low-resource languages. However, these methods require expert knowledge, ignore contextual features, and suffer from error propagation. Therefore, we propose a joint morphological tagger for low-resource agglutinative languages to alleviate the above challenges. First, we represent the contextual input with multi-dimensional features of agglutinative words. Second, joint training reduces the direct impact of part-of-speech errors on morphological features and increases the indirect influence between the two types of labels through a fusion mechanism. Finally, our model separately predicts part-of-speech and morphological features. Part-of-speech tagging is regarded as sequence tagging. When predicting morphological features, two-label adjacency graphs are dynamically reconstructed by integrating multilingual global features and monolingual local features. Then, a graph convolution network is used to learn the higher-order intersection of labels. A series of experiments shows that the proposed model outperforms the comparison models.
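The fusion idea can be sketched structurally: the morphology head sees the POS *distribution* rather than a hard POS decision, so a POS error influences morphology only softly. All dimensions and weights below are toy values, not the paper's architecture.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fused_morph_features(word_repr, pos_scores, fusion_weight):
    """Concatenate a shared word representation with the weighted POS distribution."""
    pos_dist = softmax(pos_scores)
    return word_repr + [fusion_weight * p for p in pos_dist]

# A 2-dim shared representation fused with a 3-class POS distribution.
features = fused_morph_features([0.2, 0.7], [2.0, 0.0, 0.0], fusion_weight=0.5)
print(len(features))  # 5 = 2 shared dims + 3 POS classes
```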
Citations: 0
Generalized Glossing Guidelines: An Explicit, Human- and Machine-Readable, Item-and-Process Convention for Morphological Annotation
DOI: 10.18653/v1/2023.sigmorphon-1.7
David R. Mortensen, Ela Gulsen, Taiqi He, Nathaniel R. Robinson, Jonathan D. Amith, Lindia Tjuatja, L. Levin
Interlinear glossing provides a vital type of morphosyntactic annotation, both for linguists and language revitalists, and numerous conventions exist for representing it formally and computationally. Some of these formats are human readable; others are machine readable. Some are easy to edit with general-purpose tools. Few represent non-concatenative processes like infixation, reduplication, mutation, truncation, and tonal overwriting in a consistent and formally rigorous way (on par with affixation). We propose an annotation convention, the Generalized Glossing Guidelines (GGG), that combines all of these positive properties using an Item-and-Process (IP) framework. We describe the format, demonstrate its linguistic adequacy, and compare it with two other interlinear glossed text annotation schemes.
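To make the Item-and-Process idea concrete, here is a non-concatenative example (German umlaut plural) modeled as a process applied to a stem rather than as a segmentable affix. The data structure is invented for illustration and is not the actual GGG notation; the backslash output follows the Leipzig convention for glossing morphophonological change.

```python
# Hypothetical IP-style analysis: the plural is a mutation process, not an affix.
analysis = {
    "form": "Väter",  # plural of German 'Vater' ("father")
    "stem": {"item": "Vater", "gloss": "father"},
    "processes": [
        {"type": "mutation", "change": "a -> ä", "gloss": "PL"},
    ],
}

def surface_gloss(analysis):
    """Render the analysis as a linear gloss, joining processes with backslashes."""
    parts = [analysis["stem"]["gloss"]] + [p["gloss"] for p in analysis["processes"]]
    return "\\".join(parts)

print(surface_gloss(analysis))  # father\PL
```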
Citations: 0
Translating a low-resource language using GPT-3 and a human-readable dictionary
DOI: 10.18653/v1/2023.sigmorphon-1.2
M. Elsner, Jordan Needle
We investigate how well words in the polysynthetic language Inuktitut can be translated by combining dictionary definitions, without use of a neural machine translation model trained on parallel text. Such a translation system would allow natural language technology to benefit from resources designed for community use in a language revitalization or education program, rather than requiring a separate parallel corpus. We show that the text-to-text generation capabilities of GPT-3 allow it to perform this task with BLEU scores of up to 18.5. We investigate prompting GPT-3 to provide multiple translations, which can help slightly, and providing it with grammar information, which is mostly ineffective. Finally, we test GPT-3’s ability to derive morpheme definitions from whole-word translations, but find this process is prone to errors including hallucinations.
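A sketch of assembling a translation prompt from dictionary definitions of a word's morphemes, in the spirit of the approach described. The Inuktitut segmentation and glosses below are illustrative toy data, the prompt wording is invented, and no model API is called.

```python
def build_prompt(word, morpheme_glosses):
    """Combine per-morpheme dictionary glosses into a translation prompt."""
    gloss_list = "; ".join(f"'{m}' means '{g}'" for m, g in morpheme_glosses)
    return (f"The word '{word}' contains these parts: {gloss_list}. "
            "Give a fluent English translation of the whole word.")

# Toy segmentation: illu ("house") + -mut (allative case, "to").
prompt = build_prompt("illumut", [("illu", "house"), ("-mut", "to (allative)")])
print(prompt)
```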
Citations: 0
Lightweight morpheme labeling in context: Using structured linguistic representations to support linguistic analysis for the language documentation context
DOI: 10.18653/v1/2023.sigmorphon-1.9
Bhargav Shandilya, Alexis Palmer
Linguistic analysis is a core task in the process of documenting, analyzing, and describing endangered and less-studied languages. In addition to providing insight into the properties of the language being studied, having tools to automatically label words in a language for grammatical category and morphological features can support a range of applications useful for language pedagogy and revitalization. At the same time, most modern NLP methods for these tasks require both large amounts of data in the language and computational resources well beyond the capacity of most research groups and language communities. In this paper, we present a gloss-to-gloss (g2g) model for linguistic analysis (specifically, morphological analysis and part-of-speech tagging) that is lightweight in terms of both data requirements and computational expense. The model is designed for the interlinear glossed text (IGT) format, in which we expect the source text of a sentence in a low-resource language, a translation of that sentence into a language of wider communication, and a detailed glossing of the morphological properties of each word in the sentence. We first produce silver standard parallel glossed data by automatically labeling the high-resource translation. The model then learns to transform source language morphological labels into output labels for the target language, mediated by a structured linguistic representation layer. We test the model on both low-resource and high-resource languages, and find that our simple CNN-based model achieves comparable performance to a state-of-the-art transformer-based model, at a fraction of the computational cost.
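A minimal sketch of the silver-standard step: tag the high-resource translation and project labels through a word alignment onto the source tokens. The tagger is a toy lookup table and the alignment is hand-written, standing in for the real tagging and alignment components.

```python
# Toy stand-in for a high-resource (English) morphological tagger.
TOY_EN_TAGS = {"the": "DET", "dogs": "N;PL", "sleep": "V;PRS"}

def silver_glosses(source_tokens, alignment, en_tokens):
    """Project English tags onto source tokens via a source->English alignment."""
    glosses = []
    for i, _tok in enumerate(source_tokens):
        en_word = en_tokens[alignment[i]]
        glosses.append(TOY_EN_TAGS.get(en_word, "UNK"))
    return glosses

src = ["perro-s", "duerme-n"]   # toy source-language tokens
en = ["the", "dogs", "sleep"]   # the high-resource translation
alignment = {0: 1, 1: 2}        # source index -> English index
print(silver_glosses(src, alignment, en))  # ['N;PL', 'V;PRS']
```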
语言分析是记录、分析和描述濒危语言和研究较少的语言过程中的核心任务。除了提供对所研究语言属性的深入了解之外,拥有自动标记语言中语法类别和形态特征的工具可以支持一系列对语言教学和振兴有用的应用。同时,这些任务的大多数现代NLP方法都需要大量的语言数据和计算成本,远远超出了大多数研究小组和语言社区的能力。在本文中,我们提出了一种用于语言分析(特别是形态学分析和词性标注)的gloss-to-gloss (g2g)模型,该模型在数据需求和计算费用方面都很轻量级。该模型是为行间注释文本(IGT)格式设计的,在这种格式中,我们期望使用低资源语言的句子的源文本,将该句子翻译成更广泛交流的语言,并对句子中每个单词的形态学属性进行详细的注释。我们首先通过自动标注高资源翻译生成银标准并行擦亮数据。然后,该模型学习将源语言形态标签转换为目标语言的输出标签,并通过结构化的语言表示层进行中介。我们在低资源语言和高资源语言上测试了模型,发现我们简单的基于cnn的模型达到了与最先进的基于变压器的模型相当的性能,而计算成本只是一小部分。
{"title":"Lightweight morpheme labeling in context: Using structured linguistic representations to support linguistic analysis for the language documentation context","authors":"Bhargav Shandilya, Alexis Palmer","doi":"10.18653/v1/2023.sigmorphon-1.9","DOIUrl":"https://doi.org/10.18653/v1/2023.sigmorphon-1.9","url":null,"abstract":"Linguistic analysis is a core task in the process of documenting, analyzing, and describing endangered and less-studied languages. In addition to providing insight into the properties of the language being studied, having tools to automatically label words in a language for grammatical category and morphological features can support a range of applications useful for language pedagogy and revitalization. At the same time, most modern NLP methods for these tasks require both large amounts of data in the language and compute costs well beyond the capacity of most research groups and language communities. In this paper, we present a gloss-to-gloss (g2g) model for linguistic analysis (specifically, morphological analysis and part-of-speech tagging) that is lightweight in terms of both data requirements and computational expense. The model is designed for the interlinear glossed text (IGT) format, in which we expect the source text of a sentence in a low-resource language, a translation of that sentence into a language of wider communication, and a detailed glossing of the morphological properties of each word in the sentence. We first produce silver standard parallel glossed data by automatically labeling the high-resource translation. The model then learns to transform source language morphological labels into output labels for the target language, mediated by a structured linguistic representation layer. We test the model on both low-resource and high-resource languages, and find that our simple CNN-based model achieves comparable performance to a state-of-the-art transformer-based model, at a fraction of the computational cost.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"161 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134260124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Journal
Special Interest Group on Computational Morphology and Phonology Workshop