
Proceedings of the Workshop on Computational Methods for Endangered Languages: Latest Articles

Integrating Automated Segmentation and Glossing into Documentary and Descriptive Linguistics
Pub Date : 2021-03-02 DOI: 10.33011/COMPUTEL.V1I.965
Sarah Moeller, Mans Hulden
Any attempt to integrate NLP systems into the study of endangered languages must take into consideration traditional approaches in both NLP and linguistics. This paper tests different strategies and workflows for morpheme segmentation and glossing that may affect the potential to integrate machine learning. Two experiments train Transformer models on documentary corpora from five under-documented languages. In the first experiment, one model learns segmentation and glossing as a joint step and another model learns the tasks as two sequential steps. We find the sequential approach yields somewhat better results. In the second experiment, one model is trained on surface-segmented data, where strings of text have simply been divided at morpheme boundaries. Another model is trained on canonically segmented data, the approach preferred by linguists, where abstract, underlying forms are represented. We find no clear advantage to either segmentation strategy and note that the difference between them disappears as training data increases. On average the models achieve an F1-score above 0.5, with the best models scoring 0.6 or above. An analysis of errors leads us to conclude that consistency during manual segmentation and glossing may facilitate higher scores in automatic evaluation, but that in reality the scores may be lowered when evaluated against the original data, because instances of annotator error in the original data are “corrected” by the model.
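To make the F1 figures above more concrete, here is a minimal sketch of one common way to score surface segmentation: precision, recall and F1 over predicted morpheme boundary positions. This is an assumption for illustration only; the abstract does not say how the authors computed F1, and the function names and the toy English words are hypothetical. The offsets only line up for surface segmentation, where the morphs concatenate back to the original word; canonical segmentation would need a different comparison.

```python
def boundaries(segmented, sep="-"):
    """Character offsets (in the unsegmented word) at which a morpheme boundary falls."""
    offsets, pos = set(), 0
    for morph in segmented.split(sep)[:-1]:
        pos += len(morph)
        offsets.add(pos)
    return offsets


def boundary_f1(gold, pred):
    """Micro-averaged boundary F1 over parallel lists of segmented words."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb)   # boundaries the model got right
        fp += len(pb - gb)   # boundaries the model invented
        fn += len(gb - pb)   # boundaries the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical, not from the paper's corpora).
gold = ["walk-ed", "un-happi-ness", "cat-s"]
pred = ["walk-ed", "un-happiness", "cats"]
print(round(boundary_f1(gold, pred), 3))  # 0.667
```

Under this scoring, a model that "corrects" an inconsistent gold annotation is penalised exactly like a model that makes a real mistake, which is the effect the error analysis describes.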
Citations: 3
The Usefulness of Bibles in Low-Resource Machine Translation
Pub Date : 2021-03-02 DOI: 10.33011/COMPUTEL.V1I.957
Ling Liu, Zach Ryan, Mans Hulden
Bibles are available in a wide range of languages, which provides valuable parallel text between languages since verses can be aligned accurately between all the different translations. How well can such data be utilized to train good neural machine translation (NMT) models? We are particularly interested in low-resource languages of high morphological complexity, and attempt to answer this question in the current work by training and evaluating Basque-English and Navajo-English MT models with the Transformer architecture. Different tokenization methods are applied, among which syllabification turns out to be most effective for Navajo and also works well for Basque. An additional data resource that is potentially available for endangered languages is a dictionary of either word or phrase translations, thanks to linguists’ work on language documentation. Could this data be leveraged to augment Bible data for better performance? We experiment with different ways to utilize dictionary data, and find that word-to-word mapping translation with a word-pair dictionary is more effective than low-resource techniques such as backtranslation or adding dictionary data directly into the training set, though neither backtranslation nor word-to-word mapping translation produces improvements over using Bible data alone in our experiments.
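As a rough illustration of what word-to-word mapping with a word-pair dictionary could look like, the sketch below replaces each token by its dictionary entry and copies unknown tokens through unchanged. This is one plausible reading of the technique, not the authors' implementation; the function name is hypothetical, and the toy dictionary is English to Spanish purely for illustration rather than Basque or Navajo data.

```python
def word_to_word(sentence, dictionary, unk=None):
    """Translate token by token using a word-pair dictionary.

    Tokens missing from the dictionary are copied through unchanged,
    or replaced by `unk` if one is given.
    """
    out = []
    for token in sentence.lower().split():
        fallback = token if unk is None else unk
        out.append(dictionary.get(token, fallback))
    return " ".join(out)


# Toy word-pair dictionary (illustrative only).
toy_dict = {"the": "la", "house": "casa", "water": "agua"}

print(word_to_word("The house", toy_dict))   # la casa
print(word_to_word("blue water", toy_dict))  # blue agua
```

Output of this kind ignores word order and morphology, which may be part of why such synthetic data did not improve on Bible data alone for morphologically complex languages.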
Citations: 8
Expanding the JHU Bible Corpus for Machine Translation of the Indigenous Languages of North America
Pub Date : 2021-03-02 DOI: 10.33011/COMPUTEL.V1I.949
Garrett Nicolai, Edith Coates, M. Zhang, Miikka Silfverberg
We present an extension to the JHU Bible corpus, collecting and normalizing more than thirty Bible translations in thirty Indigenous languages of North America. These exhibit a wide variety of interesting syntactic and morphological phenomena that are understudied in the computational community. Neural translation experiments demonstrate significant gains obtained through cross-lingual, many-to-many translation, with improvements of up to 8.4 BLEU over monolingual models for extremely low-resource languages.
Citations: 4
Developing a Shared Task for Speech Processing on Endangered Languages
Pub Date : 2021-03-02 DOI: 10.33011/COMPUTEL.V1I.967
Gina-Anne Levow, Emily Ahn, Emily M. Bender
Advances in speech and language processing have enabled the creation of applications that could, in principle, accelerate the process of language documentation, as speech communities and linguists work on urgent language documentation and reclamation projects. However, such systems have yet to make a significant impact on language documentation, as resource requirements limit the broad applicability of these new techniques. We aim to exploit the framework of shared tasks to focus the technology research community on tasks which address key pain points in language documentation. Here we present initial steps in the implementation of these new shared tasks, through the creation of data sets drawn from endangered language repositories and baseline systems to perform segmentation and speaker labeling of these audio recordings—important enabling steps in the documentation process. This paper motivates these tasks with a use case, describes data set curation and baseline systems, and presents results on this data. We then highlight the challenges and ethical considerations in developing these speech processing tools and tasks to support endangered language documentation.
Citations: 8
LARA in the Service of Revivalistics and Documentary Linguistics: Community Engagement and Endangered Languages
Pub Date : 2021-03-02 DOI: 10.33011/COMPUTEL.V1I.953
G. Zuckermann, Sigurður Vigfússon, Manny Rayner, Neasa Ní Chiaráin, N. Ivanova, Hanieh Habibi, Branislav Bédi
We argue that LARA, a new web platform that supports easy conversion of text into an online multimedia form designed to support non-native readers, is a good match to the task of creating high-quality resources useful for languages in the revivalistics spectrum. We illustrate with initial case studies in three widely different endangered/revival languages: Irish (Gaelic); Icelandic Sign Language (ÍTM); and Barngarla, a reclaimed Australian Aboriginal language. The exposition is presented from a language community perspective. Links are given to examples of LARA resources constructed for each language.
Citations: 8
Shared Digital Resource Application within Insular Scandinavian
Pub Date : 2021-03-02 DOI: 10.33011/COMPUTEL.V1I.961
Hinrik Hafsteinsson, A. Ingason
We describe the application of language technology methods and resources devised for Icelandic, a North Germanic language with about 300,000 speakers, in digital language resource creation for Faroese, a North Germanic language with about 50,000 speakers. The current project encompassed the development of a dedicated, high-accuracy part-of-speech (PoS) tagging solution for Faroese. To achieve this, a state-of-the-art neural PoS tagger for Icelandic, ABLTagger, was trained on a 100,000-word PoS-tagged corpus for Faroese, standardised with methods previously applied to Icelandic corpora. This tagger was supplemented with a novel Experimental Database of Faroese Inflection (EDFM), which is a lexicon containing morphological information on 67,488 Faroese words with about one million inflectional forms. This approach produced a PoS-tagging model for Faroese which achieves 91.40% overall accuracy when evaluated with 10-fold cross-validation, currently the highest accuracy for a dedicated Faroese PoS tagger. The products of this project are made available for use in further research in Faroese language technology.
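For readers unfamiliar with the evaluation protocol, the following is a minimal sketch of 10-fold cross-validation: the corpus is shuffled, split into ten folds, and each fold is held out once while a model is trained on the other nine. The helper names, the fixed seed and the averaging of per-fold accuracy are assumptions for illustration; the abstract does not describe how the folds were drawn or how the 91.40% figure was aggregated.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs; each item appears in exactly one test fold."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for held_out in range(k):
        test = [items[i] for i in folds[held_out]]
        train = [items[i]
                 for j, fold in enumerate(folds) if j != held_out
                 for i in fold]
        yield train, test


def cross_validate(items, train_fn, accuracy_fn, k=10):
    """Mean accuracy over k folds; train_fn and accuracy_fn are caller-supplied."""
    scores = []
    for train, test in k_fold_splits(items, k):
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / len(scores)
```

In a tagging setting, `items` would typically be whole sentences rather than single tokens, so that no sentence is split between training and test data.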
Citations: 1
Fossicking in dominant language teaching: Javanese and Indonesian ‘low’ varieties in language teaching resources
Pub Date : 2021-03-02 DOI: 10.33011/COMPUTEL.V1I.955
Zara Maxwell-Smith
‘Low’ and ‘high’ varieties of Indonesian and other languages of Indonesia are poorly resourced for developing human language technologies. Many languages spoken in Indonesia, even those with very large speaker populations, such as Javanese (over 80 million), are thought to be threatened languages. The teaching of Indonesian language focuses on the prestige variety which forms part of the unusual diglossia found in many parts of Indonesia. We developed a publicly available pipeline to scrape and clean text from the PDFs of a classic Indonesian textbook, The Indonesian Way, creating a corpus. Using the corpus and curated wordlists from a number of lexicons I searched for instances of non-prestige varieties of Indonesian, finding that they play a limited, secondary role to formal Indonesian in this textbook. References to other languages used in Indonesia are usually made as a passing comment. These methods help to determine how text teaching resources relate to and influence the language politics of diglossia and the many languages of Indonesia.
Citations: 1
The Relevance of the Source Language in Transfer Learning for ASR
Pub Date : 2021-03-02 DOI: 10.33011/COMPUTEL.V1I.959
Nils Hjortnaes, N. Partanen, Michael Rießler, Francis M. Tyers
This study presents new experiments on Zyrian Komi speech recognition. We use DeepSpeech to train ASR models from a language documentation corpus that contains both contemporary and archival recordings. Earlier studies have shown that transfer learning from English and using a domain-matching Komi language model both improve the CER and WER. In this study we experiment with transfer learning from a more relevant source language, Russian, and with including Russian text in the language model construction. The motivation for this is that Russian and Komi are contemporary contact languages, and Russian is regularly present in the corpus. We found that, despite the close contact between Russian and Komi, the larger English speech corpus yielded better performance when English was used as the source language. Additionally, we can report that an update to the DeepSpeech version alone improved the CER by 3.9% compared with the earlier studies, which is an important step in the development of Komi ASR.
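The CER and WER metrics mentioned above are both normalised edit distances, computed over characters and over words respectively. The sketch below shows the standard definition using Levenshtein distance; the function names are hypothetical, references are assumed to be non-empty, and the study's own text normalisation (casing, punctuation) is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(cer("abcd", "abxd"))  # 0.25
```

Because both rates divide by the reference length, they can exceed 1.0 when the hypothesis contains many insertions.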
Citations: 6
Computational Analysis versus Human Intuition: A Critical Comparison of Vector Semantics with Manual Semantic Classification in the Context of Plains Cree
Pub Date : 2021-03-02 DOI: 10.33011/COMPUTEL.V1I.971
Daniel Dacanay, Atticus Harrigan, Antti Arppe
A persistent challenge in the creation of semantically classified dictionaries and lexical resources is the lengthy and expensive process of manual semantic classification, a hindrance which can make adequate semantic resources unattainable for under-resourced language communities. We explore here an alternative to manual classification using a vector semantic method, which, although not yet at the level of human sophistication, can provide usable first-pass semantic classifications in a fraction of the time. As a case example, we use a dictionary in Plains Cree (ISO: crk, Algonquian, Western Canada and United States)
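One simple form that a vector semantic first-pass classification can take is nearest-centroid assignment: each semantic class is represented by the mean of the vectors of its known members, and a new entry is assigned to the class whose centroid is closest by cosine similarity. This is a generic sketch under that assumption, not the method of the paper, whose abstract is truncated here; the class labels and two-dimensional vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(vector, labelled):
    """Return the label whose class centroid is most similar to `vector`."""
    centroids = {label: centroid(vs) for label, vs in labelled.items()}
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Toy vectors (hypothetical); real word embeddings have hundreds of dimensions.
labelled = {
    "animal": [[1.0, 0.1], [0.9, 0.2]],
    "tool": [[0.1, 1.0], [0.2, 0.8]],
}
print(classify([0.8, 0.3], labelled))  # animal
```

A human annotator can then review only the assignments with low similarity scores instead of classifying every entry from scratch.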
Citations: 4
Applying Support Vector Machines to POS tagging of the Ainu Language
Pub Date : 2019-02-26 DOI: 10.33011/computel.v2i.449
Karol Nowakowski, M. Ptaszynski, Fumito Masui, Yoshio Momouchi
We describe our attempt to apply a state-of-the-art sequential tagger – SVMTool – to the task of automatic part-of-speech annotation of the Ainu language, a critically endangered language isolate spoken by the native inhabitants of northern Japan. Our experiments indicated that it performs better than the custom system proposed in previous research (POST-AL), especially when applied to out-of-domain data. The biggest advantage of the model trained using SVMTool over the POST-AL tagger is its ability to guess part-of-speech tags for out-of-vocabulary (OoV) words, with an accuracy of up to 63%.
Citations: 3