A Domain Specific Parallel Corpus and Enhanced English-Assamese Neural Machine Translation

Sahinur Rahman Laskar, Riyanka Manna, Partha Pakray, Sivaji Bandyopadhyay
Journal: Computación Y Sistemas
Publication date: 2022-12-25
Publication type: Journal Article
DOI: 10.13053/cys-26-4-4423
URL: https://doi.org/10.13053/cys-26-4-4423
Citations: 1

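Abstract

Machine translation is the automatic translation of text from one natural language to another. Neural machine translation (NMT) is a widely accepted technique within the corpus-based machine translation approach. However, it requires an adequate amount of training data, and domain-wise parallel corpora are needed to improve translation performance and coverage across domains. In this work, we prepare a domain-specific parallel corpus for the low-resource English-Assamese pair, covering the Agriculture, Government Office, Judiciary, Social Media, Tourism, COVID-19, Sports, and Literature domains. We tackle the data scarcity and word-order divergence problems through data augmentation and a prior alignment concept. We also contribute an Assamese pretrained language model (LM), Assamese word embeddings built from Assamese monolingual data, and a bilingual dictionary-based post-processing step to enhance transformer-based NMT. We achieve state-of-the-art results in both the forward (English-to-Assamese) and backward (Assamese-to-English) translation directions.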
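The abstract does not describe how the bilingual dictionary-based post-processing step works. As a minimal sketch of one plausible reading, and not the authors' method, the Python function below assumes the step replaces source-language words that the model copied into the output untranslated with their dictionary entries. The function name, the toy dictionary, and the placeholder tokens (`AS_...`) are all hypothetical and stand in for real Assamese words.

```python
from typing import Dict, List


def postprocess_with_dictionary(
    hypothesis_tokens: List[str], bilingual_dict: Dict[str, str]
) -> List[str]:
    """Replace untranslated source words in an NMT output with dictionary entries.

    Assumption (not from the paper): any output token whose lowercase form is a
    key of the bilingual dictionary is treated as a source word that was copied
    through untranslated, and is swapped for its dictionary translation.
    """
    return [bilingual_dict.get(token.lower(), token) for token in hypothesis_tokens]


# Hypothetical toy data: the "AS_..." strings are placeholders, not Assamese.
toy_dict = {"tourism": "AS_TOURISM", "court": "AS_COURT"}
hypothesis = ["AS_W1", "Tourism", "AS_W2"]
print(postprocess_with_dictionary(hypothesis, toy_dict))
```

A token-level lookup like this ignores multi-word expressions and morphology; a real system would likely need phrase matching and inflection handling, which the abstract does not detail.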