Robust Data Augmentation for Neural Machine Translation through EVALNET

Mathematics (IF 2.3, JCR Q1 MATHEMATICS, CAS Region 3: Mathematics). Publication date: 2022-12-27. DOI: 10.3390/math11010123
Yo-Han Park, Yong-Seok Choi, Seung Yun, Sang-Hun Kim, K. Lee
Citations: 3

Abstract

Since building Neural Machine Translation (NMT) systems requires a large parallel corpus, various data augmentation techniques have been adopted, especially for low-resource languages. To achieve the best performance through data augmentation, an NMT system should be able to evaluate the quality of the augmented data. Several studies have addressed data weighting techniques for assessing data quality. The basic signal used for data weighting in previous studies is the loss value that a system computes when learning from training data. The weight derived from this loss value, through simple heuristic rules or neural models, can adjust the loss used in the next step of the learning process. In this study, we propose EvalNet, a data evaluation network, to assess parallel data for NMT. EvalNet exploits a loss value, a cross-attention map, and the semantic similarity between the two sides of a parallel sentence pair as its features. The cross-attention map is an encoded representation of the cross-attention layers of the Transformer, the base architecture of the NMT system. The semantic similarity is a cosine distance between the semantic embeddings of a source sentence and a target sentence. Because the data are parallel, the combination of the cross-attention map and the semantic similarity proves to be an effective feature set for data quality evaluation, in addition to the loss value. EvalNet is the first NMT data evaluator network to introduce the cross-attention map and the semantic similarity as features. Through various experiments, we conclude that EvalNet is simple yet beneficial for robust training of an NMT system and outperforms previous approaches as a data evaluator.
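Two of the per-example features described above can be sketched in a few lines. The following is an illustrative sketch, not the paper's implementation: it computes a loss-derived weight (lower loss maps to a higher weight, one simple heuristic of the kind the abstract mentions) and a cosine similarity between source and target sentence embeddings. The exponential weighting function, the embedding dimension, and the random stand-in embeddings are all assumptions for illustration; a real system would use an actual sentence encoder and the NMT model's own loss.

```python
# Illustrative sketch (assumptions noted above, not the paper's code):
# two candidate features for evaluating one parallel sentence pair.
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def loss_weight(loss, temperature=1.0):
    """Map a per-example loss to a (0, 1] weight: lower loss -> higher weight.
    The exponential form is a hypothetical heuristic, not the paper's rule."""
    return float(np.exp(-loss / temperature))

rng = np.random.default_rng(0)
src_emb = rng.normal(size=512)   # stand-in for a source-sentence embedding
tgt_emb = rng.normal(size=512)   # stand-in for a target-sentence embedding

sim = cosine_similarity(src_emb, tgt_emb)
w = loss_weight(loss=2.3)
print(f"semantic similarity: {sim:.3f}, loss-derived weight: {w:.3f}")
```

In EvalNet these signals are not combined by a fixed rule like the one above; they are fed, together with the cross-attention map, into a learned evaluation network.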
Source journal: Mathematics (General Mathematics)
CiteScore: 4.00
Self-citation rate: 16.70%
Articles per year: 4032
Review time: 21.9 days
Journal description: Mathematics (ISSN 2227-7390) is an international, open access journal that provides an advanced forum for studies related to the mathematical sciences. It is devoted exclusively to the publication of high-quality reviews, regular research papers, and short communications in all areas of pure and applied mathematics. Mathematics also publishes timely and thorough survey articles on current trends, new theoretical techniques, novel ideas, and new mathematical tools in different branches of mathematics.