Challenges of Neural Machine Translation for Short Texts

IF 3.7 · CAS Zone 2 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics · Pub Date: 2022-03-07 · DOI: 10.1162/coli_a_00435
Yu Wan, Baosong Yang, Derek F. Wong, Lidia S. Chao, Liang Yao, Haibo Zhang, Boxing Chen
{"title":"Challenges of Neural Machine Translation for Short Texts","authors":"Yu Wan, Baosong Yang, Derek F. Wong, Lidia S. Chao, Liang Yao, Haibo Zhang, Boxing Chen","doi":"10.1162/coli_a_00435","DOIUrl":null,"url":null,"abstract":"Abstract Short texts (STs) present in a variety of scenarios, including query, dialog, and entity names. Most of the exciting studies in neural machine translation (NMT) are focused on tackling open problems concerning long sentences rather than short ones. The intuition behind is that, with respect to human learning and processing, short sequences are generally regarded as easy examples. In this article, we first dispel this speculation via conducting preliminary experiments, showing that the conventional state-of-the-art NMT approach, namely, Transformer (Vaswani et al. 2017), still suffers from over-translation and mistranslation errors over STs. After empirically investigating the rationale behind this, we summarize two challenges in NMT for STs associated with translation error types above, respectively: (1) the imbalanced length distribution in training set intensifies model inference calibration over STs, leading to more over-translation cases on STs; and (2) the lack of contextual information forces NMT to have higher data uncertainty on short sentences, and thus NMT model is troubled by considerable mistranslation errors. Some existing approaches, like balancing data distribution for training (e.g., data upsampling) and complementing contextual information (e.g., introducing translation memory) can alleviate the translation issues in NMT for STs. We encourage researchers to investigate other challenges in NMT for STs, thus reducing ST translation errors and enhancing translation quality.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":"48 1","pages":"321-342"},"PeriodicalIF":3.7000,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00435","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 10

Abstract

Short texts (STs) appear in a variety of scenarios, including queries, dialogs, and entity names. Most existing studies in neural machine translation (NMT) focus on tackling open problems concerning long sentences rather than short ones. The intuition behind this is that, with respect to human learning and processing, short sequences are generally regarded as easy examples. In this article, we first dispel this speculation by conducting preliminary experiments, showing that the conventional state-of-the-art NMT approach, namely the Transformer (Vaswani et al. 2017), still suffers from over-translation and mistranslation errors on STs. After empirically investigating the rationale behind this, we summarize two challenges in NMT for STs, each associated with one of the translation error types above: (1) the imbalanced length distribution in the training set worsens the calibration of model inference on STs, leading to more over-translation cases; and (2) the lack of contextual information forces NMT to face higher data uncertainty on short sentences, so the model is troubled by considerable mistranslation errors. Some existing approaches, such as balancing the data distribution for training (e.g., data upsampling) and complementing contextual information (e.g., introducing translation memory), can alleviate these translation issues. We encourage researchers to investigate further challenges in NMT for STs, thereby reducing ST translation errors and enhancing translation quality.
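As a rough illustration of the first remedy mentioned in the abstract (balancing the length distribution via data upsampling), the sketch below duplicates short source/target pairs in a parallel corpus until they make up a chosen share of the training data. The length threshold, target share, and function name are illustrative assumptions for this sketch, not the procedure used in the paper.

```python
import random


def upsample_short_pairs(pairs, short_len=10, target_share=0.3, seed=0):
    """Duplicate short (src, tgt) pairs until they make up roughly
    `target_share` of the corpus. A pair counts as "short" when its
    source side has at most `short_len` whitespace tokens. Threshold
    and share are illustrative choices, not values from the paper."""
    rng = random.Random(seed)
    short = [p for p in pairs if len(p[0].split()) <= short_len]
    long = [p for p in pairs if len(p[0].split()) > short_len]
    if not short:
        return list(pairs)  # nothing to upsample

    # Number of short pairs needed so that short / (short + long) ~= target_share.
    needed = int(target_share * len(long) / (1.0 - target_share))
    extra = max(0, needed - len(short))
    augmented = short + [rng.choice(short) for _ in range(extra)] + long
    rng.shuffle(augmented)
    return augmented


# Tiny toy-corpus usage example.
corpus = [
    ("hello", "bonjour"),
    ("how are you ?", "comment allez-vous ?"),
    ("this is a much longer training sentence used only for illustration .",
     "ceci est une phrase d'entrainement beaucoup plus longue , a titre d'illustration ."),
]
balanced = upsample_short_pairs(corpus, short_len=5, target_share=0.5)
print(len(balanced), "pairs after upsampling")
```

A more realistic setup would bucket sentences by length and resample per bucket, but the core idea of rebalancing the length distribution before training is the same.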
Source journal: Computational Linguistics
Category: Engineering & Technology – Computer Science: Interdisciplinary Applications
CiteScore: 15.80
Self-citation rate: 0.00%
Articles per year: 45
Review time: >12 weeks
Journal description: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.
Latest articles in this journal:
Generation and Polynomial Parsing of Graph Languages with Non-Structural Reentrancies
Languages through the Looking Glass of BPE Compression
Capturing Fine-Grained Regional Differences in Language Use through Voting Precinct Embeddings
Machine Learning for Ancient Languages: A Survey
Statistical Methods for Annotation Analysis by Silviu Paun, Ron Artstein, and Massimo Poesio