Using Statistical Machine Translation to Grade Training Data

A. Finch, E. Sumita
2008 Second International Symposium on Universal Communication
Publication date: 2008-12-15
DOI: 10.1109/ISUC.2008.20 (https://doi.org/10.1109/ISUC.2008.20)
Citations: 3

Abstract

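One of the main causes of errors in statistical machine translation is the erroneous phrase pairs that find their way into the phrase table. These phrase pairs result from poor word-to-word alignments made while training the translation model. The word alignment errors in turn cause errors during phrase extraction, and the resulting erroneous bilingual phrase pairs are then used during decoding and appear in the output of the machine translation system. Machine translation training data is never perfect: bilingual sentence pairs are often misaligned at the sentence level, or are poor translations of each other due to human error. Even when sentence pairs in the corpus are good translations of each other, the translations may not be literal enough to admit the kind of phrase-by-phrase translation needed to make good training data for a phrase-based statistical machine translation (SMT) system. This is because such SMT systems assume that the source can be transformed into the target simply by translating phrase by phrase with reordering. In the real world, many perfectly correct translations are not of this form, and such sentence pairs, although correct translations, make poor training data for the translation models of a phrase-based SMT system. This paper presents a technique in which preliminary machine translation systems are built for the sole purpose of identifying those sentence pairs in the training corpus that the systems are able to generate using their models, the hypothesis being that these sentence pairs are likely to make good training data for an SMT system of the same type. These sentence pairs are then used to bootstrap a second SMT system, and those identified as good training data are given additional weight when training the translation models. Using this technique, we were able to improve the performance of a Japanese-to-English SMT system by 1.2-1.5 BLEU points on unseen evaluation data.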
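The grading idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how reachability is tested or how the extra weight is applied, so the phrase table, the `can_generate` and `grade_corpus` helpers, the romanized example tokens, and the integer `bonus` weight below are all assumptions made for illustration. The sketch only checks whether a target sentence can be produced from the source by translating phrase by phrase with free reordering, and marks such pairs with a higher weight.

```python
from functools import lru_cache


def can_generate(src, tgt, phrase_table, max_phrase_len=3):
    """Return True if tgt can be built from src phrase by phrase with reordering.

    phrase_table maps a tuple of source tokens to a set of target-token tuples.
    (Hypothetical toy structure, not the format of any real SMT toolkit.)
    """
    src, tgt = tuple(src), tuple(tgt)
    n = len(src)
    full = (1 << n) - 1  # bitmask with every source position covered

    @lru_cache(maxsize=None)
    def search(covered, t_pos):
        # All source words used: succeed only if the whole target was produced.
        if covered == full:
            return t_pos == len(tgt)
        for i in range(n):
            for j in range(i + 1, min(n, i + max_phrase_len) + 1):
                span_mask = ((1 << (j - i)) - 1) << i
                if covered & span_mask:
                    break  # longer spans starting at i would overlap as well
                for cand in phrase_table.get(src[i:j], ()):
                    end = t_pos + len(cand)
                    # The candidate must extend the target left to right.
                    if tgt[t_pos:end] == cand and search(covered | span_mask, end):
                        return True
        return False

    return search(0, 0)


def grade_corpus(corpus, phrase_table, bonus=2):
    """Attach a weight to each pair: `bonus` if reachable, otherwise 1 (assumed scheme)."""
    return [(s, t, bonus if can_generate(s, t, phrase_table) else 1)
            for s, t in corpus]


# Toy phrase table standing in for a preliminary system's model.
phrase_table = {
    ("watashi", "wa"): {("i",)},
    ("hon", "o"): {("a", "book")},
    ("yomu",): {("read",)},
}

source = ["watashi", "wa", "hon", "o", "yomu"]
corpus = [
    (source, ["i", "read", "a", "book"]),          # literal, reachable with reordering
    (source, ["reading", "is", "my", "hobby"]),    # free translation, not reachable
]

graded = grade_corpus(corpus, phrase_table)
print([w for _, _, w in graded])  # [2, 1]
```

In a real pipeline the reachability test would be run with a trained preliminary system rather than a hand-written table, and the weights would feed into retraining of the translation models; how that is done is left open here because the abstract does not specify it.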