Product Review Translation: Parallel Corpus Creation and Robustness towards User-generated Noisy Text

Kamal Kumar Gupta, Soumya Chennabasavaraj, Nikesh Garera, Asif Ekbal
Published in: Proceedings of The 4th Workshop on e-Commerce and NLP
Publication date: 2021-08-01
DOI: 10.18653/v1/2021.ecnlp-1.21 (https://doi.org/10.18653/v1/2021.ecnlp-1.21)
Citations: 2

Abstract

Reviews written by users for a particular product or service strongly influence customers' purchasing decisions. Although online e-commerce portals have had an immense impact on our lives, the available content is predominantly in English, which often limits its wider use. The number of e-commerce users who are not proficient in English is growing exponentially. Hence, these services need to be made available in non-English languages, especially in a multilingual country like India. This can be achieved with a robust in-domain machine translation (MT) system. However, user-written reviews pose unique challenges to MT, such as misspelled words, ungrammatical constructions, colloquial terms, and a lack of resources such as in-domain parallel corpora. We address these challenges by presenting an English-Hindi parallel corpus for the review domain. We train an English-to-Hindi neural machine translation (NMT) system to translate product reviews available on e-commerce websites. Training a Transformer-based NMT model on the generated data yields a score of 33.26 BLEU for English-to-Hindi translation. To make the NMT model robust to noisy tokens in the reviews, we integrate a character-based language model that generates word vectors and maps noisy tokens to their correct forms. Experiments on four language pairs, namely English-Hindi, English-German, English-French, and English-Czech, show BLEU scores of 35.09, 28.91, 34.68, and 14.52, which are improvements of 1.61, 1.05, 1.63, and 1.94, respectively, over the baseline.
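The idea of mapping noisy tokens to their correct forms can be illustrated with a small sketch. This is not the authors' character-based language model: as a stand-in, it builds character-trigram count vectors and picks the closest vocabulary word by cosine similarity. The function names, the toy vocabulary, and the 0.4 similarity threshold are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=3):
    """Return a bag of character n-grams, with boundary markers added."""
    padded = f"<{word.lower()}>"
    return Counter(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def normalize_token(token, vocab, threshold=0.4):
    """Map a token to its most similar vocabulary word, or keep it unchanged.

    The threshold is a hypothetical cut-off: below it, the token is assumed
    to be a genuinely unknown word rather than a misspelling.
    """
    if token.lower() in vocab:
        return token
    vec = char_ngrams(token)
    best = max(vocab, key=lambda w: cosine(vec, char_ngrams(w)))
    return best if cosine(vec, char_ngrams(best)) >= threshold else token


def normalize_review(text, vocab):
    """Normalize every whitespace-separated token before translation."""
    return " ".join(normalize_token(t, vocab) for t in text.split())
```

With a toy vocabulary such as `{"good", "product", "quality", "battery", "excellent"}`, calling `normalize_review("good prodct qualty", vocab)` returns `"good product quality"`, while a token with no overlapping trigrams (for example `"xyz"`) is left unchanged. A learned character-level model, as described in the abstract, would replace the count vectors with trained embeddings.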