Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong

Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval Pub Date : 2021-12-17 DOI:10.1145/3508230.3508242

Hei Yi Mak, Tan Lee

{"title":"Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong","authors":"Hei Yi Mak, Tan Lee","doi":"10.1145/3508230.3508242","DOIUrl":null,"url":null,"abstract":"The majority of inhabitants in Hong Kong are able to read and write in standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The rise of written Cantonese is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of this study is on the effort of preparing good amount of training data for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar sentences from parallel articles on Chinese Wikipedia and Cantonese Wikipedia. We show that leveraging highly similar sentence pairs mined from Wikipedia improves translation performance in all test sets. Our system outperforms Baidu Fanyi's Chinese-to-Cantonese translation on 6 out of 8 test sets in BLEU scores. Translation examples reveal that our system is able to capture important linguistic transformations between standard Chinese and spoken Cantonese.","PeriodicalId":252146,"journal":{"name":"Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3508230.3508242","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

Abstract

The majority of inhabitants in Hong Kong are able to read and write in standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The rise of written Cantonese is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of this study is on the effort of preparing good amount of training data for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar sentences from parallel articles on Chinese Wikipedia and Cantonese Wikipedia. We show that leveraging highly similar sentence pairs mined from Wikipedia improves translation performance in all test sets. Our system outperforms Baidu Fanyi's Chinese-to-Cantonese translation on 6 out of 8 test sets in BLEU scores. Translation examples reveal that our system is able to capture important linguistic transformations between standard Chinese and spoken Cantonese.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

低资源的新语言机器学习:以香港书面语和口语为例

香港大部分居民都能读写普通话，但在日常生活中主要使用粤语。粤语口语可以改写成汉字，这就构成了所谓的书面粤语。书面广东话在词汇和语法上与标准书面汉语有显著的差异。书面粤语在网络世界的兴起越来越明显。说普通话的人和说粤语的人之间的互动越来越多，这导致了对汉语和粤语之间自动翻译的明确需求。本文介绍了一种基于变压器的神经网络机器翻译系统，用于汉语对粤语的书面语翻译。鉴于汉语和粤语的平行文本数据极为稀缺，本研究的重点是为NMT准备大量的训练数据。除了从以往的语言学研究和分散的互联网资源中收集28K平行句子外，我们还设计了一种有效的方法，通过自动从中文维基百科和广东话维基百科的平行文章中提取语义相似的句子对来获得72K平行句子。我们表明，利用从维基百科中挖掘的高度相似的句子对可以提高所有测试集的翻译性能。我们的系统在8个测试集中的6个测试集的BLEU分数上超过了百度翻一的汉粤语翻译。翻译实例表明，我们的系统能够捕捉到标准汉语和粤语口语之间的重要语言转换。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval

自引率

0.00%

发文量