Integration of Bilingual Lists for Domain-Specific Statistical Machine Translation for Sinhala-Tamil

2018 Moratuwa Engineering Research Conference (MERCon). Pub Date: 2018-05-01. DOI: 10.1109/MERCON.2018.8421901

Fathima Farhath, Surangika Ranathunga, Sanath Jayasena, G. Dias

Pages: 538-543

Citations: 12

Abstract




The availability of quality parallel data is a major requirement for building a reasonably well-performing statistical machine translation (SMT) system. Developing a decent SMT system for a low-resourced language pair such as Sinhala and Tamil, which lacks a large parallel corpus, is therefore rather challenging. Past research on other language pairs has shown that various terminology / bilingual list integration methodologies can improve the quality of SMT systems, for domain-specific SMT in particular. In this paper, we explore whether this can be effective for Sinhala-Tamil machine translation in the domain of official government documents. We evaluate the impact of three types of bilingual lists, namely a list of government organizations and official designations, a glossary related to government administration and operations, and a general bilingual dictionary, using four different methodologies (three static and one dynamic). Of the four methodologies, one gave notable improvements over the baseline for all three types of list.
