Fathima Farhath, Surangika Ranathunga, Sanath Jayasena, G. Dias
{"title":"面向特定领域的僧伽罗-泰米尔语统计机器翻译的双语列表集成","authors":"Fathima Farhath, Surangika Ranathunga, Sanath Jayasena, G. Dias","doi":"10.1109/MERCON.2018.8421901","DOIUrl":null,"url":null,"abstract":"Availability of quality parallel data is a major requirement to build a reasonably well performing statistical machine translation (SMT) system. Thus, developing a decent SMT system for a low-resourced language pair like Sinhala and Tamil that does not have a large parallel corpus is rather challenging. Past research for other different language pairs has shown that different terminology / bilingual list integration methodologies can be used to improve the quality of SMT systems, for domain-specific SMT in particular. In this paper, we explore if this can be effective for Sinhala-Tamil machine translation for the domain of official government documents. We evaluate the impact of three types of bilingual lists, namely, a list of government organizations and official designations, a glossary related to government administration and operations, and a general bilingual dictionary, based on four different methodologies (three static and one dynamic). Out of four, one methodology gave notable improvements for all three types of list over the baseline.","PeriodicalId":6603,"journal":{"name":"2018 Moratuwa Engineering Research Conference (MERCon)","volume":"7 1","pages":"538-543"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Integration of Bilingual Lists for Domain-Specific Statistical Machine Translation for Sinhala-Tamil\",\"authors\":\"Fathima Farhath, Surangika Ranathunga, Sanath Jayasena, G. Dias\",\"doi\":\"10.1109/MERCON.2018.8421901\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Availability of quality parallel data is a major requirement to build a reasonably well performing statistical machine translation (SMT) system. Thus, developing a decent SMT system for a low-resourced language pair like Sinhala and Tamil that does not have a large parallel corpus is rather challenging. Past research for other different language pairs has shown that different terminology / bilingual list integration methodologies can be used to improve the quality of SMT systems, for domain-specific SMT in particular. In this paper, we explore if this can be effective for Sinhala-Tamil machine translation for the domain of official government documents. We evaluate the impact of three types of bilingual lists, namely, a list of government organizations and official designations, a glossary related to government administration and operations, and a general bilingual dictionary, based on four different methodologies (three static and one dynamic). Out of four, one methodology gave notable improvements for all three types of list over the baseline.\",\"PeriodicalId\":6603,\"journal\":{\"name\":\"2018 Moratuwa Engineering Research Conference (MERCon)\",\"volume\":\"7 1\",\"pages\":\"538-543\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 Moratuwa Engineering Research Conference (MERCon)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MERCON.2018.8421901\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Moratuwa Engineering Research Conference (MERCon)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MERCON.2018.8421901","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Integration of Bilingual Lists for Domain-Specific Statistical Machine Translation for Sinhala-Tamil
Availability of quality parallel data is a major requirement to build a reasonably well performing statistical machine translation (SMT) system. Thus, developing a decent SMT system for a low-resourced language pair like Sinhala and Tamil that does not have a large parallel corpus is rather challenging. Past research for other different language pairs has shown that different terminology / bilingual list integration methodologies can be used to improve the quality of SMT systems, for domain-specific SMT in particular. In this paper, we explore if this can be effective for Sinhala-Tamil machine translation for the domain of official government documents. We evaluate the impact of three types of bilingual lists, namely, a list of government organizations and official designations, a glossary related to government administration and operations, and a general bilingual dictionary, based on four different methodologies (three static and one dynamic). Out of four, one methodology gave notable improvements for all three types of list over the baseline.