Mining Japanese-Vietnamese multi-level parallel text corpus from Wikipedia data resource
T. Do. 2021 RIVF International Conference on Computing and Communication Technologies (RIVF), pp. 1-6, 2021-08-19. DOI: 10.1109/RIVF51545.2021.9642108
This paper addresses the task of mining a Japanese-Vietnamese parallel text corpus from comparable data resources for use in machine translation. Data resources for this language pair are scarce, so parallel text should be extracted at multiple levels, namely the sentence level and the fragment level, to obtain as much data as possible. Moreover, the proposed method does not depend on word order, so it can be applied to languages from different families. Results on the Japanese-Vietnamese Wikipedia resource show that the proposed method significantly increases the amount of extracted parallel data. The extracted multi-level parallel text also contributes to the quality of machine translation. More than 144,000 pairs of parallel sentences and 148,000 pairs of parallel fragments have been mined and released to the research community.
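The abstract does not specify how candidate pairs are scored, so the following is only an illustrative sketch of what a word-order-independent similarity measure could look like, not the paper's actual method. It assumes pre-tokenized input and a small bilingual lexicon; the function name `order_free_score`, the toy lexicon, and the example tokens are all hypothetical.

```python
def order_free_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens that have at least one lexicon
    translation present anywhere in the target (position is ignored)."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)  # a set discards word order entirely
    covered = sum(
        1 for tok in src_tokens
        if lexicon.get(tok, set()) & tgt_set
    )
    return covered / len(src_tokens)


# Toy, hand-made lexicon (an assumption for illustration only).
toy_lexicon = {
    "私": {"tôi"},    # I
    "本": {"sách"},   # book
    "読む": {"đọc"},  # read
}

ja = ["私", "は", "本", "を", "読む"]  # subject-object-verb order
vi = ["tôi", "đọc", "sách"]            # subject-verb-object order

score = order_free_score(ja, vi, toy_lexicon)
shuffled = order_free_score(ja, list(reversed(vi)), toy_lexicon)
```

Because the target side is reduced to a set, `score` and `shuffled` are identical: three of the five Japanese tokens are covered, regardless of where their translations appear in the Vietnamese sentence. A real system would likely add thresholds, length checks, and a much larger lexicon or alignment model.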