Supervised Bilingual Word Embeddings for Low-Resource Language Pairs: Myanmar and Thai
2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)
Published: 2021-12-21
DOI: 10.1109/iSAI-NLP54397.2021.9678157
Abstract
Bilingual word embeddings (BWEs) represent the lexicons of two different languages in a shared embedding space, which is useful for cross-lingual natural language processing (NLP) tasks. In particular, they are very useful for machine translation of low-resource languages, because parallel corpora for those languages are scarce. Most existing work has learned bilingual word embeddings for high-resource language pairs. To the best of our knowledge, there are no studies on bilingual word embeddings for the low-resource language pairs Myanmar-Thai and Myanmar-English. In this paper, we present and evaluate bilingual word embeddings for the Myanmar-Thai, Myanmar-English, Thai-English, and English-Thai language pairs. To train bilingual word embeddings for each language pair, we first constructed monolingual word embeddings from monolingual corpora, using either a word2vec or a fastText model. A bilingual dictionary was then used to frame the learning of bilingual mappings as a supervised machine learning task: a vector space is first learned independently on each monolingual corpus, and a linear alignment strategy then maps the monolingual embeddings into a common bilingual vector space. We used bilingual dictionary induction as the intrinsic testbed for evaluating the quality of the cross-lingual mappings from our constructed bilingual word embeddings. For all low-resource language pairs, monolingual word2vec embedding models with the cross-domain similarity local scaling (CSLS) metric achieved the best coverage and accuracy.
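The supervised pipeline described in the abstract (a linear map fitted on dictionary pairs, followed by CSLS retrieval for bilingual dictionary induction) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract only says "a linear alignment strategy", so the orthogonal Procrustes solution used here is an assumption, chosen because it is one common option. The synthetic embeddings, the function names (`normalize`, `procrustes`, `csls_scores`), the neighbourhood size `k=10`, and the train/test split sizes are all hypothetical; real word2vec or fastText vectors for the two languages would take the place of `X` and `Y`.

```python
import numpy as np


def normalize(M):
    """Scale every row vector to unit length."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)


def procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F (assumed alignment method).

    Rows of X and Y are the source and target embeddings of the
    dictionary pairs, in the same order.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def csls_scores(src, tgt, k=10):
    """CSLS score matrix between mapped source and target embeddings.

    CSLS(x, y) = 2 * cos(x, y) - r_tgt(x) - r_src(y), where r_tgt(x) is
    the mean cosine of x to its k nearest target vectors and r_src(y)
    is the mean cosine of y to its k nearest source vectors.
    """
    src, tgt = normalize(src), normalize(tgt)
    cos = src @ tgt.T
    k_t = min(k, tgt.shape[0])
    k_s = min(k, src.shape[0])
    # Mean similarity of each source word to its k nearest targets.
    r_x = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # Mean similarity of each target word to its k nearest sources.
    r_y = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    return 2 * cos - r_x[:, None] - r_y[None, :]


# Synthetic stand-in data: the "target language" space is a hidden
# rotation Q of the "source language" space (hypothetical setup).
rng = np.random.default_rng(0)
n_words, dim, n_train = 300, 50, 200
X = normalize(rng.standard_normal((n_words, dim)))
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
Y = X @ Q

# Fit the map on the "dictionary" (the first n_train word pairs).
W = procrustes(X[:n_train], Y[:n_train])

# Bilingual dictionary induction on held-out words: for each mapped
# source word, pick the target word with the highest CSLS score.
scores = csls_scores(X[n_train:] @ W, Y)
predicted = scores.argmax(axis=1)
gold = np.arange(n_train, n_words)
accuracy = float((predicted == gold).mean())
print(f"precision@1 on held-out pairs: {accuracy:.3f}")
```

Constraining `W` to be orthogonal preserves distances and angles within the source space, which is one reason this form of linear alignment is popular. An unconstrained least-squares map would be another valid reading of "linear alignment". CSLS subtracts each word's average similarity to its neighbourhood, which penalizes "hub" vectors that are close to many words at once.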