Tibetan Syllable Prediction with Pre-trained Cross-lingual Language Model

2022 IEEE 5th International Conference on Computer and Communication Engineering Technology (CCET). Pub Date: 2022-08-19. DOI: 10.1109/CCET55412.2022.9906389

Zibo Yi, Qingbo Wu, Jie Yu, Yongtao Tang, Xiaodong Liu, Long Peng, Jun Ma


Abstract

As Tibetan language information technology has developed, the volume of Tibetan text on the Internet has grown year by year. Tibetan input methods and Tibetan error correction both depend on language prediction, which makes Tibetan prediction a pressing problem. It poses three challenges: Tibetan syllable composition is complex, the vocabulary of Tibetan words built from syllables is extremely large, and Tibetan word segmentation technology is not yet mature. To address these challenges, this paper proposes a Tibetan syllable prediction method based on a pre-trained cross-lingual language model, using Tibetan syllables rather than Tibetan words as the prediction tokens. The method takes the cross-lingual language model XLM-R and fine-tunes it on Tibetan news text so that it is better suited to predicting Tibetan in the news domain. We run syllable prediction experiments on text crawled from a Tibetan news website. The experiments show that our model predicts Tibetan text with higher precision than current n-gram methods.
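To make the idea of syllable-level tokens concrete, here is a minimal sketch of the kind of n-gram baseline the abstract compares against. It is not the authors' implementation and does not involve XLM-R. It assumes that syllables can be separated on the tsheg (U+0F0B) and shad (U+0F0D) marks, and the function names `split_syllables`, `train_bigram`, and `predict_next` are hypothetical, introduced only for illustration.

```python
import re
from collections import Counter, defaultdict

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (assumed syllable delimiter)
SHAD = "\u0f0d"   # Tibetan clause-ending mark

def split_syllables(text):
    """Split Tibetan text into syllables on tsheg, shad, and whitespace."""
    pattern = f"[{TSHEG}{SHAD}\\s]+"
    return [s for s in re.split(pattern, text) if s]

def train_bigram(sentences):
    """Count, for each syllable, which syllables follow it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        syls = split_syllables(sentence)
        for prev, nxt in zip(syls, syls[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev, k=3):
    """Return up to k most frequent syllables seen after `prev`."""
    followers = counts.get(prev, Counter())
    return [syl for syl, _ in followers.most_common(k)]

# Toy example: the four syllables of a common greeting, joined explicitly
# with tsheg so the delimiter in the string is unambiguous.
syllables = ["བཀྲ", "ཤིས", "བདེ", "ལེགས"]
sentence = TSHEG.join(syllables) + SHAD
model = train_bigram([sentence])

print(split_syllables(sentence))
print(predict_next(model, syllables[0]))
```

Because the tokens are syllables, no word segmenter is needed, which is the motivation the abstract gives for predicting syllables rather than words. A fine-tuned masked language model would replace the bigram counts with contextual predictions over the same kind of token sequence.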


