Tibetan Syllable Prediction with Pre-trained Cross-lingual Language Model

Zibo Yi, Qingbo Wu, Jie Yu, Yongtao Tang, Xiaodong Liu, Long Peng, Jun Ma
Published in: 2022 IEEE 5th International Conference on Computer and Communication Engineering Technology (CCET), 2022-08-19
DOI: 10.1109/CCET55412.2022.9906389 (https://doi.org/10.1109/CCET55412.2022.9906389)
Citations: 0

Abstract

In recent years, as Tibetan information technology has developed, the volume of Tibetan text on the Internet has grown steadily. Tibetan input methods and Tibetan error correction both depend on language prediction, which makes Tibetan prediction a pressing problem. Three challenges stand out: the composition of Tibetan syllables is complex, the vocabulary of Tibetan words built from those syllables is extremely large, and Tibetan word segmentation technology is not yet mature. To address these problems, this paper proposes a Tibetan syllable prediction method based on a pre-trained cross-lingual language model, using Tibetan syllables rather than Tibetan words as the prediction tokens. The method takes the cross-lingual language model XLM-R and fine-tunes it on Tibetan news text so that it is better suited to predicting Tibetan in the news domain. We run syllable prediction experiments on text crawled from a Tibetan news website. The experiments show that our model achieves higher precision on Tibetan text prediction than current n-gram methods.
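To make the idea of syllable-level tokens concrete, the following is a minimal sketch of the kind of n-gram baseline the abstract compares against. It is not the authors' implementation and it is not the XLM-R model. It assumes that syllables are delimited by the tsheg mark (U+0F0B) and that the shad mark (U+0F0D) can be treated as just another boundary; the function names `split_syllables`, `train_bigram`, and `predict_next` are hypothetical, and a bigram is used only as the simplest instance of an n-gram model.

```python
from collections import Counter, defaultdict

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (assumed syllable delimiter)
SHAD = "\u0f0d"   # Tibetan clause mark (treated here as a plain boundary)


def split_syllables(text):
    """Split Tibetan text into syllables on the tsheg, dropping empty pieces."""
    cleaned = text.replace(SHAD, TSHEG)
    return [piece.strip() for piece in cleaned.split(TSHEG) if piece.strip()]


def train_bigram(sentences):
    """Count how often each syllable follows each preceding syllable."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        syllables = split_syllables(sentence)
        for prev, nxt in zip(syllables, syllables[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev, k=3):
    """Return up to k most frequent syllables seen after `prev`."""
    followers = counts.get(prev, Counter())
    return [syllable for syllable, _ in followers.most_common(k)]


# Toy corpus built from individual syllables so the delimiter is explicit.
corpus = [
    TSHEG.join(["བོད", "ཡིག"]) + SHAD,
    TSHEG.join(["བོད", "ཡིག"]) + TSHEG,
    TSHEG.join(["བོད", "སྐད"]) + SHAD,
]
model = train_bigram(corpus)
print(predict_next(model, "བོད", k=2))
```

Predicting over syllables keeps the token inventory far smaller than a word vocabulary and avoids relying on a word segmenter, which is the motivation the abstract gives for choosing syllables as tokens.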