Development of the N-gram Model for Azerbaijani Language

Aliya Bannayeva, Mustafa Aslanov
Published in: 2020 IEEE 14th International Conference on Application of Information and Communication Technologies (AICT), 2020-10-07
DOI: 10.1109/AICT50176.2020.9368645
Citations: 1

Abstract

This research focuses on a text prediction model for the Azerbaijani language. A parsed and cleaned copy of Azerbaijani Wikipedia is used as the corpus for the language model. In total, the corpus contains more than a million distinct words and sentences and over seven hundred million characters. For the language model itself, a statistical n-gram model is implemented. N-grams are contiguous sequences of n strings or characters from a given sample of text or speech. A Markov chain is used as the model to predict the next word. The Markov chain focuses on the probabilities of the word sequences within the n-grams rather than on probabilities over the entire corpus. This simplifies the task and reduces computational overhead while still producing sensible results. Logically, the higher the n in the n-grams, the more sensible the resulting prediction. Concretely, bigrams, trigrams, quadgrams (4-grams), and fivegrams (5-grams) are implemented. The model is evaluated intrinsically by computing its perplexity.
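The pipeline the abstract describes (count n-grams, predict the next word from the preceding n-1 words, score with perplexity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `<s>` and `</s>` padding tokens, the toy sentences, and the add-one smoothing used in the perplexity calculation are all assumptions, since the abstract does not specify them.

```python
from collections import Counter, defaultdict
import math

def train_ngrams(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it (n >= 2)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so the first word has a full context and sentence ends are modeled.
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts

def predict_next(counts, context, n):
    """Return the most frequent continuation of the last n-1 words, or None if unseen."""
    key = tuple((["<s>"] * (n - 1) + list(context))[-(n - 1):])
    followers = counts.get(key)
    return followers.most_common(1)[0][0] if followers else None

def perplexity(counts, sentences, n, vocab_size):
    """Perplexity under add-one smoothing (an assumption; the abstract names no smoothing)."""
    log_prob, total = 0.0, 0
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            followers = counts.get(context, Counter())
            p = (followers[tokens[i]] + 1) / (sum(followers.values()) + vocab_size)
            log_prob += math.log(p)
            total += 1
    return math.exp(-log_prob / total)

# Toy stand-in for the Wikipedia corpus (made-up sentences, for illustration only).
corpus = [
    "mən kitab oxuyuram",
    "mən kitab yazıram",
    "mən kitab oxuyuram",
    "sən məktub yazırsan",
]
vocab_size = len({word for s in corpus for word in s.split()} | {"</s>"})

bigrams = train_ngrams(corpus, 2)
trigrams = train_ngrams(corpus, 3)

print(predict_next(bigrams, ["kitab"], 2))          # oxuyuram
print(predict_next(trigrams, ["mən", "kitab"], 3))  # oxuyuram
print(perplexity(bigrams, corpus, 2, vocab_size))
```

Lower perplexity means the model assigns higher probability to the evaluated text. Scoring the training sentences, as done here, only checks that the code runs; a meaningful evaluation would use held-out text.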