Development of the N-gram Model for Azerbaijani Language
Aliya Bannayeva, Mustafa Aslanov
2020 IEEE 14th International Conference on Application of Information and Communication Technologies (AICT), 2020-10-07
DOI: 10.1109/AICT50176.2020.9368645
This research focuses on a text prediction model for the Azerbaijani language. A parsed and cleaned dump of the Azerbaijani Wikipedia is used as the corpus for the language model. In total, it contains more than a million distinct words and sentences, and over seven hundred million characters. The language model itself is a statistical n-gram model. N-grams are contiguous sequences of n words or characters drawn from a given sample of text or speech. A Markov chain is used to predict the next word: it considers the probabilities of word sequences within the n-grams rather than probabilities over the entire corpus. This simplifies the task and reduces computational overhead while still maintaining sensible results. Logically, the higher the n in the n-grams, the more sensible the resulting prediction. Concretely, bigrams, trigrams, quadgrams, and fivegrams are implemented. The model is evaluated intrinsically by computing its perplexity.
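The abstract's core idea — Markov-chain next-word prediction from n-gram counts, evaluated by perplexity — can be illustrated with a minimal bigram sketch. This is not the authors' implementation: the toy corpus, the add-alpha smoothing, and all function names here are illustrative assumptions standing in for the Azerbaijani Wikipedia corpus and whatever smoothing the paper uses.

```python
# Minimal bigram Markov-chain predictor with intrinsic (perplexity) evaluation.
# The tiny English corpus is a stand-in for the parsed Azerbaijani Wikipedia dump.
import math
from collections import Counter, defaultdict

corpus = [
    "the model predicts the next word".split(),
    "the model computes the probability".split(),
]

# Count bigram transitions and context (first-word) frequencies.
bigram_counts = defaultdict(Counter)
context_counts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        bigram_counts[w1][w2] += 1
        context_counts[w1] += 1

VOCAB_SIZE = len({w for s in corpus for w in s})

def next_word(context):
    """Markov-chain prediction: the most probable word following `context`."""
    followers = bigram_counts.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def bigram_prob(w1, w2, alpha=1.0):
    """Add-alpha smoothed bigram probability P(w2 | w1).
    (Assumption: the paper may use a different smoothing scheme.)"""
    return (bigram_counts[w1][w2] + alpha) / (context_counts[w1] + alpha * VOCAB_SIZE)

def perplexity(sentence):
    """Intrinsic evaluation: perplexity of the bigram model on a word sequence.
    Lower is better; it is the inverse geometric mean of transition probabilities."""
    log_prob = 0.0
    n = 0
    for w1, w2 in zip(sentence, sentence[1:]):
        log_prob += math.log(bigram_prob(w1, w2))
        n += 1
    return math.exp(-log_prob / n)

print(next_word("the"))  # → "model" ("model" follows "the" most often above)
print(perplexity("the model predicts the next word".split()))
```

Higher-order variants (trigrams through fivegrams, as in the paper) follow the same pattern with the context widened to the previous n-1 words; the trade-off the abstract notes is that larger n gives more sensible predictions but sparser counts and more storage.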