Nazerke Sultanova, Gulshat Kessikbayeva, Y. Amangeldi
Kazakh Language Open Vocabulary Language Model with Deep Neural Networks
2019 15th International Conference on Electronics, Computer and Computation (ICECCO), published 2019-12-01
DOI: 10.1109/ICECCO48375.2019.9043182 (https://doi.org/10.1109/ICECCO48375.2019.9043182)
Citations: 0
Abstract
Natural language models are a crucial tool in computational linguistics. They are especially difficult to build for agglutinative languages, in which words are formed by attaching sequences of morphemes, each of which can change the meaning of the word. For such languages, a fixed and limited vocabulary is itself a restriction. A character-based approach may help to overcome this problem; however, it introduces the need to disambiguate words according to context. The present work aims to build a character-based language model for the Kazakh language using deep neural networks, namely a Long Short-Term Memory (LSTM) model. The language model is generative and aims to produce all possible correct words within a given context. A word is treated as a unit generated from characters, so that any possible word form can be generated. For the model to learn the language correctly, it must be trained on data originally written in Kazakh rather than translated from other sources. Therefore, the model will be trained on books written in Kazakh.
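
The vocabulary restriction described in the abstract can be illustrated with a small sketch. This is not the authors' code: the word list, the unseen word, the boundary symbols `^` and `$`, and the helper names `chars_covered` and `next_char_pairs` are all assumptions chosen for illustration. It shows that an inflected Kazakh form missing from a word-level vocabulary can still be fully spelled with characters seen in training, and how a word might be split into the (prefix, next character) pairs that a character-level model such as an LSTM would be trained to predict.

```python
# Toy illustration only; the words below are hypothetical training data,
# not taken from the paper's corpus.
train_words = ["кітап", "кітаптар", "кітапта", "қаламыз", "мектептерімізде"]

# An inflected form ("in our books") that never appears as a whole word above.
unseen_word = "кітаптарымызда"

word_vocab = set(train_words)            # closed, word-level vocabulary
char_vocab = set("".join(train_words))   # character-level vocabulary


def chars_covered(word, vocab):
    """True if every character of `word` is present in the character vocabulary."""
    return all(ch in vocab for ch in word)


def next_char_pairs(word, bos="^", eos="$"):
    """Split a word into (prefix, next character) training pairs.

    `bos` and `eos` are assumed word-boundary symbols; a character-level
    model would learn to predict the second element from the first.
    """
    symbols = [bos] + list(word) + [eos]
    return [("".join(symbols[:i]), symbols[i]) for i in range(1, len(symbols))]


if __name__ == "__main__":
    print("in word vocabulary:", unseen_word in word_vocab)
    print("covered by characters:", chars_covered(unseen_word, char_vocab))
    for prefix, target in next_char_pairs("кітап"):
        print(repr(prefix), "->", repr(target))
```

The word-level lookup fails for the unseen form while the character-level check succeeds, which is the sense in which a character-based model is open-vocabulary. The trade-off noted in the abstract remains: the model must rely on context to decide which of the many spellable forms is correct.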