Next word prediction for Urdu language using deep learning models

IF 3.1 | CAS Tier 3 (Computer Science) | Q2 (Computer Science, Artificial Intelligence) | Computer Speech and Language | Pub Date: 2024-03-02 | DOI: 10.1016/j.csl.2024.101635
Ramish Shahid, Aamir Wali, Maryam Bashir
{"title":"Next word prediction for Urdu language using deep learning models","authors":"Ramish Shahid,&nbsp;Aamir Wali,&nbsp;Maryam Bashir","doi":"10.1016/j.csl.2024.101635","DOIUrl":null,"url":null,"abstract":"<div><p>Deep learning models are being used for natural language processing. Despite their success, these models have been employed for only a few languages. Pretrained models also exist but they are mostly available for the English language. Low resource languages like Urdu are not able to benefit from these pre-trained deep learning models and their effectiveness in Urdu language processing remains a question. This paper investigates the usefulness of deep learning models for the next word prediction and suggestion model for Urdu. For this purpose, this study considers and proposes two word prediction models for Urdu. Firstly, we propose to use LSTM for neural language modeling of Urdu. LSTMs are a popular approach for language modeling due to their ability to process sequential data. Secondly, we employ BERT which was specifically designed for natural language modeling. We train BERT from scratch using an Urdu corpus consisting of 1.1 million sentences thus paving the way for further studies in the Urdu language. We achieved an accuracy of 52.4% with LSTM and 73.7% with BERT. Our proposed BERT model outperformed two other pre-trained BERT models developed for Urdu. Since this is a multi-class problem and the number of classes is equal to the vocabulary size, this accuracy is still promising. Based on the present performance, BERT seems to be effective for the Urdu language, and this paper lays the groundwork for future studies.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":3.1000,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824000184","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Deep learning models are widely used for natural language processing, yet despite their success they have been applied to only a few languages. Pre-trained models also exist, but they are mostly available for English. Low-resource languages such as Urdu cannot readily benefit from these pre-trained deep learning models, and their effectiveness for Urdu language processing remains an open question. This paper investigates the usefulness of deep learning models for next-word prediction and suggestion in Urdu. For this purpose, the study considers and proposes two word-prediction models for Urdu. First, we propose using an LSTM for neural language modeling of Urdu; LSTMs are a popular approach to language modeling because of their ability to process sequential data. Second, we employ BERT, which was specifically designed for natural language modeling. We train BERT from scratch on an Urdu corpus of 1.1 million sentences, paving the way for further studies of the Urdu language. We achieved an accuracy of 52.4% with the LSTM and 73.7% with BERT. Our proposed BERT model outperformed two other pre-trained BERT models developed for Urdu. Since this is a multi-class problem in which the number of classes equals the vocabulary size, this accuracy is promising. Based on these results, BERT appears to be effective for Urdu, and this paper lays the groundwork for future studies.
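The abstract does not specify the authors' exact architectures, so the following is a minimal illustrative sketch of an LSTM next-word predictor in PyTorch; the class name and hyperparameters (`LSTMLanguageModel`, `embed_dim`, `hidden_dim`) are assumptions for illustration, not the paper's configuration. The output layer has one logit per vocabulary word, which is why the task is a multi-class problem with as many classes as the vocabulary size.

```python
# Minimal sketch (not the paper's model): LSTM next-word prediction in PyTorch.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # One logit per vocabulary word: next-word prediction is a
        # multi-class problem whose class count equals the vocabulary size.
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        output, _ = self.lstm(embedded)        # (batch, seq_len, hidden_dim)
        return self.fc(output[:, -1, :])       # logits for the word after the context

# Training would minimize cross-entropy against the true next word, e.g.:
# loss = nn.CrossEntropyLoss()(model(context_ids), next_word_ids)
```

For the BERT model, next-word suggestion can be framed as masked-language-model inference: append a mask token after the context and rank the vocabulary by the logits at that position. The sketch below uses the Hugging Face `transformers` API; the checkpoint name `"urdu-bert"` is a placeholder, since the abstract does not name a released model.

```python
# Hedged sketch: next-word suggestion via masked-LM inference with transformers.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "urdu-bert"  # placeholder; the paper's checkpoint name is not given
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

context = "میں اسکول"                        # Urdu context: "I ... school"
text = context + " " + tokenizer.mask_token  # mask the next-word slot
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and take the five highest-scoring vocabulary words.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top5 = torch.topk(logits[0, mask_pos].squeeze(0), k=5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))  # candidate next words
```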

Source Journal

Computer Speech and Language
Category: Engineering & Technology, Computer Science: Artificial Intelligence
CiteScore: 11.30
Self-citation rate: 4.70%
Articles published: 80
Review time: 22.9 weeks
About the Journal

Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.
Latest Articles in This Journal

Editorial Board
Enhancing analysis of diadochokinetic speech using deep neural networks
Copiously Quote Classics: Improving Chinese Poetry Generation with historical allusion knowledge
Significance of chirp MFCC as a feature in speech and audio applications
Artificial disfluency detection, uh no, disfluency generation for the masses