BanglaLM: Data Mining based Bangla Corpus for Language Model Research

2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA); Pub Date: 2021-09-02; DOI: 10.1109/ICIRCA51532.2021.9544818

M. Kowsher, Md. Jashim Uddin, A. Tahabilder, Md Ruhul Amin, Md. Fahim Shahriar, Md. Shohanur Islam Sobuj


Citations: 4

Abstract



Natural language processing (NLP) is an area of machine learning that has recently attracted considerable attention, driven by advances in artificial intelligence, robotics, and smart devices. NLP focuses on training machines to understand and analyze languages, extract meaningful information from them, translate from one language to another, correct grammar, predict the next word, complete a sentence, or even generate an entirely new sentence from an existing corpus. A major challenge in NLP is training a model to high prediction accuracy, because training requires a vast dataset. For widely used languages such as English, many datasets are available for NLP tasks such as model training and summarization, but for Bengali, which is spoken primarily in South Asia, there is a dearth of large datasets that can be used to build a robust machine learning model. NLP researchers who work mainly with Bengali would therefore find an extensive, robust dataset very useful. With this pressing issue in mind, this work prepares a dataset whose content is curated from social media, blogs, newspapers, wiki pages, and other similar resources. The dataset contains 19,132,010 samples, each between 3 and 512 words long. It can readily be used to build unsupervised machine learning models for NLP tasks involving Bengali. This work also releases two preprocessed versions of the dataset, suited to training both core machine learning-based and statistical models. Because very few attempts have been made in this domain, the authors believe the proposed dataset will contribute significantly to the Bengali machine learning and NLP community.
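The abstract states that sample lengths range from 3 to 512 words but does not say how that range was enforced. As a minimal sketch only, the following shows one way such a length filter could be written. The function names `word_count` and `filter_by_length`, the whitespace-based definition of a word, and the sample strings are all assumptions made for illustration, not the authors' method.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens.

    Assumption: a "word" is any run of non-whitespace characters.
    The paper may define words differently.
    """
    return len(text.split())


def filter_by_length(samples, min_words=3, max_words=512):
    """Keep only samples whose word count lies in [min_words, max_words]."""
    return [s for s in samples if min_words <= word_count(s) <= max_words]


# Hypothetical samples, for illustration only.
samples = [
    "আমি বাংলায় গান গাই",  # 4 words: kept
    "ধন্যবাদ",              # 1 word: dropped as too short
    "word " * 600,          # 600 words: dropped as too long
]
kept = filter_by_length(samples)
```

Splitting on whitespace keeps the sketch free of dependencies. A real pipeline for Bengali would probably also need to handle punctuation such as the danda (।) before counting words.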
