BanglaLM: Data Mining based Bangla Corpus for Language Model Research

M. Kowsher, Md. Jashim Uddin, A. Tahabilder, Md Ruhul Amin, Md. Fahim Shahriar, Md. Shohanur Islam Sobuj
2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA). Published 2021-09-02.
DOI: 10.1109/ICIRCA51532.2021.9544818 (https://doi.org/10.1109/ICIRCA51532.2021.9544818)
Citations: 4

Abstract

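Natural language processing (NLP) is an area of machine learning that has attracted considerable attention in recent years, driven by advances in artificial intelligence, robotics, and smart devices. NLP focuses on training machines to understand and analyze languages, extract meaningful information from text, translate from one language to another, correct grammar, predict the next word, complete a sentence, or even generate an entirely new sentence from an existing corpus. A major challenge in NLP is training a model to high prediction accuracy, because training requires a very large dataset. For widely used languages such as English, many datasets are available for tasks like model training and summarization. For Bengali, which is spoken primarily in South Asia, there is a dearth of large datasets that can be used to build a robust machine learning model. NLP researchers who work mainly with Bengali would therefore benefit greatly from an extensive, robust dataset. With this issue in mind, this work prepares a dataset whose content is curated from social media, blogs, newspapers, wiki pages, and similar resources. The dataset contains 19,132,010 samples, each between 3 and 512 words long. It can be used to build unsupervised machine learning models for NLP tasks involving the Bengali language. This work also releases two preprocessed versions of the dataset, suited to training core machine learning-based models and statistical models. Because very few attempts have been made in this domain, the authors believe the proposed dataset will contribute significantly to the Bengali machine learning and NLP community.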
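The length bounds stated in the abstract (3 to 512 words per sample) can be illustrated with a small filter. This is only a sketch under stated assumptions: the abstract does not describe the tokenizer or the preprocessing pipeline, so whitespace splitting and the names `word_count` and `filter_by_length` are hypothetical choices made for illustration, not the authors' method.

```python
from typing import Iterable, Iterator

MIN_WORDS = 3    # lower bound stated in the abstract
MAX_WORDS = 512  # upper bound stated in the abstract


def word_count(sample: str) -> int:
    """Count whitespace-delimited tokens.

    Assumption: whitespace splitting stands in for whatever
    tokenization the dataset authors actually used.
    """
    return len(sample.split())


def filter_by_length(
    samples: Iterable[str],
    min_words: int = MIN_WORDS,
    max_words: int = MAX_WORDS,
) -> Iterator[str]:
    """Yield only samples whose word count lies in [min_words, max_words]."""
    for sample in samples:
        if min_words <= word_count(sample) <= max_words:
            yield sample


if __name__ == "__main__":
    # Toy inputs: a three-word Bengali sentence, a one-word sample,
    # and an over-long sample of 600 tokens.
    toy = ["আমি ভাত খাই", "হ্যাঁ", "a " * 600]
    for kept in filter_by_length(toy):
        print(word_count(kept), kept)
```

Because the function is a generator, it can be applied to a corpus streamed line by line from disk without loading all samples into memory, which matters at the scale of roughly 19 million samples.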