Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544728
Sandeep Kaur
In this research paper, the author presents MATLAB simulation results showing that applying a new optimization technique called Grey Wolf Optimization to the Distribution Static Compensator (D-STATCOM) yields further improved results for voltage sag and current swell under three-phase, double-line-to-ground, and single-line-to-ground faults. To remove the total harmonic distortion in the distribution system, the Grey Wolf Optimization (GWO) technique has been introduced. The results obtained with GWO are very encouraging, and it has further reduced voltage sag and current swell in the distribution system.
{"title":"Removal of Unsymmetrical faults and Analysis of Total Harmonic Distortion by using Grey Wolf Optimization Technique","authors":"Sandeep Kaur","doi":"10.1109/ICIRCA51532.2021.9544728","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544728","url":null,"abstract":"In this Research paper the Author has obtained results through the use of MATLAB SIMULATION which shows that when a new optimization technique called Grey Wolf is applied on Distribution Static Compensator (D-STATCOM), it leads to further improved results for voltage sag and current swell for the three phase fault, double line to ground fault and single line to ground fault. To remove the Total Harmonic Distortion in Distribution System, a new optimization Technique called Grey Wolf Optimization (GWO) has been introduced. The results obtained after using GWO are very encouraging and it has further reduced voltage sag and current swell in the distribution system.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124363490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
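As a rough illustration of the optimizer named above, here is a minimal, generic sketch of canonical Grey Wolf Optimization in Python. It minimizes a placeholder sphere function rather than the paper's D-STATCOM objective; the population size, iteration count, and bounds below are arbitrary choices, not the paper's settings.

```python
import random

def encircle(leader, pos, a, rng):
    # GWO encircling step: A in [-a, a] trades exploration for exploitation
    A = 2 * a * rng.random() - a
    C = 2 * rng.random()
    D = abs(C * leader - pos)      # distance to the leader
    return leader - A * D

def gwo(objective, dim, bounds, n_wolves=20, iters=100, seed=1):
    """Minimal Grey Wolf Optimizer: the pack follows the three best
    wolves (alpha, beta, delta) found so far."""
    rng = random.Random(seed)
    lo, hi = bounds
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)]
              for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=objective)
        alpha, beta, delta = (list(w) for w in wolves[:3])  # snapshot leaders
        a = 2 - 2 * t / iters      # decreases linearly from 2 to 0
        for w in wolves:
            for j in range(dim):
                x1 = encircle(alpha[j], w[j], a, rng)
                x2 = encircle(beta[j], w[j], a, rng)
                x3 = encircle(delta[j], w[j], a, rng)
                w[j] = min(hi, max(lo, (x1 + x2 + x3) / 3))
    return min(wolves, key=objective)

sphere = lambda x: sum(v * v for v in x)
best = gwo(sphere, dim=3, bounds=(-10.0, 10.0))
```

In the D-STATCOM setting, the objective would instead score a candidate controller parameterization by the resulting sag/swell and THD, but the update loop is the same.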
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544922
Guanglei Zhao, Yuhuan Shi
This paper studies a modern econometric model system based on panel data and an intelligent fuzzy clustering model. We use normalized sensitivity to analyze the sensitivity of a general multi-layer feed-forward network econometric model. This sensitivity not only considers first-order partial derivative information but also takes into account the distribution of the economic system's inputs. Classical econometric models mostly take the form of constant parameters; however, with the development of non-classical econometric models, other parameter forms have emerged, including variable parameters, non-parametric forms, and semi-parametric forms. Hence, we consider the core aspects of these different perspectives to construct an efficient model. The designed approach is simulated on the collected data sets and compared with other methods.
{"title":"Analysis of Modern Econometric Model System Based on Panel Data and Intelligent Fuzzy Clustering Model","authors":"Guanglei Zhao, Yuhuan Shi","doi":"10.1109/ICIRCA51532.2021.9544922","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544922","url":null,"abstract":"Analysis of the modern econometric model system based on panel data and intelligent fuzzy clustering model is studied in this paper. We use normalized sensitivity to analyze the sensitivity of the general multi-layer feed-forward network econometric model. The sensitivity not only considers the first-order partial derivative information, but also takes into account the distribution of economic system inputs. Classical econometric models are mostly in the form of constant parameters. However, with the development of non-classical econometric models, other parameter forms have emerged, including variable parameters, non-parameters, and semi-parameters. Hence, we consider the core aspects of the different perspectives to construct the efficient model. The designed approach is simulated on the collected data sets and the compared with the other methods.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114848765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
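One way to read "normalized sensitivity" as described above is a first-order derivative scaled by the spread of the input and output over the data set. The sketch below is an assumed formulation (central-difference derivative times input/output standard deviations), applicable to any callable model `f`, not the paper's exact definition or network:

```python
import statistics

def normalized_sensitivity(f, inputs, i, eps=1e-4):
    """Mean first-order sensitivity of model f to input feature i,
    scaled by the data spread so features are comparable."""
    derivs = []
    for x in inputs:
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        derivs.append((f(xp) - f(xm)) / (2 * eps))  # central difference
    ys = [f(x) for x in inputs]
    sx = statistics.pstdev(x[i] for x in inputs)    # input-distribution aware
    sy = statistics.pstdev(ys) or 1.0
    return statistics.mean(d * sx / sy for d in derivs)
```

Because the derivatives are averaged over the observed inputs, the measure reflects the distribution of the economic data rather than sensitivity at a single operating point.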
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544840
Suman Sansanwal, Nitin Jain
These days, cloud computing plays a major role in the world of computing. Cloud computing provides on-demand services via the Internet using large amounts of virtual storage. Its most important feature is that the user does not need to establish costly computing infrastructure and pays very little for its services. Virtualization is the backbone of the resource sharing that cloud computing provides. Security is a huge challenge for cloud computing: because users can access resources anywhere, anytime via the Internet, it is exposed to a wide variety of attacks. Various threats, such as data breaches, data leakage, and unauthorized data access, operate at different cloud layers. Even though security issues and their countermeasures in cloud computing are constantly being implemented and improved over time, security unfortunately remains a big challenge. This paper presents an examination based on a theoretical survey of cloud computing that describes various possible threats, along with a taxonomy model in which, at each layer, a number of security attacks arise from the usage of different cloud services; for these attacks, previously proposed mechanisms and available solutions are also discussed.
{"title":"Security Attacks in Cloud Computing: A Systematic Review","authors":"Suman Sansanwal, Nitin Jain","doi":"10.1109/ICIRCA51532.2021.9544840","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544840","url":null,"abstract":"These days cloud computing plays very major role in the world of computers. On demand services via Internet is provided by cloud computing using large amount of virtual storage. Its most important feature is that user has no need to establish costly computing infrastructure and pay very less for its services. Virtualization is the backbone of resource sharing provided by cloud computing. Security is huge challenge of cloud computing. Cloud computing allows the user to access resources anywhere anytime via internet which is actually the main reason behind the multiple varieties of attacks. Generally at different cloud layers various threats functioning like data breach, data leakage and unauthorized data access. Even there is constant implementation and improvement occurring regarding security issues and its countermeasures over cloud computing with the growth of time, unfortunately security is still a big challenge. This paper includes a examination based on a theoretical survey on cloud computing that communicated various possible threats and also taxonomy model where at each layer a number of various security attacks enter from the usage of different cloud services, moreover for these attacks proposed mechanisms and the solutions available earlier.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114916290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544943
Ammanath Gopal, M. Sailatha, S. Vikas, G. Sampath
This paper predicts the diseases of patients admitted to critical care units by considering their symptoms. The system operates at the patient's bedside and predicts diseases so that basic treatment can be provided without any delay. By providing basic medication to patients, the occurrence of serious conditions and circumstances can be prevented. In hospitals, the existing decision system operates using a three-phase approach, which is prone to delay and inaccuracy. The proposed system eliminates inaccurate and delayed results by considering moderate-sized datasets and hence yields better and faster results.
{"title":"A Real Time Clinical Decision System for Risk Prediction and Severity in Critical Ill Patients Using Machine Learning","authors":"Ammanath Gopal, M. Sailatha, S. Vikas, G. Sampath","doi":"10.1109/ICIRCA51532.2021.9544943","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544943","url":null,"abstract":"This paper predicts the diseases of the patients by considering their symptoms who are admitted in the critical care units. This system operates at the bed side of the patients and predicts the diseases so that the basic treatment is provided to the patients without any delay. By providing basic medication to the patients, the occurrence of serious conditions and circumstances can be prevented. In hospitals, there is a decision system that operates using three phase approach which is prone to delay and inaccuracy. The proposed system eradicates the inaccurate and delayed results by considering the moderate datasets and hence yields better and fast results.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120825366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544102
Razkeen Shaikh, Nikita Phulkar, H. Bhute, S. Shaikh, Prajakta Bhapkar
In the field of online job recruiting, accurate job and resume categorization is critical for both the seeker and the recruiter. Using Natural Language Processing (NLP) technology, we have developed an autonomous text classification system that POS-tags, tokenizes, and lemmatizes the data. We have utilized a phrase matcher to calculate resume scores based on the recruiter's information, suggest lacking skills to users, and provide the top resumes to the recruiter. We divided candidates into groups based on the information in their resumes and used domain adaptation due to the sensitive nature of resume content. Word order similarity between sentences is used to categorize the resume data against a large dataset of job descriptions. Finally, the proposed system is presented together with its findings and analysis; evaluation showed improved precision and recall.
{"title":"An Intelligent framework for E-Recruitment System Based on Text Categorization and Semantic Analysis","authors":"Razkeen Shaikh, Nikita Phulkar, H. Bhute, S. Shaikh, Prajakta Bhapkar","doi":"10.1109/ICIRCA51532.2021.9544102","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544102","url":null,"abstract":"In the field of online job recruiting, accurate job and resume categorization is critical for both the seeker and the recruiter. Using Natural Language Processing (NLP) technology we have developed an autonomous text classification system that POS tag, tokenizes, Lemmatize the data. We have utilized Phrase Matcher to calculate the score of resumes based on recruiter's information, suggest lacking skills to users, and provide the top resumes to the recruiter. Finally, the proposed system is presented together with its findings and analysis. We divided candidates into groups based on the information in their resumes. We used domain adaptation due to the sensitive nature of the resumes content. A Word Order Similarity between Sentences is used to categorize the resume data on large dataset of job description. The System is evaluated and resulted in improved precision and recall.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116427489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
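The resume-scoring idea above can be sketched with a simplified, pure-Python stand-in for a phrase matcher. The paper's exact matcher and scoring scheme are not specified here; `resume_score` and the recruiter's skills list are illustrative names, and whole-word matching replaces real tokenization and lemmatization:

```python
import re

def resume_score(resume_text, required_skills):
    """Fraction of recruiter-required skill phrases found in the resume,
    plus the list of skills the candidate is missing (to suggest back)."""
    # normalize whitespace and case before matching
    text = " ".join(resume_text.lower().split())

    def has(skill):
        # whole-word/phrase match so "java" does not match "javascript"
        return re.search(r"\b" + re.escape(skill.lower()) + r"\b", text) is not None

    missing = [s for s in required_skills if not has(s)]
    score = 1 - len(missing) / len(required_skills)
    return score, missing
```

Ranking resumes for the recruiter is then just sorting candidates by this score, while the `missing` list drives the "lacking skills" suggestions to the seeker.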
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544832
V. Soni, A. Soni
Cervical cancer is the second most common cancer in women worldwide, and the Pap smear is one of the most widely used methods for detecting cervical cancer early. Developing nations such as India must confront hurdles in order to manage an increasing number of patients on a daily basis. In this paper, various online and offline machine learning techniques were applied to benchmarked data sets to diagnose cervical cancer. The importance of machine learning can be seen in various fields, as it provides many benefits in completing such tasks. Medical image analysis is performed for diagnostic purposes by creating pictures of the structures and activities inside the body. The use of machine learning for medical image analysis provides various benefits during the diagnosis of a patient's diseases. CNN-CRF provides various applications for analyzing and capturing pictures of the internal structures of the human body. Machine learning applications help in analyzing different types of medical images, such as CT scans. Medical image analysis is an area that has benefited greatly from machine learning.
{"title":"Cervical cancer diagnosis using convolution neural network with conditional random field","authors":"V. Soni, A. Soni","doi":"10.1109/ICIRCA51532.2021.9544832","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544832","url":null,"abstract":"Cervical cancer is the second most common disease in women worldwide, and the Pap smear is one of the most used methods for detecting cervical cancer early on. Developing nations, such as India, must confront hurdles in order to manage an increasing number of patients on a daily basis. Various online and offline machine learning techniques were used on benchmarked data sets to diagnose cervical cancer in this paper. the importance of machine learning can be seen in the various fields as it provides various benefits in the completion of the task. Medical image analysis is done for diagnostic purposes in the medical form but creating pictures of the structures and activities inside the body. The use of machine learning for medical image analysis provides various benefits during the diagnosis of a person's diseases. CNN-CRF provides various applications for analyzing the structure and capturing the picture of the inside body structure of the human. Different applications of machine learning help in analyzing the different types of the medical image such as neural networks and CT scans. Medical image analysis is the area that has been largely benefited by machine learning.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123962873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9545050
Vemuri Triveni, R. Priyanka, Koya Dinesh Teja, Y. Sangeetha
The novel coronavirus (COVID-19), which has been designated a pandemic by the World Health Organization, has infected over 1 million individuals and killed many. COVID-19 infection may progress to pneumonia, which can be diagnosed via a chest X-ray. This research work proposes a novel technique for automatically detecting COVID-19 infection using chest X-rays. It used 500 X-rays of patients diagnosed with coronavirus and 500 X-rays of healthy individuals to generate a data set. Due to the scarcity of publicly accessible images of COVID-19 patients, this study has been approached through the lens of transfer learning. The work integrates different convolutional neural network (CNN) architectures trained on ImageNet to function as X-ray image feature extractors, and then combines them with well-established machine learning classifiers such as k-Nearest Neighbors, naive Bayes, Random Forest, Support Vector Machine (SVM), and Multilayer Perceptron (MLP). The findings indicate that the most successful extractor-classifier combination on one of the data sets is the InceptionV3 architecture with an SVM classifier with a linear kernel, achieving an accuracy of 99.421 percent. On the other benchmark, the best combination is ResNet50 with MLP, at 97.461% accuracy. As a result, the suggested technique demonstrates the efficacy of detecting COVID-19 using X-rays.
{"title":"Programmable Detection of COVID-19 Infection Using Chest X-Ray Images Through Transfer Learning","authors":"Vemuri Triveni, R. Priyanka, Koya Dinesh Teja, Y. Sangeetha","doi":"10.1109/ICIRCA51532.2021.9545050","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9545050","url":null,"abstract":"The novel Coronavirus (COVID-19), which has been designated a pandemic by the World Health Organization, has infected over 1 million individuals and killed many. COVID-19 infection may progress to pneumonia, which can be diagnosed via a chest X-ray. This research work proposes a novel technique for automatically detecting COVID-19 infection using chest X-rays. This research used 500 X-rays of patients diagnosed with coronavirus and 500 X-rays of healthy individuals to generate a data set. Due to the scarcity of publicly accessible pictures of COVID-19 patients, this research study has been attempted via the lens of knowledge transmission. Also, this research work integrates different convolutional neural network (CNN) architectures trained on Image Net to function as X-ray image feature extractors. After that, integrate CNN with well-established machine learning methods such as k Nearest Neighbor, Bayes, Random Forest, Multilayer Perceptron (MLP). The findings indicate that the most successful extractor-classifier combination for one of the data sets is the InceptionV3 architecture, which has an SVM classifier with a linear kernel that achieves an accuracy of 99.421 percent. Another benchmark, the best combination, is ResNet50 with MLP, which has 97.461%accuracy. As a result, the suggested technique demonstrates the efficacy of detecting COVID-19 using X-rays.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125786175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
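The extractor-classifier pattern evaluated in that paper can be sketched as follows. To keep the example self-contained, the pretrained CNN extractor is assumed to have already produced feature vectors, and a simple nearest-centroid head stands in for the SVM/MLP classifiers actually compared; function names here are illustrative:

```python
def fit_centroid_head(features, labels):
    """Fit a nearest-centroid classification head on feature vectors
    that a frozen, pretrained CNN extractor would produce."""
    groups = {}
    for f, y in zip(features, labels):
        groups.setdefault(y, []).append(f)
    # one centroid (mean feature vector) per class
    return {y: [sum(col) / len(fs) for col in zip(*fs)]
            for y, fs in groups.items()}

def predict(centroids, feature):
    # assign the label whose centroid is nearest in feature space
    return min(centroids,
               key=lambda y: sum((a - b) ** 2
                                 for a, b in zip(feature, centroids[y])))
```

The transfer-learning point is that only this small head is trained on the 1,000 X-ray features; the expensive ImageNet-trained extractor stays frozen.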
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544855
Stephin Philip, Pawan Vashisth, Anant Chaturvedi, Neha Gupta
A data warehouse aids in the management of the large amounts of data that may be stored to handle user input during the computing process. The major issue with a data warehouse is maintaining the quality of the data that the user stores. Some traditional techniques can improve data quality while also increasing efficiency. Each unit of data has a unique feature that has been researched by many researchers and has an influence on data quality. This research article enhances the K-Means method by utilizing the Euclidean distance metric to detect missing values in the gathered sources and replace them with the closest values while maintaining the data's consistency, exactness, and quality. The improved data will assist developers in analysing data quality prior to data integration by allowing them to make informed decisions quickly in accordance with business requirements. The improved K-Means achieves better accuracy and requires less computational time for clustering data objects when compared to other related approaches.
{"title":"Imputation of Missing Values using Improved K-Means Clustering Algorithm to Attain Data Quality","authors":"Stephin Philip, Pawan Vashisth, Anant Chaturvedi, Neha Gupta","doi":"10.1109/ICIRCA51532.2021.9544855","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544855","url":null,"abstract":"A data warehouse aids in the management of large amounts of data that may be stored in order to handle user input during the computer process. The major issue with a data warehouse is to maintain the data that the user stores in good quality. Some traditional techniques can improve data quality while also increasing efficiency. Each unit of data has a unique feature that has been researched by many researchers and has an influence on data quality. This research article has enhanced the K-Means method by utilizing the Euclidean Distance metric to detect missing values from the gathered sources and replace them with closest values while maintaining the data's consistency, exactness, and quality. yThe improved data will assist developers in analysing data quality prior to data integration by allowing them to make informed decisions quickly in accordance with business requirements. Improved K-Means achieves better accuracy and requires less computational time for clustering data objects when compared to other related approaches.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124860272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
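A minimal sketch of the imputation scheme described above, assuming the approach clusters the complete rows with Euclidean-distance K-Means and fills each missing value from the nearest centroid; the function names and the centroid-seeding strategy are illustrative, not taken from the paper:

```python
import math

def nearest(row, centroids):
    # index of the centroid closest to row (squared Euclidean distance)
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))

def impute_kmeans(rows, k=2, iters=20):
    """Cluster the complete rows with K-Means, then fill each missing
    value (None) from the nearest centroid, where nearness is measured
    over the observed features only."""
    complete = [r for r in rows if None not in r]
    # spread the initial centroids across the complete rows (naive seeding)
    centroids = [list(complete[i * len(complete) // k]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in complete:
            clusters[nearest(r, centroids)].append(r)
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    result = []
    for r in rows:
        if None in r:
            known = [(j, v) for j, v in enumerate(r) if v is not None]
            # Euclidean distance over observed features only
            best = min(range(k), key=lambda c: math.sqrt(
                sum((v - centroids[c][j]) ** 2 for j, v in known)))
            r = [centroids[best][j] if v is None else v for j, v in enumerate(r)]
        result.append(list(r))
    return result
```

Replacing a gap with the matching feature of the nearest centroid keeps the imputed row consistent with its cluster, which is the consistency/exactness property the abstract emphasizes.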
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544975
S. Sophia, B. Shankar, K. Akshya, AR. C. Arunachalam, V. T. Y. Avanthika, S. Deepak
As of now, the Global Positioning System (GPS) is the leading outdoor positioning system. However, GPS fails indoors because its signal does not penetrate easily through solid objects and there is no line of sight. Since GPS is unreliable indoors, an alternative technology called the Indoor Positioning System (IPS) has emerged. Indoor positioning is accomplished using several techniques and devices; the proposed model uses a Bluetooth Low Energy (BLE)-based positioning system. This paper focuses on implementing BLE-based indoor positioning using the ESP32 NodeMCU.
{"title":"Bluetooth Low Energy based Indoor Positioning System using ESP32","authors":"S. Sophia, B. Shankar, K. Akshya, AR. C. Arunachalam, V. T. Y. Avanthika, S. Deepak","doi":"10.1109/ICIRCA51532.2021.9544975","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544975","url":null,"abstract":"As of now, the Global Positioning System (GPS) is the leading outdoor positioning system, However, indoors, GPS is a flop because the signal does not penetrate easily through solid objects and there is no line-of-sight. Since GPS is unreliable in indoors the alternative technology emerged called Indoor Positioning System (IPS). Indoor positioning is accomplished using several techniques and devices. The proposed model prefers to use Bluetooth Low Energy-based positioning system. This paper focuses on implementing BLE based indoor positioning using ES P32-Node MCU.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125027700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
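A common way such a BLE system turns beacon readings into a position, shown here as a sketch rather than the paper's exact method: convert each beacon's RSSI to a distance with the log-distance path-loss model, then trilaterate from three beacons. The `tx_power` (RSSI at 1 m) and path-loss exponent `n` are assumed calibration values.

```python
import math

def rssi_to_distance(rssi, tx_power=-59, n=2.0):
    # log-distance path-loss model: rssi = tx_power - 10*n*log10(d)
    return 10 ** ((tx_power - rssi) / (10 * n))

def trilaterate(beacons, distances):
    """2D position from three beacons by linearizing the circle
    equations (subtract the first equation from the other two)."""
    (x1, y1), (x2, y2), (x3, y3) = beacons
    r1, r2, r3 = distances
    A = 2 * (x2 - x1); B = 2 * (y2 - y1)
    C = r1**2 - r2**2 - x1**2 + x2**2 - y1**2 + y2**2
    D = 2 * (x3 - x2); E = 2 * (y3 - y2)
    F = r2**2 - r3**2 - x2**2 + x3**2 - y2**2 + y3**2
    det = A * E - B * D            # zero if the beacons are collinear
    return ((C * E - B * F) / det, (A * F - C * D) / det)
```

In practice the ESP32 nodes would average RSSI over many advertisements before converting, since indoor RSSI is noisy; more than three beacons would call for a least-squares fit instead.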
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544818
M. Kowsher, Md. Jashim Uddin, A. Tahabilder, Md Ruhul Amin, Md. Fahim Shahriar, Md. Shohanur Islam Sobuj
Natural language processing (NLP) is an area of machine learning that has garnered a lot of attention in recent years due to the revolution in artificial intelligence, robotics, and smart devices. NLP focuses on training machines to understand and analyze various languages, extract meaningful information from them, translate from one language to another, correct grammar, predict the next word, complete a sentence, or even generate a completely new sentence from an existing corpus. A major challenge in NLP lies in training a model to obtain high prediction accuracy, since training needs a vast dataset. For widely used languages like English, many datasets are available for NLP tasks such as model training and summarization, but for languages like Bengali, which is spoken primarily in South Asia, there is a dearth of big datasets that can be used to build a robust machine learning model. Therefore, NLP researchers who work mainly with the Bengali language will find an extensive, robust dataset incredibly useful for their NLP tasks. With this pressing issue in mind, this research work has prepared a dataset whose content is curated from social media, blogs, newspapers, wiki pages, and other similar resources. The dataset contains 19,132,010 samples, whose length varies from 3 to 512 words. It can easily be used to build any unsupervised machine learning model with the aim of performing the necessary NLP tasks involving the Bengali language. This research work is also releasing two preprocessed versions of the dataset, suited for training core machine learning-based and statistical models respectively. As very few attempts have been made in this domain, it is believed that the proposed dataset will significantly contribute to the Bengali machine learning and NLP community.
{"title":"BanglaLM: Data Mining based Bangla Corpus for Language Model Research","authors":"M. Kowsher, Md. Jashim Uddin, A. Tahabilder, Md Ruhul Amin, Md. Fahim Shahriar, Md. Shohanur Islam Sobuj","doi":"10.1109/ICIRCA51532.2021.9544818","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544818","url":null,"abstract":"Natural language processing (NLP) is an area of machine learning that has garnered a lot of attention in recent days due to the revolution in artificial intelligence, robotics, and smart devices. NLP focuses on training machines to understand and analyze various languages, extract meaningful information from those, translate from one language to another, correct grammar, predict the next word, complete a sentence, or even generate a completely new sentence from an existing corpus. A major challenge in NLP lies in training the model for obtaining high prediction accuracy since training needs a vast dataset. For widely used languages like English, there are many datasets available that can be used for NLP tasks like training a model and summarization but for languages like Bengali, which is only spoken primarily in South Asia, there is a dearth of big datasets which can be used to build a robust machine learning model. Therefore, NLP researchers who mainly work with the Bengali language will find an extensive, robust dataset incredibly useful for their NLP tasks involving the Bengali language. With this pressing issue in mind, this research work has prepared a dataset whose content is curated from social media, blogs, newspapers, wiki pages, and other similar resources. The amount of samples in this dataset is 19132010, and the length varies from 3 to 512 words. This dataset can easily be used to build any unsupervised machine learning model with an aim to performing necessary NLP tasks involving the Bengali language. Also, this research work is releasing two preprocessed version of this dataset that is especially suited for training both core machine learning-based and statistical-based model. As very few attempts have been made in this domain, keeping Bengali language researchers in mind, it is believed that the proposed dataset will significantly contribute to the Bengali machine learning and NLP community.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128557583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
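Corpus preparation of the kind described for BanglaLM (cleaning scraped text, deduplicating, and enforcing the reported 3-512 word range) might look like the sketch below. The actual BanglaLM preprocessing pipeline is not specified here; URL stripping is an assumed cleaning step for web-scraped sources, and the example strings are placeholders:

```python
import re

def clean_sample(text):
    # drop URLs scraped from blogs/social media, then collapse whitespace
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(raw_samples, min_words=3, max_words=512):
    """Deduplicate, clean, and length-filter raw scraped samples,
    mirroring the 3-512 word range the dataset reports."""
    seen, corpus = set(), []
    for s in map(clean_sample, raw_samples):
        n = len(s.split())
        if min_words <= n <= max_words and s not in seen:
            seen.add(s)
            corpus.append(s)
    return corpus
```

Whitespace-based word counting is a simplification; a Bengali-aware tokenizer would be the natural substitute when processing the real corpus.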