Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544728
Sandeep Kaur
In this research paper, the author presents MATLAB simulation results showing that applying a new optimization technique called Grey Wolf Optimization to the Distribution Static Compensator (D-STATCOM) yields further improved results for voltage sag and current swell under three-phase, double-line-to-ground, and single-line-to-ground faults. To remove the total harmonic distortion in the distribution system, the Grey Wolf Optimization (GWO) technique has been introduced. The results obtained with GWO are very encouraging, and it has further reduced voltage sag and current swell in the distribution system.
{"title":"Removal of Unsymmetrical faults and Analysis of Total Harmonic Distortion by using Grey Wolf Optimization Technique","authors":"Sandeep Kaur","doi":"10.1109/ICIRCA51532.2021.9544728","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544728","url":null,"abstract":"In this Research paper the Author has obtained results through the use of MATLAB SIMULATION which shows that when a new optimization technique called Grey Wolf is applied on Distribution Static Compensator (D-STATCOM), it leads to further improved results for voltage sag and current swell for the three phase fault, double line to ground fault and single line to ground fault. To remove the Total Harmonic Distortion in Distribution System, a new optimization Technique called Grey Wolf Optimization (GWO) has been introduced. The results obtained after using GWO are very encouraging and it has further reduced voltage sag and current swell in the distribution system.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124363490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
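As a rough illustration of the optimizer named above, here is a minimal, generic sketch of canonical Grey Wolf Optimization in Python. It minimizes a placeholder sphere function rather than the paper's D-STATCOM objective; the population size, iteration count, and bounds below are arbitrary choices, not the paper's settings.

```python
import random

def encircle(leader, pos, a, rng):
    # GWO encircling step: A in [-a, a] trades exploration for exploitation
    A = 2 * a * rng.random() - a
    C = 2 * rng.random()
    D = abs(C * leader - pos)      # distance to the leader
    return leader - A * D

def gwo(objective, dim, bounds, n_wolves=20, iters=100, seed=1):
    """Minimal Grey Wolf Optimizer: the pack follows the three best
    wolves (alpha, beta, delta) found so far."""
    rng = random.Random(seed)
    lo, hi = bounds
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)]
              for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=objective)
        alpha, beta, delta = (list(w) for w in wolves[:3])  # snapshot leaders
        a = 2 - 2 * t / iters      # decreases linearly from 2 to 0
        for w in wolves:
            for j in range(dim):
                x1 = encircle(alpha[j], w[j], a, rng)
                x2 = encircle(beta[j], w[j], a, rng)
                x3 = encircle(delta[j], w[j], a, rng)
                w[j] = min(hi, max(lo, (x1 + x2 + x3) / 3))
    return min(wolves, key=objective)

sphere = lambda x: sum(v * v for v in x)
best = gwo(sphere, dim=3, bounds=(-10.0, 10.0))
```

In the D-STATCOM setting, the objective would instead score a candidate controller parameterization by the resulting sag/swell and THD, but the update loop is the same.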
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544922
Guanglei Zhao, Yuhuan Shi
This paper studies a modern econometric model system based on panel data and an intelligent fuzzy clustering model. We use normalized sensitivity to analyze the sensitivity of a general multi-layer feed-forward network econometric model. This sensitivity not only considers first-order partial derivative information but also takes into account the distribution of the economic system's inputs. Classical econometric models mostly take the form of constant parameters; however, with the development of non-classical econometric models, other parameter forms have emerged, including variable parameters, non-parametric forms, and semi-parametric forms. Hence, we consider the core aspects of these different perspectives to construct an efficient model. The designed approach is simulated on the collected data sets and compared with other methods.
{"title":"Analysis of Modern Econometric Model System Based on Panel Data and Intelligent Fuzzy Clustering Model","authors":"Guanglei Zhao, Yuhuan Shi","doi":"10.1109/ICIRCA51532.2021.9544922","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544922","url":null,"abstract":"Analysis of the modern econometric model system based on panel data and intelligent fuzzy clustering model is studied in this paper. We use normalized sensitivity to analyze the sensitivity of the general multi-layer feed-forward network econometric model. The sensitivity not only considers the first-order partial derivative information, but also takes into account the distribution of economic system inputs. Classical econometric models are mostly in the form of constant parameters. However, with the development of non-classical econometric models, other parameter forms have emerged, including variable parameters, non-parameters, and semi-parameters. Hence, we consider the core aspects of the different perspectives to construct the efficient model. The designed approach is simulated on the collected data sets and the compared with the other methods.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114848765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
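One way to read "normalized sensitivity" as described above is a first-order derivative scaled by the spread of the input and output over the data set. The sketch below is an assumed formulation (central-difference derivative times input/output standard deviations), applicable to any callable model `f`, not the paper's exact definition or network:

```python
import statistics

def normalized_sensitivity(f, inputs, i, eps=1e-4):
    """Mean first-order sensitivity of model f to input feature i,
    scaled by the data spread so features are comparable."""
    derivs = []
    for x in inputs:
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        derivs.append((f(xp) - f(xm)) / (2 * eps))  # central difference
    ys = [f(x) for x in inputs]
    sx = statistics.pstdev(x[i] for x in inputs)    # input-distribution aware
    sy = statistics.pstdev(ys) or 1.0
    return statistics.mean(d * sx / sy for d in derivs)
```

Because the derivatives are averaged over the observed inputs, the measure reflects the distribution of the economic data rather than sensitivity at a single operating point.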
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544840
Suman Sansanwal, Nitin Jain
These days, cloud computing plays a major role in the world of computing. Cloud computing provides on-demand services via the Internet using large amounts of virtual storage. Its most important feature is that the user does not need to establish costly computing infrastructure and pays very little for its services. Virtualization is the backbone of the resource sharing that cloud computing provides. Security is a huge challenge for cloud computing: because users can access resources anywhere, anytime via the Internet, it is exposed to a wide variety of attacks. Various threats, such as data breaches, data leakage, and unauthorized data access, operate at different cloud layers. Even though security issues and their countermeasures in cloud computing are constantly being implemented and improved over time, security unfortunately remains a big challenge. This paper presents an examination based on a theoretical survey of cloud computing that describes various possible threats, along with a taxonomy model in which, at each layer, a number of security attacks arise from the usage of different cloud services; for these attacks, previously proposed mechanisms and available solutions are also discussed.
{"title":"Security Attacks in Cloud Computing: A Systematic Review","authors":"Suman Sansanwal, Nitin Jain","doi":"10.1109/ICIRCA51532.2021.9544840","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544840","url":null,"abstract":"These days cloud computing plays very major role in the world of computers. On demand services via Internet is provided by cloud computing using large amount of virtual storage. Its most important feature is that user has no need to establish costly computing infrastructure and pay very less for its services. Virtualization is the backbone of resource sharing provided by cloud computing. Security is huge challenge of cloud computing. Cloud computing allows the user to access resources anywhere anytime via internet which is actually the main reason behind the multiple varieties of attacks. Generally at different cloud layers various threats functioning like data breach, data leakage and unauthorized data access. Even there is constant implementation and improvement occurring regarding security issues and its countermeasures over cloud computing with the growth of time, unfortunately security is still a big challenge. This paper includes a examination based on a theoretical survey on cloud computing that communicated various possible threats and also taxonomy model where at each layer a number of various security attacks enter from the usage of different cloud services, moreover for these attacks proposed mechanisms and the solutions available earlier.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114916290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544943
Ammanath Gopal, M. Sailatha, S. Vikas, G. Sampath
This paper predicts the diseases of patients admitted to critical care units by considering their symptoms. The system operates at the patient's bedside and predicts diseases so that basic treatment can be provided without any delay. By providing basic medication to patients, the occurrence of serious conditions and circumstances can be prevented. In hospitals, the existing decision system operates using a three-phase approach, which is prone to delay and inaccuracy. The proposed system eliminates inaccurate and delayed results by considering moderate-sized datasets and hence yields better and faster results.
{"title":"A Real Time Clinical Decision System for Risk Prediction and Severity in Critical Ill Patients Using Machine Learning","authors":"Ammanath Gopal, M. Sailatha, S. Vikas, G. Sampath","doi":"10.1109/ICIRCA51532.2021.9544943","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544943","url":null,"abstract":"This paper predicts the diseases of the patients by considering their symptoms who are admitted in the critical care units. This system operates at the bed side of the patients and predicts the diseases so that the basic treatment is provided to the patients without any delay. By providing basic medication to the patients, the occurrence of serious conditions and circumstances can be prevented. In hospitals, there is a decision system that operates using three phase approach which is prone to delay and inaccuracy. The proposed system eradicates the inaccurate and delayed results by considering the moderate datasets and hence yields better and fast results.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120825366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544102
Razkeen Shaikh, Nikita Phulkar, H. Bhute, S. Shaikh, Prajakta Bhapkar
In the field of online job recruiting, accurate job and resume categorization is critical for both the seeker and the recruiter. Using Natural Language Processing (NLP) technology, we have developed an autonomous text classification system that POS-tags, tokenizes, and lemmatizes the data. We have utilized a phrase matcher to calculate resume scores based on the recruiter's information, suggest lacking skills to users, and provide the top resumes to the recruiter. We divided candidates into groups based on the information in their resumes and used domain adaptation due to the sensitive nature of resume content. Word order similarity between sentences is used to categorize the resume data against a large dataset of job descriptions. Finally, the proposed system is presented together with its findings and analysis; evaluation showed improved precision and recall.
{"title":"An Intelligent framework for E-Recruitment System Based on Text Categorization and Semantic Analysis","authors":"Razkeen Shaikh, Nikita Phulkar, H. Bhute, S. Shaikh, Prajakta Bhapkar","doi":"10.1109/ICIRCA51532.2021.9544102","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544102","url":null,"abstract":"In the field of online job recruiting, accurate job and resume categorization is critical for both the seeker and the recruiter. Using Natural Language Processing (NLP) technology we have developed an autonomous text classification system that POS tag, tokenizes, Lemmatize the data. We have utilized Phrase Matcher to calculate the score of resumes based on recruiter's information, suggest lacking skills to users, and provide the top resumes to the recruiter. Finally, the proposed system is presented together with its findings and analysis. We divided candidates into groups based on the information in their resumes. We used domain adaptation due to the sensitive nature of the resumes content. A Word Order Similarity between Sentences is used to categorize the resume data on large dataset of job description. The System is evaluated and resulted in improved precision and recall.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116427489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
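The resume-scoring idea above can be sketched with a simplified, pure-Python stand-in for a phrase matcher. The paper's exact matcher and scoring scheme are not specified here; `resume_score` and the recruiter's skills list are illustrative names, and whole-word matching replaces real tokenization and lemmatization:

```python
import re

def resume_score(resume_text, required_skills):
    """Fraction of recruiter-required skill phrases found in the resume,
    plus the list of skills the candidate is missing (to suggest back)."""
    # normalize whitespace and case before matching
    text = " ".join(resume_text.lower().split())

    def has(skill):
        # whole-word/phrase match so "java" does not match "javascript"
        return re.search(r"\b" + re.escape(skill.lower()) + r"\b", text) is not None

    missing = [s for s in required_skills if not has(s)]
    score = 1 - len(missing) / len(required_skills)
    return score, missing
```

Ranking resumes for the recruiter is then just sorting candidates by this score, while the `missing` list drives the "lacking skills" suggestions to the seeker.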
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544832
V. Soni, A. Soni
Cervical cancer is the second most common cancer in women worldwide, and the Pap smear is one of the most widely used methods for detecting cervical cancer early. Developing nations such as India must confront hurdles in order to manage an increasing number of patients on a daily basis. In this paper, various online and offline machine learning techniques were applied to benchmarked data sets to diagnose cervical cancer. The importance of machine learning can be seen in various fields, as it provides many benefits in completing such tasks. Medical image analysis is performed for diagnostic purposes by creating pictures of the structures and activities inside the body. The use of machine learning for medical image analysis provides various benefits during the diagnosis of a patient's diseases. CNN-CRF provides various applications for analyzing and capturing pictures of the internal structures of the human body. Machine learning applications help in analyzing different types of medical images, such as CT scans. Medical image analysis is an area that has benefited greatly from machine learning.
{"title":"Cervical cancer diagnosis using convolution neural network with conditional random field","authors":"V. Soni, A. Soni","doi":"10.1109/ICIRCA51532.2021.9544832","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544832","url":null,"abstract":"Cervical cancer is the second most common disease in women worldwide, and the Pap smear is one of the most used methods for detecting cervical cancer early on. Developing nations, such as India, must confront hurdles in order to manage an increasing number of patients on a daily basis. Various online and offline machine learning techniques were used on benchmarked data sets to diagnose cervical cancer in this paper. the importance of machine learning can be seen in the various fields as it provides various benefits in the completion of the task. Medical image analysis is done for diagnostic purposes in the medical form but creating pictures of the structures and activities inside the body. The use of machine learning for medical image analysis provides various benefits during the diagnosis of a person's diseases. CNN-CRF provides various applications for analyzing the structure and capturing the picture of the inside body structure of the human. Different applications of machine learning help in analyzing the different types of the medical image such as neural networks and CT scans. Medical image analysis is the area that has been largely benefited by machine learning.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123962873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9545050
Vemuri Triveni, R. Priyanka, Koya Dinesh Teja, Y. Sangeetha
The novel coronavirus (COVID-19), which has been designated a pandemic by the World Health Organization, has infected over 1 million individuals and killed many. COVID-19 infection may progress to pneumonia, which can be diagnosed via a chest X-ray. This research work proposes a novel technique for automatically detecting COVID-19 infection using chest X-rays. It used 500 X-rays of patients diagnosed with coronavirus and 500 X-rays of healthy individuals to generate a data set. Due to the scarcity of publicly accessible images of COVID-19 patients, this study has been approached through the lens of transfer learning. The work integrates different convolutional neural network (CNN) architectures trained on ImageNet to function as X-ray image feature extractors, and then combines them with well-established machine learning classifiers such as k-Nearest Neighbors, naive Bayes, Random Forest, Support Vector Machine (SVM), and Multilayer Perceptron (MLP). The findings indicate that the most successful extractor-classifier combination on one of the data sets is the InceptionV3 architecture with an SVM classifier with a linear kernel, achieving an accuracy of 99.421 percent. On the other benchmark, the best combination is ResNet50 with MLP, at 97.461% accuracy. As a result, the suggested technique demonstrates the efficacy of detecting COVID-19 using X-rays.
{"title":"Programmable Detection of COVID-19 Infection Using Chest X-Ray Images Through Transfer Learning","authors":"Vemuri Triveni, R. Priyanka, Koya Dinesh Teja, Y. Sangeetha","doi":"10.1109/ICIRCA51532.2021.9545050","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9545050","url":null,"abstract":"The novel Coronavirus (COVID-19), which has been designated a pandemic by the World Health Organization, has infected over 1 million individuals and killed many. COVID-19 infection may progress to pneumonia, which can be diagnosed via a chest X-ray. This research work proposes a novel technique for automatically detecting COVID-19 infection using chest X-rays. This research used 500 X-rays of patients diagnosed with coronavirus and 500 X-rays of healthy individuals to generate a data set. Due to the scarcity of publicly accessible pictures of COVID-19 patients, this research study has been attempted via the lens of knowledge transmission. Also, this research work integrates different convolutional neural network (CNN) architectures trained on Image Net to function as X-ray image feature extractors. After that, integrate CNN with well-established machine learning methods such as k Nearest Neighbor, Bayes, Random Forest, Multilayer Perceptron (MLP). The findings indicate that the most successful extractor-classifier combination for one of the data sets is the InceptionV3 architecture, which has an SVM classifier with a linear kernel that achieves an accuracy of 99.421 percent. Another benchmark, the best combination, is ResNet50 with MLP, which has 97.461%accuracy. As a result, the suggested technique demonstrates the efficacy of detecting COVID-19 using X-rays.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125786175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
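The extractor-classifier pattern evaluated in that paper can be sketched as follows. To keep the example self-contained, the pretrained CNN extractor is assumed to have already produced feature vectors, and a simple nearest-centroid head stands in for the SVM/MLP classifiers actually compared; function names here are illustrative:

```python
def fit_centroid_head(features, labels):
    """Fit a nearest-centroid classification head on feature vectors
    that a frozen, pretrained CNN extractor would produce."""
    groups = {}
    for f, y in zip(features, labels):
        groups.setdefault(y, []).append(f)
    # one centroid (mean feature vector) per class
    return {y: [sum(col) / len(fs) for col in zip(*fs)]
            for y, fs in groups.items()}

def predict(centroids, feature):
    # assign the label whose centroid is nearest in feature space
    return min(centroids,
               key=lambda y: sum((a - b) ** 2
                                 for a, b in zip(feature, centroids[y])))
```

The transfer-learning point is that only this small head is trained on the 1,000 X-ray features; the expensive ImageNet-trained extractor stays frozen.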
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544855
Stephin Philip, Pawan Vashisth, Anant Chaturvedi, Neha Gupta
A data warehouse aids in the management of the large amounts of data that may be stored to handle user input during the computing process. The major issue with a data warehouse is maintaining the quality of the data that the user stores. Some traditional techniques can improve data quality while also increasing efficiency. Each unit of data has a unique feature that has been researched by many researchers and has an influence on data quality. This research article enhances the K-Means method by utilizing the Euclidean distance metric to detect missing values in the gathered sources and replace them with the closest values while maintaining the data's consistency, exactness, and quality. The improved data will assist developers in analysing data quality prior to data integration by allowing them to make informed decisions quickly in accordance with business requirements. The improved K-Means achieves better accuracy and requires less computational time for clustering data objects when compared to other related approaches.
{"title":"Imputation of Missing Values using Improved K-Means Clustering Algorithm to Attain Data Quality","authors":"Stephin Philip, Pawan Vashisth, Anant Chaturvedi, Neha Gupta","doi":"10.1109/ICIRCA51532.2021.9544855","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544855","url":null,"abstract":"A data warehouse aids in the management of large amounts of data that may be stored in order to handle user input during the computer process. The major issue with a data warehouse is to maintain the data that the user stores in good quality. Some traditional techniques can improve data quality while also increasing efficiency. Each unit of data has a unique feature that has been researched by many researchers and has an influence on data quality. This research article has enhanced the K-Means method by utilizing the Euclidean Distance metric to detect missing values from the gathered sources and replace them with closest values while maintaining the data's consistency, exactness, and quality. yThe improved data will assist developers in analysing data quality prior to data integration by allowing them to make informed decisions quickly in accordance with business requirements. Improved K-Means achieves better accuracy and requires less computational time for clustering data objects when compared to other related approaches.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124860272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
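A minimal sketch of the imputation scheme described above, assuming the approach clusters the complete rows with Euclidean-distance K-Means and fills each missing value from the nearest centroid; the function names and the centroid-seeding strategy are illustrative, not taken from the paper:

```python
import math

def nearest(row, centroids):
    # index of the centroid closest to row (squared Euclidean distance)
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))

def impute_kmeans(rows, k=2, iters=20):
    """Cluster the complete rows with K-Means, then fill each missing
    value (None) from the nearest centroid, where nearness is measured
    over the observed features only."""
    complete = [r for r in rows if None not in r]
    # spread the initial centroids across the complete rows (naive seeding)
    centroids = [list(complete[i * len(complete) // k]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in complete:
            clusters[nearest(r, centroids)].append(r)
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    result = []
    for r in rows:
        if None in r:
            known = [(j, v) for j, v in enumerate(r) if v is not None]
            # Euclidean distance over observed features only
            best = min(range(k), key=lambda c: math.sqrt(
                sum((v - centroids[c][j]) ** 2 for j, v in known)))
            r = [centroids[best][j] if v is None else v for j, v in enumerate(r)]
        result.append(list(r))
    return result
```

Replacing a gap with the matching feature of the nearest centroid keeps the imputed row consistent with its cluster, which is the consistency/exactness property the abstract emphasizes.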
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544975
S. Sophia, B. Shankar, K. Akshya, AR. C. Arunachalam, V. T. Y. Avanthika, S. Deepak
As of now, the Global Positioning System (GPS) is the leading outdoor positioning system. However, GPS fails indoors because its signal does not penetrate easily through solid objects and there is no line of sight. Since GPS is unreliable indoors, an alternative technology called the Indoor Positioning System (IPS) has emerged. Indoor positioning is accomplished using several techniques and devices; the proposed model uses a Bluetooth Low Energy (BLE)-based positioning system. This paper focuses on implementing BLE-based indoor positioning using the ESP32 NodeMCU.
{"title":"Bluetooth Low Energy based Indoor Positioning System using ESP32","authors":"S. Sophia, B. Shankar, K. Akshya, AR. C. Arunachalam, V. T. Y. Avanthika, S. Deepak","doi":"10.1109/ICIRCA51532.2021.9544975","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544975","url":null,"abstract":"As of now, the Global Positioning System (GPS) is the leading outdoor positioning system, However, indoors, GPS is a flop because the signal does not penetrate easily through solid objects and there is no line-of-sight. Since GPS is unreliable in indoors the alternative technology emerged called Indoor Positioning System (IPS). Indoor positioning is accomplished using several techniques and devices. The proposed model prefers to use Bluetooth Low Energy-based positioning system. This paper focuses on implementing BLE based indoor positioning using ES P32-Node MCU.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125027700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
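A common way such a BLE system turns beacon readings into a position, shown here as a sketch rather than the paper's exact method: convert each beacon's RSSI to a distance with the log-distance path-loss model, then trilaterate from three beacons. The `tx_power` (RSSI at 1 m) and path-loss exponent `n` are assumed calibration values.

```python
import math

def rssi_to_distance(rssi, tx_power=-59, n=2.0):
    # log-distance path-loss model: rssi = tx_power - 10*n*log10(d)
    return 10 ** ((tx_power - rssi) / (10 * n))

def trilaterate(beacons, distances):
    """2D position from three beacons by linearizing the circle
    equations (subtract the first equation from the other two)."""
    (x1, y1), (x2, y2), (x3, y3) = beacons
    r1, r2, r3 = distances
    A = 2 * (x2 - x1); B = 2 * (y2 - y1)
    C = r1**2 - r2**2 - x1**2 + x2**2 - y1**2 + y2**2
    D = 2 * (x3 - x2); E = 2 * (y3 - y2)
    F = r2**2 - r3**2 - x2**2 + x3**2 - y2**2 + y3**2
    det = A * E - B * D            # zero if the beacons are collinear
    return ((C * E - B * F) / det, (A * F - C * D) / det)
```

In practice the ESP32 nodes would average RSSI over many advertisements before converting, since indoor RSSI is noisy; more than three beacons would call for a least-squares fit instead.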
Pub Date : 2021-09-02 DOI: 10.1109/ICIRCA51532.2021.9544818
M. Kowsher, Md. Jashim Uddin, A. Tahabilder, Md Ruhul Amin, Md. Fahim Shahriar, Md. Shohanur Islam Sobuj
Natural language processing (NLP) is an area of machine learning that has garnered a lot of attention in recent years due to the revolution in artificial intelligence, robotics, and smart devices. NLP focuses on training machines to understand and analyze various languages, extract meaningful information from them, translate from one language to another, correct grammar, predict the next word, complete a sentence, or even generate a completely new sentence from an existing corpus. A major challenge in NLP lies in training a model to obtain high prediction accuracy, since training needs a vast dataset. For widely used languages like English, many datasets are available for NLP tasks such as model training and summarization, but for languages like Bengali, which is spoken primarily in South Asia, there is a dearth of big datasets that can be used to build a robust machine learning model. Therefore, NLP researchers who work mainly with the Bengali language will find an extensive, robust dataset incredibly useful for their NLP tasks. With this pressing issue in mind, this research work has prepared a dataset whose content is curated from social media, blogs, newspapers, wiki pages, and other similar resources. The dataset contains 19,132,010 samples, whose length varies from 3 to 512 words. It can easily be used to build any unsupervised machine learning model with the aim of performing the necessary NLP tasks involving the Bengali language. This research work is also releasing two preprocessed versions of the dataset, suited for training core machine learning-based and statistical models respectively. As very few attempts have been made in this domain, it is believed that the proposed dataset will significantly contribute to the Bengali machine learning and NLP community.
{"title":"BanglaLM: Data Mining based Bangla Corpus for Language Model Research","authors":"M. Kowsher, Md. Jashim Uddin, A. Tahabilder, Md Ruhul Amin, Md. Fahim Shahriar, Md. Shohanur Islam Sobuj","doi":"10.1109/ICIRCA51532.2021.9544818","DOIUrl":"https://doi.org/10.1109/ICIRCA51532.2021.9544818","url":null,"abstract":"Natural language processing (NLP) is an area of machine learning that has garnered a lot of attention in recent days due to the revolution in artificial intelligence, robotics, and smart devices. NLP focuses on training machines to understand and analyze various languages, extract meaningful information from those, translate from one language to another, correct grammar, predict the next word, complete a sentence, or even generate a completely new sentence from an existing corpus. A major challenge in NLP lies in training the model for obtaining high prediction accuracy since training needs a vast dataset. For widely used languages like English, there are many datasets available that can be used for NLP tasks like training a model and summarization but for languages like Bengali, which is only spoken primarily in South Asia, there is a dearth of big datasets which can be used to build a robust machine learning model. Therefore, NLP researchers who mainly work with the Bengali language will find an extensive, robust dataset incredibly useful for their NLP tasks involving the Bengali language. With this pressing issue in mind, this research work has prepared a dataset whose content is curated from social media, blogs, newspapers, wiki pages, and other similar resources. The amount of samples in this dataset is 19132010, and the length varies from 3 to 512 words. This dataset can easily be used to build any unsupervised machine learning model with an aim to performing necessary NLP tasks involving the Bengali language. Also, this research work is releasing two preprocessed version of this dataset that is especially suited for training both core machine learning-based and statistical-based model. As very few attempts have been made in this domain, keeping Bengali language researchers in mind, it is believed that the proposed dataset will significantly contribute to the Bengali machine learning and NLP community.","PeriodicalId":245244,"journal":{"name":"2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128557583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
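Corpus preparation of the kind described for BanglaLM (cleaning scraped text, deduplicating, and enforcing the reported 3-512 word range) might look like the sketch below. The actual BanglaLM preprocessing pipeline is not specified here; URL stripping is an assumed cleaning step for web-scraped sources, and the example strings are placeholders:

```python
import re

def clean_sample(text):
    # drop URLs scraped from blogs/social media, then collapse whitespace
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(raw_samples, min_words=3, max_words=512):
    """Deduplicate, clean, and length-filter raw scraped samples,
    mirroring the 3-512 word range the dataset reports."""
    seen, corpus = set(), []
    for s in map(clean_sample, raw_samples):
        n = len(s.split())
        if min_words <= n <= max_words and s not in seen:
            seen.add(s)
            corpus.append(s)
    return corpus
```

Whitespace-based word counting is a simplification; a Bengali-aware tokenizer would be the natural substitute when processing the real corpus.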