A Survey Study on Proposed Solutions for Imbalanced Big Data

Iraqi Journal of Science (Q4, Earth and Planetary Sciences) | Pub Date: 2024-03-29 | DOI: 10.24996/ijs.2024.65.3.37

S. Razoqi, Ghayda Al-Talib


Abstract

Learning from imbalanced data has been studied continuously for more than two decades. Training data is considered imbalanced when the positive (minority) class is so small relative to the negative (majority) class that it is effectively neglected during learning; in binary tasks this is compounded by skewed class distributions. The emergence of big data adds new problems and challenges to the imbalance problem. Big data is commonly characterized by the 5 Vs: volume, velocity, veracity, value, and variety. This study divides solutions to the data imbalance problem into three levels: data-level, algorithm-level, and hybrid approaches. It first summarizes the standard solutions proposed for the problem and identifies the most important metrics used to measure classification performance on imbalanced data. It then reviews 27 studies published during the period 2015–2022, grouped by the level at which they address the imbalance problem, along with the performance metrics used in those studies and the sources of the datasets to which the solutions were applied. The survey makes it easier for researchers and scholars to find solutions to the data imbalance problem, including the hybrid approaches used recently, and to apply them to improve the classification process.
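
To make the abstract's distinction concrete, the following is a minimal sketch of two of its ideas: metrics suited to imbalanced classification, and a data-level remedy. The abstract does not name specific metrics or resampling methods, so recall, precision, F1, and G-mean, as well as random oversampling, are assumptions chosen because they are widely used for this problem. All function names (`confusion_counts`, `imbalance_metrics`, `random_oversample`) are hypothetical and do not come from the paper.

```python
import math
import random


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Compute accuracy plus metrics that stay informative under imbalance."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # minority-class hit rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0     # majority-class hit rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "g_mean": g_mean}


def random_oversample(X, y, positive=1, seed=0):
    """Data-level sketch: duplicate random minority rows until classes match."""
    rng = random.Random(seed)
    minority = [(x, label) for x, label in zip(X, y) if label == positive]
    majority = [(x, label) for x, label in zip(X, y) if label != positive]
    if not minority or len(minority) >= len(majority):
        return list(X), list(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    rows = majority + minority + extra
    rng.shuffle(rows)
    return [x for x, _ in rows], [label for _, label in rows]


# Illustrative data (made up): 95 negatives, 5 positives, and a trivial
# classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
scores = imbalance_metrics(y_true, y_pred)

X = list(range(100))
X_balanced, y_balanced = random_oversample(X, y_true)
```

On this made-up data the always-negative classifier reaches high accuracy while its recall and G-mean are zero, which is why accuracy alone is a poor guide when the minority class is the one of interest. Algorithm-level approaches (for example, cost-sensitive learning) and hybrid approaches are not shown here.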
