Synthetic Data Generation Using Genetic Algorithm

Pratyusha Thogarchety, K. Das
Published in: 2023 2nd International Conference for Innovation in Technology (INOCON), 2023-03-03
DOI: 10.1109/INOCON57975.2023.10101072 (https://doi.org/10.1109/INOCON57975.2023.10101072)
Citations: 1

Abstract

Statistical machine learning models perform poorly under class imbalance. Real-world datasets contain mostly ‘normal’ examples and very few ‘abnormal’ examples, and in most cases the primary goal is to identify the abnormal instances. For example, if we want to develop a statistical machine learning model to identify financial fraud from historical transaction data, we can expect the majority of the data to come from the normal (non-fraudulent) class, with very few examples of fraudulent transactions. Training on such an imbalanced dataset makes machine learning models highly biased towards the majority non-fraudulent class. As a result, the objective of catching fraudulent transactions fails, and misclassifying such minority-class instances often incurs a much higher cost. Hence, a balanced dataset is required to train a sound model. Techniques such as undersampling, oversampling, and SMOTE have been proposed earlier. In this paper, we propose a novel technique to generate synthetic data using a genetic search algorithm. We examine the effectiveness of the proposed algorithm on different datasets and report the results in Section V.
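
The bias described in the abstract can be illustrated with a small sketch. This is not the paper's genetic algorithm, whose details the abstract does not give. It only shows, on made-up toy data, why accuracy is misleading under imbalance and how plain random oversampling (one of the earlier baselines the abstract names) balances the classes. The function names, the 990/10 class split, and the single Gaussian feature are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data (assumed): 990 normal (0) and 10 fraudulent (1) transactions.
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0)] for _ in range(990)]
X += [[rng.gauss(3.0, 1.0)] for _ in range(10)]
y = [0] * 990 + [1] * 10

# A model that always predicts the majority class looks accurate
# but never catches a fraudulent transaction.
always_normal = [0] * len(y)
acc = accuracy(y, always_normal)
rec = recall(y, always_normal)
print(f"accuracy={acc}, fraud recall={rec}")

# Random oversampling brings both classes to the same size.
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

Random oversampling only repeats existing minority rows, so it adds no new information. Methods such as SMOTE, or a genetic search as proposed in the paper, aim instead to create new minority examples.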