An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification

Ibrahim Al-Hurani, Abedalrhman Alkhateeb, Salama Ikki
arXiv - QuanBio - Genomics. Published 2024-05-16. Identifier: arxiv-2405.09756 (https://doi.org/arxiv-2405.09756)

Abstract

In the ongoing effort to enhance medical diagnostics, the integration of state-of-the-art machine learning methodologies has emerged as a promising research area. In molecular biology, there has been an explosion of data generated from multi-omics sequencing. Newly developed sequencing equipment can provide a large number of complex measurements per experiment. Traditional statistical methods therefore struggle with such high-dimensional data. However, most of the information contained in these datasets is redundant or irrelevant and can be reduced to significantly fewer variables without losing much information. Dimensionality reduction techniques are mathematical procedures that allow for this reduction; they have largely been developed in the statistics and machine learning disciplines. The other challenge in medical datasets is an imbalanced number of samples across classes, which leads to biased results in machine learning models. This study tackles these challenges with a neural network that incorporates an autoencoder to extract a latent space of the features and a Generative Adversarial Network (GAN) to generate synthetic samples. The latent space is the reduced-dimensional space that captures the meaningful features of the original data. Our model starts with feature selection to select the discriminative features before feeding them to the neural network. Then, the model predicts the cancer outcome for different datasets. The proposed model outperformed other existing models, achieving an accuracy of 95.09% on the bladder cancer dataset and 88.82% on the breast cancer dataset.
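The abstract names the stages of the pipeline but not how each is implemented. As a rough illustration of the two data-preparation ideas only, the sketch below ranks features by a simple class-separation score and counts how many synthetic samples each class would need to match the largest class. The function names `select_features` and `synthetic_needed`, the scoring rule, and the toy data are all assumptions made for this example; they are not taken from the paper, and the autoencoder and GAN themselves are not implemented here.

```python
from collections import Counter
from statistics import mean, pvariance


def select_features(X, y, k):
    """Keep the k features that best separate two classes.

    Hypothetical stand-in for the paper's feature-selection step; the
    abstract does not say which method the authors use.
    """
    classes = sorted(set(y))
    assert len(classes) == 2, "this sketch handles the binary case only"
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == classes[0]]
        b = [row[j] for row, label in zip(X, y) if label == classes[1]]
        # Squared difference of class means, scaled by within-class spread.
        spread = pvariance(a) + pvariance(b) + 1e-12  # avoid division by zero
        scores.append(((mean(a) - mean(b)) ** 2 / spread, j))
    top = sorted(scores, reverse=True)[:k]
    return sorted(j for _, j in top)


def synthetic_needed(y):
    """Synthetic samples each class needs to match the largest class."""
    counts = Counter(y)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Toy data: 6 samples, 3 features; class 1 is the minority class.
X = [
    [0.1, 5.0, 1.0],
    [0.2, 5.1, 0.9],
    [0.0, 4.9, 1.1],
    [0.1, 5.0, 1.0],
    [0.9, 5.0, 1.0],
    [1.0, 5.1, 0.9],
]
y = [0, 0, 0, 0, 1, 1]

kept = select_features(X, y, k=1)   # indices of the retained features
deficit = synthetic_needed(y)       # per-class count a generator would supply
print(kept, deficit)
```

In a pipeline like the one described, the retained features would then be compressed by the autoencoder, and a generator would be asked to produce the per-class deficit of synthetic samples before the classifier is trained.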