Machine Learning Model for Multiomics Biomarkers Identification for Menopause Status in Breast Cancer

Algorithms (IF 1.8, JCR Q3, Computer Science, Artificial Intelligence). Pub Date: 2023-12-28. DOI: 10.3390/a17010013
Firas Alghanim, Ibrahim Al-Hurani, H. Qattous, Abdullah Al-Refai, Osamah Batiha, A. Alkhateeb, Salama Ikki
Citations: 0

Abstract

Identifying menopause-related breast cancer biomarkers is crucial for improving diagnosis, prognosis, and personalized treatment at that stage of the patient’s life. In this paper, we present a comprehensive framework for extracting multiomics biomarkers specifically related to breast cancer incidence before and after menopause. Our approach integrates DNA methylation, gene expression, and copy number alteration data in a systematic pipeline that covers data preprocessing, class-imbalance handling, dimensionality reduction, and classification. The framework starts with MutSigCV for data preprocessing and quality control. The Synthetic Minority Over-sampling Technique (SMOTE) is then applied to address the class imbalance. Next, Principal Component Analysis (PCA) transforms the DNA methylation, gene expression, and copy number alteration data into a latent space, with the aim of discarding irrelevant variation and retaining relevant information. Finally, a classification model is built on a unified representation of the transformed multiomics data. The framework contributes to understanding the complex interplay between menopause and breast cancer and may thereby support more precise diagnostic and therapeutic strategies in the future. Shapley-based explainable artificial intelligence applied to an XGBoost regressor showed the power of the selected gene expressions for predicting menopause status; the potential biomarkers included RUNX1, PTEN, MAP3K1, and CDH1. The findings were confirmed by the literature.
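
The class-imbalance step named in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' implementation, which the abstract does not describe. The function names `smote` and `balance`, the parameters `k` and `seed`, the two-class assumption, and the toy data are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label, k=5, seed=0):
    """Oversample the minority class to the majority size (two classes assumed)."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    new = smote(minority, max(0, n_majority - len(minority)), k=k, seed=seed)
    return X + new, y + [minority_label] * len(new)


# Toy data (made up): 6 "post" samples vs 3 "pre" samples, 2 features each.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2], [0.1, 0.4],
     [1.0, 1.1], [1.2, 0.9], [0.9, 1.3]]
y = ["post"] * 6 + ["pre"] * 3
X_bal, y_bal = balance(X, y, minority_label="pre", k=2)
```

In a pipeline like the one described, oversampling of this kind would normally be applied to the training split only, so that synthetic samples do not leak into evaluation; whether the paper does so is not stated in the abstract.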
Source journal: Algorithms (Mathematics, Numerical Analysis)
CiteScore: 4.10
Self-citation rate: 4.30%
Articles published: 394
Review time: 11 weeks