Predictive Modeling of Novel Somatic Mutation Impacts on Cancer Prognosis: A Machine Learning Approach Using the COSMIC Database

Masab A. Mansoor, Dba
medRxiv, published 2024-08-11. DOI: 10.1101/2024.08.10.24311796 (https://doi.org/10.1101/2024.08.10.24311796)
Citations: 0

Abstract

Background: Somatic mutations play a crucial role in cancer initiation, progression, and treatment response. While high-throughput sequencing has vastly expanded our understanding of cancer genomics, interpreting the functional impact of novel somatic mutations remains challenging. Machine learning approaches show promise in predicting mutation impacts, but robust models for accurate prognosis across different cancer types are still needed.

Objective: This study aimed to develop and validate a machine learning model using the Catalogue of Somatic Mutations in Cancer (COSMIC) database to predict the functional impact of novel somatic mutations on cancer prognosis across various cancer types.

Methods: We extracted data on 6,573,214 coding point mutations across 1,391 cancer types from COSMIC v95. We engineered 47 features for each mutation, including sequence context, protein domain information, evolutionary conservation scores, and frequency data. We developed and compared Random Forest, XGBoost, and Deep Neural Network models, selecting XGBoost based on performance. The model was evaluated using standard metrics and externally validated using data from The Cancer Genome Atlas (TCGA).

Results: The XGBoost model achieved an area under the Receiver Operating Characteristic curve (AUC-ROC) of 0.89 on the test set and 0.86 on the TCGA validation set. The model demonstrated consistent performance across major cancer types (AUC-ROC range: 0.85-0.92). Key predictive features included evolutionary conservation score, protein domain disruption, and mutation frequency. The model correctly identified 87% of known driver mutations and predicted 3,241 potentially high-impact novel mutations. Model predictions significantly correlated with patient survival in the TCGA dataset (HR = 1.8, 95% CI: 1.6-2.0, p < 0.001).

Conclusions: Our machine learning model shows strong predictive power in assessing the functional impact of somatic mutations on cancer prognosis across various cancer types. This approach has potential applications in research prioritization and clinical decision support, contributing to the advancement of precision oncology.

Keywords: cancer genomics; somatic mutations; machine learning; prognosis prediction; COSMIC database; precision oncology
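
The abstract reports two kinds of summary statistics: AUC-ROC values and a hazard ratio with a 95% confidence interval. The following is a minimal sketch, using only the Python standard library, of how these quantities can be read. It is not the authors' code. The function names `auc_roc` and `wald_z_from_hr` are hypothetical, the toy labels and scores are invented for illustration, and the second function assumes the reported interval is a Wald interval symmetric on the log scale, which the abstract does not state.

```python
import math


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def wald_z_from_hr(hr, ci_low, ci_high, z_crit=1.96):
    """Recover the log-scale standard error and z statistic from an HR and its 95% CI.

    Assumption (not stated in the abstract): the interval is a Wald interval,
    i.e. exp(log(HR) +/- z_crit * SE).
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    return math.log(hr) / se, se


# Toy labels and scores, invented for illustration only (not study data).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
toy_auc = auc_roc(toy_labels, toy_scores)

# Values taken from the abstract: HR = 1.8, 95% CI 1.6-2.0.
z_stat, log_se = wald_z_from_hr(1.8, 1.6, 2.0)
```

Under the Wald assumption, the reported interval implies a z statistic of roughly 10, which is consistent with the stated p < 0.001. The pairwise AUC computation is quadratic in the number of examples and is meant only to make the definition concrete; a rank-based implementation would be preferable at the scale described in the Methods.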