Multicategory Survival Outcomes Classification via Overlapping Group Screening Process Based on Multinomial Logistic Regression Model With Application to TCGA Transcriptomic Data.

Cancer Informatics (IF 2.5, JCR Q2, Mathematical & Computational Biology). Pub Date: 2024-10-08. eCollection Date: 2024-01-01. DOI: 10.1177/11769351241286710

Jie-Huei Wang, Po-Lin Hou, Yi-Hau Chen

Cancer Informatics, volume 23, article 11769351241286710. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11462568/pdf/

Abstract

Objectives: When classifying multicategory survival outcomes of cancer patients, it is crucial to identify biomarkers that affect specific outcome categories. The classification of multicategory survival outcomes from transcriptomic data has been thoroughly investigated in computational biology. Nevertheless, several challenges must be addressed, including the ultra-high-dimensional feature space, feature contamination, and data imbalance, all of which contribute to the instability of the diagnostic model. Furthermore, although most methods achieve accurate predictive performance for binary classification with high-dimensional transcriptomic data, their extension to multi-class classification is not straightforward.

Methods: We employ the One-versus-One strategy to transform multi-class classification into multiple binary classification problems, and use the overlapping group screening procedure with binary logistic regression to incorporate pathway information when identifying important genes and gene-gene interactions for multicategory survival outcomes.
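The One-versus-One decomposition can be illustrated with a minimal sketch. It assumes a plain gradient-descent logistic regression and majority voting across the pairwise models; it deliberately omits the overlapping group screening step and the pathway information, and the class labels, toy features, and function names (`fit_logistic`, `fit_ovo`, `predict_ovo`) are hypothetical rather than taken from the paper.

```python
import itertools
import math
from collections import Counter


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    # Binary logistic regression (labels 0/1) fitted by batch gradient descent.
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fit_ovo(X, y):
    # Train one binary model per unordered pair of classes.
    models = {}
    for a, b in itertools.combinations(sorted(set(y)), 2):
        idx = [i for i, label in enumerate(y) if label in (a, b)]
        X_ab = [X[i] for i in idx]
        y_ab = [1 if y[i] == b else 0 for i in idx]  # class b is coded as 1
        models[(a, b)] = fit_logistic(X_ab, y_ab)
    return models


def predict_ovo(models, x):
    # Each pairwise model casts one vote; the class with the most votes wins.
    votes = Counter()
    for (a, b), (w, bias) in models.items():
        p_b = sigmoid(bias + sum(wj * xj for wj, xj in zip(w, x)))
        votes[b if p_b >= 0.5 else a] += 1
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, three survival-outcome classes.
X = [[0.0, 0.2], [0.3, 0.0], [0.1, 0.1],
     [3.0, 0.1], [2.8, 0.3], [3.2, 0.0],
     [0.1, 3.0], [0.0, 2.7], [0.3, 3.1]]
y = ["short"] * 3 + ["medium"] * 3 + ["long"] * 3

models = fit_ovo(X, y)
print(predict_ovo(models, [2.9, 0.2]))
```

With K classes this trains K(K-1)/2 binary models, each on only the samples from its two classes, which is why binary screening and selection procedures can be reused unchanged inside each pair.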

Results: A series of simulation studies are conducted to compare the classification accuracy of our proposed approach with existing machine learning methods. In practical data applications, we use the random oversampling procedure to tackle class imbalance. We then apply the proposed method to analyze transcriptomic data from various cancers in The Cancer Genome Atlas, such as kidney renal papillary cell carcinoma, lung adenocarcinoma, and head and neck squamous cell carcinoma. Our aim is to establish an accurate microarray-based multicategory cancer diagnosis model. The numerical results show that the proposed method improves cancer diagnosis compared with approaches that neglect pathway information.
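Random oversampling, as mentioned above, can be sketched as follows. This is a generic illustration of the technique and not the authors' code: minority-class samples are duplicated at random (with replacement) until every class matches the size of the largest class. The function name `random_oversample`, the seed, and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    # Duplicate randomly chosen minority-class samples until all classes
    # have as many samples as the largest class.
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Hypothetical imbalanced data: 4 "long", 1 "medium", 1 "short" sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = ["long"] * 4 + ["medium"] + ["short"]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

In common practice the resampling is applied to the training portion only, so that duplicated samples cannot leak into the data used to evaluate accuracy.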

Conclusions: We demonstrate the effectiveness of the proposed method in terms of class prediction accuracy through evaluations on simulated datasets as well as real data applications. We also identify cancer-related gene-gene interaction biomarkers and report the corresponding network structure. Based on the identified major genes and gene-gene interactions, we can predict, for each patient, the probability of belonging to each of the survival outcome classes.
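The abstract does not say how the pairwise model outputs are turned into per-class probabilities. One simple heuristic, shown here purely as an assumption, is to sum each class's pairwise probabilities and normalise by the number of pairs; the function name `class_probabilities` and the numbers are hypothetical.

```python
def class_probabilities(pairwise):
    # pairwise[(a, b)] is an estimate of P(class == a | class in {a, b}).
    classes = sorted({c for pair in pairwise for c in pair})
    k = len(classes)
    scores = {c: 0.0 for c in classes}
    for (a, b), r_ab in pairwise.items():
        scores[a] += r_ab
        scores[b] += 1.0 - r_ab
    # Each pair contributes exactly 1 in total, so the scores sum to k(k-1)/2.
    n_pairs = k * (k - 1) / 2
    return {c: s / n_pairs for c, s in scores.items()}


# Hypothetical pairwise outputs for one patient.
pairwise = {
    ("long", "medium"): 0.8,
    ("long", "short"): 0.9,
    ("medium", "short"): 0.6,
}
probs = class_probabilities(pairwise)
print(probs)
```

More principled alternatives such as pairwise coupling exist; this normalised sum is only meant to show that One-versus-One outputs can be mapped to a probability vector over all classes.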

Source journal: Cancer Informatics (Medicine-Oncology)

CiteScore: 3.00

Self-citation rate: 5.00%

Review time: 8 weeks

Journal description: The field of cancer research relies on advances in many other disciplines, including omics technology, mass spectrometry, radio imaging, computer science, and biostatistics. Cancer Informatics provides open access to peer-reviewed, high-quality manuscripts reporting bioinformatics analysis of molecular genetics and/or clinical data pertaining to cancer, emphasizing the use of machine learning, artificial intelligence, statistical algorithms, advanced imaging techniques, data visualization, and high-throughput technologies. As the leading journal dedicated exclusively to reporting the use of computational methods in cancer research and practice, Cancer Informatics leverages methodological improvements in systems biology, genomics, proteomics, metabolomics, and molecular biochemistry into the fields of cancer detection, treatment, classification, risk-prediction, prevention, outcome, and modeling.