Predictive analytics is crucial in precision medicine for personalized patient care. To aid precision medicine, this study identifies a subset of genetic and clinical variables that can serve as predictors for classifying diseased tissues/disease types. To this end, experiments were performed on diseased tissues obtained from the L1000 dataset to assess differences in the functionality and predictive capabilities of genetic and clinical variables. The k‐means technique was used to cluster the diseased tissue types, and multinomial logistic regression (MLR) was applied to classify them. Dimensionality reduction techniques, including principal component analysis and Boruta, were used extensively to reduce the dimensionality of the genetic and clinical variables. The results showed that landmark genes performed slightly better in clustering diseased tissue types than any random set of 978 non‐landmark genes, and that the difference was statistically significant. Furthermore, both clinical and genetic variables were important in predicting the diseased tissue types. The top three clinical predictors were morphology, gender, and age at diagnosis. Additionally, this study explored the possibility of using the latent representations of the clusters of landmark and non‐landmark genes as predictors for an MLR classifier. The classification models built using MLR revealed that landmark genes can serve as a subset of genetic variables and/or as a proxy for clinical variables. This study concludes that combining predictive analytics with dimensionality reduction effectively identifies key predictors in precision medicine, enhancing diagnostic accuracy.
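The clustering step described in this abstract can be illustrated with a small sketch. The code below is not the authors' pipeline: it is a plain Lloyd's‐algorithm k‐means in standard‐library Python, run on made‐up two‐gene expression profiles. The function names (`kmeans`, `purity`), the deterministic initialisation, and the use of cluster purity as an agreement measure are all assumptions made for illustration; the abstract does not say how clustering quality was scored.

```python
import math
from collections import Counter

def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm; points is a list of equal-length tuples."""
    # Toy deterministic initialisation: pick k points spaced evenly through the input.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

def purity(labels, truth):
    """Fraction of samples that share their cluster's majority true class."""
    majority_total = 0
    for c in set(labels):
        classes = [t for lab, t in zip(labels, truth) if lab == c]
        majority_total += Counter(classes).most_common(1)[0][1]
    return majority_total / len(labels)

# Made-up two-gene expression profiles for two hypothetical tissue types.
profiles = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
tissue = ["A", "A", "A", "B", "B", "B"]
cluster_labels, centers = kmeans(profiles, k=2)
cluster_purity = purity(cluster_labels, tissue)
```

In a comparison like the one the abstract reports, the same scoring could in principle be repeated for the landmark gene set and for many random gene sets of equal size, with the resulting scores compared statistically.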
{"title":"Characterizing diseases using genetic and clinical variables: A data analytics approach","authors":"Madhuri Gollapalli, Harsh Anand, Satish Mahadevan Srinivasan","doi":"10.1002/qub2.46","DOIUrl":"https://doi.org/10.1002/qub2.46","journal":{"name":"Quantitative Biology","volume":"55 3"},"publicationDate":"2024-05-15","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140975906"}
Binyu Yang, Siying Liu, Jiemin Xie, Xi Tang, Pan Guan, Yifan Zhu, Xuemei Liu, Yunhui Xiong, Zuli Yang, Weiyao Li, Yonghua Wang, Wen Chen, Qingjiao Li, Li C. Xia
Molecular subtyping of gastric cancer (GC) aims to comprehend its genetic landscape. However, the efficacy of current subtyping methods is hampered by their mixed use of molecular features, a lack of strategy optimization, and the limited availability of public GC datasets. There is a pressing need for a precise and easily adoptable subtyping approach for early DNA‐based screening and treatment. Based on TCGA subtypes, we developed a novel DNA‐based hierarchical classifier for gastric cancer molecular subtyping (HCG), which employs gene mutations, copy number aberrations, and methylation patterns as predictors. By incorporating the closely related esophageal adenocarcinoma dataset, we expanded the TCGA GC dataset for the training and testing of HCG (n = 453). HCG was optimized through three hierarchical strategies using Lasso‐Logistic regression, which were evaluated by their overall area under the receiver operating characteristic curve (auROC), accuracy, F1 score, and area under the precision‐recall curve (auPRC), and by their capability for clinical stratification using multivariate survival analysis. Subtype‐specific DNA alteration biomarkers were discerned through difference tests based on HCG‐defined subtypes. Our HCG classifier demonstrated superior performance in terms of overall auROC (0.95), accuracy (0.88), F1 score (0.87), and auPRC (0.86), and significantly improved the clinical stratification of patients (overall p‐value = 0.032). Difference tests identified 25 subtype‐specific DNA alterations, including a high mutation rate in the SYNE1, ITGB4, and COL22A1 genes for the MSI subtype, and hypermethylation of the ALS2CL, KIAA0406, and RPRD1B genes for the EBV subtype. HCG is an accurate and robust classifier for DNA‐based GC molecular subtyping with highly predictive clinical stratification performance. The training and test datasets, along with the analysis programs of HCG, are accessible on GitHub (github.com/LabxSCUT).
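For readers unfamiliar with the evaluation metrics named in this abstract, the following sketch computes auROC, accuracy, and F1 for a single one‐vs‐rest subtype decision. It is a minimal standard‐library illustration, not the HCG code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions. The auROC is computed with the pairwise (Mann‐Whitney) formulation, i.e. the fraction of positive/negative pairs in which the positive sample receives the higher score, counting ties as one half.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, preds):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

def f1_score(labels, preds):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: 1 = sample belongs to the subtype of interest, 0 = any other subtype.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]          # hypothetical classifier probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]   # assumed threshold of 0.5

roc_value = auroc(y_true, y_score)
acc_value = accuracy(y_true, y_pred)
f1_value = f1_score(y_true, y_pred)
```

Note that auROC is threshold‐free, whereas accuracy and F1 depend on the chosen cut‐off, which is one reason such metrics are usually reported together.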
{"title":"Hierarchical learning of gastric cancer molecular subtypes by integrating multi‐modal DNA‐level omics data and clinical stratification","authors":"Binyu Yang, Siying Liu, Jiemin Xie, Xi Tang, Pan Guan, Yifan Zhu, Xuemei Liu, Yunhui Xiong, Zuli Yang, Weiyao Li, Yonghua Wang, Wen Chen, Qingjiao Li, Li C. Xia","doi":"10.1002/qub2.45","DOIUrl":"https://doi.org/10.1002/qub2.45","journal":{"name":"Quantitative Biology","volume":"77 3"},"publicationDate":"2024-05-13","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140984647"}
Tianxing Ma, Zetong Zhao, Haochen Li, Lei Wei, Xuegong Zhang
Complicated molecular alterations in tumors generate various mutant peptides. Some of these mutant peptides can be presented on the cell surface and elicit immune responses; such mutant peptides are called neoantigens. Accurate detection of neoantigens could help to design personalized cancer vaccines. Although some computational frameworks for neoantigen detection have been proposed, most of them can only detect SNV‐ and indel‐derived neoantigens. In addition, current frameworks adopt oversimplified neoantigen prioritization strategies. These factors hinder the comprehensive and effective detection of neoantigens. We developed NeoHunter, flexible software to systematically detect and prioritize neoantigens from sequencing data in different formats. NeoHunter can detect not only SNV‐ and indel‐derived neoantigens but also gene fusion‐ and aberrant splicing‐derived neoantigens. NeoHunter supports both direct and indirect immunogenicity evaluation strategies to prioritize candidate neoantigens. These strategies utilize binding characteristics, existing biological big data, and T‐cell receptor specificity to ensure accurate detection and prioritization. We applied NeoHunter to the TESLA dataset, which comprises cohorts of melanoma and non‐small cell lung cancer patients. NeoHunter achieved high performance across the TESLA cancer patients and detected 79% (27 out of 34) of validated neoantigens in total. SNV‐ and indel‐derived neoantigens accounted for 90% of the top 100 candidate neoantigens, while neoantigens from aberrant splicing accounted for 9%. Gene fusion‐derived neoantigens were detected in one patient. NeoHunter is a powerful tool to ‘catch all’ neoantigens and is available for free academic use on GitHub (XuegongLab/NeoHunter).
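To make the notion of an SNV‐derived candidate neoantigen concrete, the sketch below enumerates every peptide window of a given length that covers a single substituted residue. This is only an illustration of the general idea and is not taken from NeoHunter: the function name `mutant_peptides`, the protein sequence, and the substitution are hypothetical, and the default lengths of 8 to 11 residues reflect the typical size of MHC class I ligands rather than any setting documented in the abstract. Indels, gene fusions, and aberrant splicing events would need different handling.

```python
def mutant_peptides(protein, pos, alt, lengths=(8, 9, 10, 11)):
    """Return all k-mers of the mutated protein that contain the substituted residue.

    protein: wild-type amino-acid sequence
    pos:     0-based index of the substituted residue
    alt:     single-letter code of the mutant amino acid
    """
    if not 0 <= pos < len(protein):
        raise ValueError("mutation position lies outside the protein")
    mutant = protein[:pos] + alt + protein[pos + 1:]
    peptides = []
    for k in lengths:
        # A window starting at `start` covers the mutation when start <= pos <= start + k - 1,
        # and must also fit inside the sequence (start + k <= len(mutant)).
        first = max(0, pos - k + 1)
        last = min(pos, len(mutant) - k)
        for start in range(first, last + 1):
            peptides.append(mutant[start:start + k])
    return peptides

# Hypothetical 18-residue protein with a Q -> L substitution at 0-based position 8.
wild_type = "MKTAYIAKQRQISFVKSH"
nine_mers = mutant_peptides(wild_type, pos=8, alt="L", lengths=(9,))
```

Each candidate produced this way would then still need to be scored, for example for predicted MHC binding, before it could be prioritized.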
{"title":"NeoHunter: Flexible software for systematically detecting neoantigens from sequencing data","authors":"Tianxing Ma, Zetong Zhao, Haochen Li, Lei Wei, Xuegong Zhang","doi":"10.1002/qub2.28","DOIUrl":"https://doi.org/10.1002/qub2.28","journal":{"name":"Quantitative Biology","volume":"8 2"},"publicationDate":"2024-01-22","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139608341"}
Cardiovascular disease (CVD) is the major cause of death in many regions around the world, and several of its risk factors might be linked to diet. To improve public health and the understanding of this topic, we look at the recent Minnesota Coronary Experiment (MCE) analysis, which used a t‐test and a Cox model to evaluate CVD risks. However, these parametric methods might suffer from three problems: small sample size, right‐censored bias, and lack of long‐term evidence. To overcome the first of these challenges, we utilize a nonparametric permutation test to examine the relationship between dietary fats and serum total cholesterol. To address the second problem, we use a resampling‐based rank test to examine whether the serum total cholesterol level affects CVD deaths. For the third issue, we use additional Framingham Heart Study (FHS) data with an A/B test to look for a meta‐relationship between diets, risk factors, and CVD risks. We show that, firstly, the link between low saturated fat diets and a reduction in serum total cholesterol is strong. Secondly, reducing serum total cholesterol does not have a robust impact on CVD hazards in the diet group. Lastly, the A/B test result suggests a more complicated relationship regarding abnormal diastolic blood pressure ranges caused by diets and how these might affect the associative link between the cholesterol level and heart disease risks. This study not only analyzes the MCE data in depth but also, in combination with the long‐term FHS data, reveals possible complex relationships between diets, risk factors, and heart disease.
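The nonparametric permutation test mentioned above can be sketched as follows. This is a generic two‐sample exact permutation test on the difference in group means, written with the standard library only; it is an assumption that a test of this form resembles the one used in the study, since the abstract does not give the test statistic. The function name and the cholesterol‐change values are made up for illustration. Exact enumeration is feasible only for small groups, which fits the small‐sample setting the abstract describes; for larger groups a random subset of relabellings would normally be drawn instead.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation p-value for the absolute difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    total = 0
    extreme = 0
    # Enumerate every way of relabelling n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        perm_a = [pooled[i] for i in chosen]
        perm_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(perm_a) - mean(perm_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Made-up changes in serum total cholesterol (mg/dL) for two hypothetical groups.
diet_change = [-30, -25, -28, -35, -22]
control_change = [-5, 2, -8, 0, -3]
p_value = exact_permutation_test(diet_change, control_change)
```

Because the two toy groups do not overlap at all, only the observed labelling and its mirror image reach the observed difference, so the p‐value is 2 out of the 252 possible relabellings.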
{"title":"Re‐examination of statistical relationships between dietary fats and other risk factors, and cardiovascular disease, based on two crucial datasets","authors":"Jiarui Ou, Le Zhang, Xiaoli Ru","doi":"10.1002/qub2.29","DOIUrl":"https://doi.org/10.1002/qub2.29","journal":{"name":"Quantitative Biology","volume":"31 50"},"publicationDate":"2024-01-22","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139607703"}