膀胱癌肿瘤分期和生存的多基因风险模型。

IF 4 3区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Biodata Mining Pub Date : 2022-09-30 DOI:10.1186/s13040-022-00306-w

Mauro Nascimben, Lia Rimondini, Davide Corà, Manolo Venturin

{"title":"膀胱癌肿瘤分期和生存的多基因风险模型。","authors":"Mauro Nascimben, Lia Rimondini, Davide Corà, Manolo Venturin","doi":"10.1186/s13040-022-00306-w","DOIUrl":null,"url":null,"abstract":"Introduction: Bladder cancer assessment with non-invasive gene expression signatures facilitates the detection of patients at risk and surveillance of their status, bypassing the discomforts given by cystoscopy. To achieve accurate cancer estimation, analysis pipelines for gene expression data (GED) may integrate a sequence of several machine learning and bio-statistical techniques to model complex characteristics of pathological patterns.Methods: Numerical experiments tested the combination of GED preprocessing by discretization with tree ensemble embeddings and nonlinear dimensionality reductions to categorize oncological patients comprehensively. Modeling aimed to identify tumor stage and distinguish survival outcomes in two situations: complete and partial data embedding. This latter experimental condition simulates the addition of new patients to an existing model for rapid monitoring of disease progression. Machine learning procedures were employed to identify the most relevant genes involved in patient prognosis and test the performance of preprocessed GED compared to untransformed data in predicting patient conditions.Results: Data embedding paired with dimensionality reduction produced prognostic maps with well-defined clusters of patients, suitable for medical decision support. A second experiment simulated the addition of new patients to an existing model (partial data embedding): Uniform Manifold Approximation and Projection (UMAP) methodology with uniform data discretization led to better outcomes than other analyzed pipelines. Further exploration of parameter space for UMAP and t-distributed stochastic neighbor embedding (t-SNE) underlined the importance of tuning a higher number of parameters for UMAP rather than t-SNE. Moreover, two different machine learning experiments identified a group of genes valuable for partitioning patients (gene relevance analysis) and showed the higher precision obtained by preprocessed data in predicting tumor outcomes for cancer stage and survival rate (six classes prediction).Conclusions: The present investigation proposed new analysis pipelines for disease outcome modeling from bladder cancer-related biomarkers. Complete and partial data embedding experiments suggested that pipelines employing UMAP had a more accurate predictive ability, supporting the recent literature trends on this methodology. However, it was also found that several UMAP parameters influence experimental results, therefore deriving a recommendation for researchers to pay attention to this aspect of the UMAP technique. Machine learning procedures further demonstrated the effectiveness of the proposed preprocessing in predicting patients' conditions and determined a sub-group of biomarkers significant for forecasting bladder cancer prognosis.","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0000,"publicationDate":"2022-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9523990/pdf/","citationCount":"6","resultStr":"{\"title\":\"Polygenic risk modeling of tumor stage and survival in bladder cancer.\",\"authors\":\"Mauro Nascimben, Lia Rimondini, Davide Corà, Manolo Venturin\",\"doi\":\"10.1186/s13040-022-00306-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Introduction: Bladder cancer assessment with non-invasive gene expression signatures facilitates the detection of patients at risk and surveillance of their status, bypassing the discomforts given by cystoscopy. To achieve accurate cancer estimation, analysis pipelines for gene expression data (GED) may integrate a sequence of several machine learning and bio-statistical techniques to model complex characteristics of pathological patterns.Methods: Numerical experiments tested the combination of GED preprocessing by discretization with tree ensemble embeddings and nonlinear dimensionality reductions to categorize oncological patients comprehensively. Modeling aimed to identify tumor stage and distinguish survival outcomes in two situations: complete and partial data embedding. This latter experimental condition simulates the addition of new patients to an existing model for rapid monitoring of disease progression. Machine learning procedures were employed to identify the most relevant genes involved in patient prognosis and test the performance of preprocessed GED compared to untransformed data in predicting patient conditions.Results: Data embedding paired with dimensionality reduction produced prognostic maps with well-defined clusters of patients, suitable for medical decision support. A second experiment simulated the addition of new patients to an existing model (partial data embedding): Uniform Manifold Approximation and Projection (UMAP) methodology with uniform data discretization led to better outcomes than other analyzed pipelines. Further exploration of parameter space for UMAP and t-distributed stochastic neighbor embedding (t-SNE) underlined the importance of tuning a higher number of parameters for UMAP rather than t-SNE. Moreover, two different machine learning experiments identified a group of genes valuable for partitioning patients (gene relevance analysis) and showed the higher precision obtained by preprocessed data in predicting tumor outcomes for cancer stage and survival rate (six classes prediction).Conclusions: The present investigation proposed new analysis pipelines for disease outcome modeling from bladder cancer-related biomarkers. Complete and partial data embedding experiments suggested that pipelines employing UMAP had a more accurate predictive ability, supporting the recent literature trends on this methodology. However, it was also found that several UMAP parameters influence experimental results, therefore deriving a recommendation for researchers to pay attention to this aspect of the UMAP technique. Machine learning procedures further demonstrated the effectiveness of the proposed preprocessing in predicting patients' conditions and determined a sub-group of biomarkers significant for forecasting bladder cancer prognosis.\",\"PeriodicalId\":48947,\"journal\":{\"name\":\"Biodata Mining\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.0000,\"publicationDate\":\"2022-09-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9523990/pdf/\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Biodata Mining\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s13040-022-00306-w\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodata Mining","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13040-022-00306-w","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 6

摘要

介绍:膀胱癌的非侵入性基因表达特征评估有助于发现患者的风险和监测他们的状态，绕过膀胱镜检查带来的不适。为了实现准确的癌症估计，基因表达数据(GED)的分析管道可以整合一系列机器学习和生物统计技术来模拟病理模式的复杂特征。方法:数值实验验证了离散化与树集成嵌入相结合的GED预处理与非线性降维相结合对肿瘤患者进行综合分类。建模旨在识别肿瘤分期，区分两种情况下的生存结果:完全和部分数据嵌入。后一种实验条件模拟了将新患者添加到现有模型中以快速监测疾病进展。使用机器学习程序来识别与患者预后相关的最相关基因，并与未转换的数据相比，测试预处理GED在预测患者病情方面的性能。结果:数据嵌入与降维相结合，生成了具有明确定义的患者群的预后图，适用于医疗决策支持。第二个实验模拟了将新患者添加到现有模型中(部分数据嵌入):统一流形近似和投影(UMAP)方法具有统一数据离散化，其结果优于其他分析管道。对UMAP和t分布随机邻居嵌入(t-SNE)参数空间的进一步探索强调了为UMAP而不是t-SNE调整更多参数的重要性。此外，两个不同的机器学习实验确定了一组对划分患者有价值的基因(基因相关性分析)，并显示了预处理数据在预测癌症分期和生存率(六类预测)方面获得的更高精度。结论:本研究为膀胱癌相关生物标志物的疾病结局建模提供了新的分析管道。完整和部分数据嵌入实验表明，采用UMAP的管道具有更准确的预测能力，支持了该方法的最新文献趋势。然而，也发现UMAP的几个参数会影响实验结果，因此建议研究人员关注UMAP技术的这方面。机器学习程序进一步证明了所提出的预处理在预测患者病情方面的有效性，并确定了一组对预测膀胱癌预后具有重要意义的生物标志物。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

摘要图片

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Polygenic risk modeling of tumor stage and survival in bladder cancer.

Introduction: Bladder cancer assessment with non-invasive gene expression signatures facilitates the detection of patients at risk and surveillance of their status, bypassing the discomforts given by cystoscopy. To achieve accurate cancer estimation, analysis pipelines for gene expression data (GED) may integrate a sequence of several machine learning and bio-statistical techniques to model complex characteristics of pathological patterns.

Methods: Numerical experiments tested the combination of GED preprocessing by discretization with tree ensemble embeddings and nonlinear dimensionality reductions to categorize oncological patients comprehensively. Modeling aimed to identify tumor stage and distinguish survival outcomes in two situations: complete and partial data embedding. This latter experimental condition simulates the addition of new patients to an existing model for rapid monitoring of disease progression. Machine learning procedures were employed to identify the most relevant genes involved in patient prognosis and test the performance of preprocessed GED compared to untransformed data in predicting patient conditions.

Results: Data embedding paired with dimensionality reduction produced prognostic maps with well-defined clusters of patients, suitable for medical decision support. A second experiment simulated the addition of new patients to an existing model (partial data embedding): Uniform Manifold Approximation and Projection (UMAP) methodology with uniform data discretization led to better outcomes than other analyzed pipelines. Further exploration of parameter space for UMAP and t-distributed stochastic neighbor embedding (t-SNE) underlined the importance of tuning a higher number of parameters for UMAP rather than t-SNE. Moreover, two different machine learning experiments identified a group of genes valuable for partitioning patients (gene relevance analysis) and showed the higher precision obtained by preprocessed data in predicting tumor outcomes for cancer stage and survival rate (six classes prediction).

Conclusions: The present investigation proposed new analysis pipelines for disease outcome modeling from bladder cancer-related biomarkers. Complete and partial data embedding experiments suggested that pipelines employing UMAP had a more accurate predictive ability, supporting the recent literature trends on this methodology. However, it was also found that several UMAP parameters influence experimental results, therefore deriving a recommendation for researchers to pay attention to this aspect of the UMAP technique. Machine learning procedures further demonstrated the effectiveness of the proposed preprocessing in predicting patients' conditions and determined a sub-group of biomarkers significant for forecasting bladder cancer prognosis.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Biodata Mining MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

7.90

自引率

0.00%

发文量

审稿时长

23 weeks

期刊介绍： BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to: -Development, evaluation, and application of novel data mining and machine learning algorithms. -Adaptation, evaluation, and application of traditional data mining and machine learning algorithms. -Open-source software for the application of data mining and machine learning algorithms. -Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies. -Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.