Biodata Mining: latest articles (page 9)

DeepAutoGlioma: a deep learning autoencoder-based multi-omics data integration and classification tools for glioma subtyping.

IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-11-15 DOI: 10.1186/s13040-023-00349-7

Sana Munquad, Asim Bikas Das

Background and objective: The classification of glioma subtypes is essential for precision therapy. Due to the heterogeneity of gliomas, the subtype-specific molecular pattern can be captured by integrating and analyzing high-throughput omics data from different genomic layers. The development of a deep-learning framework enables the integration of multi-omics data to classify the glioma subtypes to support the clinical diagnosis.

Results: Transcriptome and methylome data of glioma patients were preprocessed, and differentially expressed features from both datasets were identified. Subsequently, a Cox regression analysis determined genes and CpGs associated with survival. Gene set enrichment analysis was carried out to examine the biological significance of the features. Further, we identified CpG-gene pairs by mapping CpGs to the promoter regions of the corresponding genes. The methylation and gene expression levels of these CpGs and genes were embedded in a lower-dimensional space with an autoencoder. Next, an artificial neural network (ANN) and a convolutional neural network (CNN) were used to classify subtypes using the latent features from the embedding space. The CNN performed better than the ANN for subtyping lower-grade gliomas (LGG) and glioblastoma multiforme (GBM). The subtyping accuracy of the CNN was 98.03% (± 0.06) and 94.07% (± 0.01) in LGG and GBM, respectively. The precision of the models was 97.67% in LGG and 90.40% in GBM. The model sensitivity was 96.96% in LGG and 91.18% in GBM. Additionally, we observed the superior performance of the CNN with external datasets. The gene-CpG pairs used to develop the model performed better than random CpG-gene pairs, preprocessed data, and single-omics data.
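The accuracy, precision, and sensitivity figures above are standard classification metrics. As a minimal sketch of how such numbers can be computed for a multi-class subtyping task, the following assumes macro-averaging over subtypes (the abstract does not state the averaging scheme) and uses hypothetical labels that are not taken from the paper.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro sensitivity) as fractions.

    Macro-averaging is an assumption; the abstract does not say how the
    per-subtype scores were combined.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this subtype, but it was wrong
            fn[truth] += 1  # missed the true subtype
    accuracy = sum(tp.values()) / len(y_true)
    precisions = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in labels]
    recalls = [tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0 for c in labels]
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


# Hypothetical subtype labels for six patients.
y_true = ["s1", "s1", "s1", "s2", "s2", "s3"]
y_pred = ["s1", "s1", "s2", "s2", "s2", "s3"]
acc, prec, sens = macro_metrics(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} sensitivity={sens:.3f}")
```

Sensitivity here is the same quantity as recall: the share of patients of a given subtype that the classifier assigns to that subtype.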

Conclusions: The current study showed that a novel feature selection and data integration strategy led to the development of DeepAutoGlioma, an effective framework for diagnosing glioma subtypes.

Biodata Mining 16(1): 32. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10652591/pdf/

Citations: 0

Research agenda for using artificial intelligence in health governance: interpretive scoping review and framework.

IF 4, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-10-31 DOI: 10.1186/s13040-023-00346-w

Maryam Ramezani, Amirhossein Takian, Ahad Bakhtiari, Hamid R Rabiee, Sadegh Ghazanfari, Saharnaz Sazgarnejad

Background: The governance of health systems is complex in nature due to several intertwined and multi-dimensional factors contributing to it. Recent challenges of health systems reflect the need for innovative approaches that can minimize adverse consequences of policies. Hence, there is compelling evidence of a distinct outlook on the health ecosystem using artificial intelligence (AI). Therefore, this study aimed to investigate the roles of AI and its applications in health system governance through an interpretive scoping review of current evidence.

Method: This study intended to offer a research agenda and framework for the applications of AI in health systems governance. To include evidence with a greater focus on the application of AI in health governance from different perspectives, we searched the literature published from 2000 to 2023 in the PubMed, Scopus, and Web of Science databases.

Results: Our findings showed that integrating AI capabilities into health systems governance has the potential to influence three cardinal dimensions of health. These include social determinants of health, elements of governance, and health system tasks and goals. AI paves the way for strengthening the health system's governance through various aspects, i.e., intelligence innovations, flexible boundaries, multidimensional analysis, new insights, and cognition modifications to the health ecosystem area.

Conclusion: AI is expected to be seen as a tool with new applications and capabilities, with the potential to change each component of governance in the health ecosystem, which can eventually help achieve health-related goals.

Biodata Mining 16(1): 31. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10617108/pdf/

Citations: 0

Correction: A prognostic model based on seven immune-related genes predicts the overall survival of patients with hepatocellular carcinoma.

IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-10-26 DOI: 10.1186/s13040-023-00347-9

Qian Yan, Wenjiang Zheng, Boqing Wang, Baoqian Ye, Huiyan Luo, Xinqian Yang, Ping Zhang, Xiongwen Wang

Citations: 0

Prescription pattern analysis of Type 2 Diabetes Mellitus: a cross-sectional study in Isfahan, Iran.

IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-10-20 DOI: 10.1186/s13040-023-00344-y

Elnaz Ziad, Somayeh Sadat, Farshad Farzadfar, Mohammad-Reza Malekpour

Background: Patients with Type 2 Diabetes Mellitus (T2DM) are at a higher risk of polypharmacy and more susceptible to irrational prescriptions; therefore, it is important to monitor pharmacological therapy patterns. The primary objective of this study was to highlight current prescription patterns in T2DM patients and compare them with existing Standards of Medical Care in Diabetes. The second objective was to analyze whether age and gender affect prescription patterns.

Method: This cross-sectional study was conducted using the Iran Health Insurance Organization (IHIO) prescription database. The database was mined with an Association Rule Mining (ARM) technique, FP-Growth, to find drugs co-prescribed with anti-diabetic medications. The algorithm was run at different levels of the Anatomical Therapeutic Chemical (ATC) classification system, which assigns codes to drugs based on their anatomical, pharmacological, therapeutic, and chemical properties, to provide an in-depth analysis of co-prescription patterns.
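Running the mining step at different ATC levels amounts to truncating each drug code before counting co-occurrences. The sketch below illustrates that idea only; the helper names and the two example prescriptions are hypothetical, and the code lengths per level follow the standard ATC layout (for example C, C10, C10A, C10AA, C10AA05).

```python
# Number of characters that identify each of the five ATC levels.
ATC_LEVEL_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_at_level(code, level):
    """Truncate a full ATC code to the requested level (1 to 5)."""
    return code[:ATC_LEVEL_LENGTH[level]]


def prescriptions_at_level(prescriptions, level):
    """Map every prescription (a set of ATC codes) to a given ATC level."""
    return [frozenset(atc_at_level(code, level) for code in p) for p in prescriptions]


# Hypothetical prescriptions: metformin + atorvastatin, and
# metformin + enalapril + acetylsalicylic acid.
prescriptions = [
    {"A10BA02", "C10AA05"},
    {"A10BA02", "C09AA02", "B01AC06"},
]
for basket in prescriptions_at_level(prescriptions, level=2):
    print(sorted(basket))
# prints ['A10', 'C10'] and then ['A10', 'B01', 'C09']
```

Coarser levels merge many individual drugs into one item, so rules that are too rare to surface at the substance level can still appear at the group level.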

Results: Altogether, the prescriptions of 914,652 patients were analyzed, of whom 91,505 were found to have diabetes. According to our results, prescribing Lipid Modifying Agents (C10) (56.3%), Agents Acting on The Renin-Angiotensin System (C09) (48.9%), Antithrombotic Agents (B01) (35.7%), and Beta Blocking Agents (C07) (30.1%) were meaningfully associated with the prescription of Drugs Used in Diabetes. Our study also revealed that female diabetic patients have a higher lift for taking Thyroid Preparations, and the older the patients were, the more they were prone to take neuropathy-related medications. Additionally, the results suggest that there are gender differences in the association between aspirin and diabetes drugs, with the differences becoming less pronounced in old age.
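The "lift" mentioned above is one of three standard measures attached to an association rule, alongside support and confidence. As a minimal sketch, the following computes all three by direct counting over a handful of hypothetical prescriptions; it is not FP-Growth itself, which finds frequent itemsets efficiently before such measures are evaluated.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # contains antecedent
    n_c = sum(1 for t in transactions if c <= t)         # contains consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # contains both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical prescriptions expressed as ATC level-2 codes.
transactions = [
    {"A10", "C10"},
    {"A10", "C10", "C09"},
    {"A10", "B01"},
    {"C10"},
    {"C07"},
    {"A10", "C10"},
    {"H03"},
    {"C09"},
]
support, confidence, lift = rule_metrics(transactions, {"A10"}, {"C10"})
print(support, confidence, lift)  # 0.375 0.75 1.5
```

A lift above 1 means the two drug groups are prescribed together more often than would be expected if they were independent.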

Conclusions: Almost all of the association rules found in this research were clinically meaningful, proving the potential of ARM for co-prescription pattern discovery. Moreover, implementing level-based ARM was effective in detecting difficult-to-spot rules. Additionally, the majority of drugs prescribed by physicians were consistent with the Standards of Medical Care in Diabetes.

Biodata Mining 16(1): 29. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10588025/pdf/

Citations: 0

Attention-based dual-path feature fusion network for automatic skin lesion segmentation.

IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-10-09 DOI: 10.1186/s13040-023-00345-x

Zhenxiang He, Xiaoxia Li, Yuling Chen, Nianzu Lv, Yong Cai

Automatic segmentation of skin lesions is a critical step in Computer Aided Diagnosis (CAD) of melanoma. However, blurred lesion boundaries, uneven color distribution, and low image contrast often lead to poor segmentation results. To address this difficulty, this paper proposes an Attention-based Dual-path Feature Fusion Network (ADFFNet) for automatic skin lesion segmentation. First, in the spatial path, a Boundary Refinement (BR) module is designed for the output of low-level features to filter out irrelevant background information and retain more boundary details of the lesion area. Second, in the context path, a Multi-scale Feature Selection (MFS) module is constructed for high-level feature output to capture multi-scale context information and use the attention mechanism to filter out redundant semantic information. Finally, we design a Dual-path Feature Fusion (DFF) module, which uses high-level global attention information to guide the step-by-step fusion of high-level semantic features and low-level detail features; this helps restore image detail and further improves the pixel-level segmentation accuracy of skin lesions. In the experiments, the ISIC 2018 and PH2 datasets are used to evaluate the effectiveness of the proposed method. It achieves 0.890/0.925 and 0.933/0.954 on the F1-score and SE index, respectively. Comparative analysis with state-of-the-art segmentation methods reveals that the ADFFNet algorithm exhibits superior segmentation performance.
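For readers unfamiliar with the two reported indices, the sketch below shows how a pixel-level F1-score (equivalent to the Dice coefficient for binary masks) and sensitivity (SE) can be computed from a predicted and a reference mask. The tiny masks are hypothetical, and returning 1.0 when both masks are empty is a convention chosen for this example.

```python
def f1_and_sensitivity(pred_mask, true_mask):
    """Pixel-level F1 and sensitivity for two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1  # lesion pixel correctly segmented
            elif p and not t:
                fp += 1  # background marked as lesion
            elif t and not p:
                fn += 1  # lesion pixel missed
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0
    return f1, sensitivity


# Hypothetical 3x4 masks (1 = lesion, 0 = background).
true_mask = [[0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]
pred_mask = [[0, 1, 1, 1],
             [0, 1, 0, 0],
             [0, 0, 0, 0]]
print(f1_and_sensitivity(pred_mask, true_mask))  # (0.75, 0.75)
```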

Biodata Mining 16(1): 28. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10561442/pdf/

Citations: 0

Quantum analysis of squiggle data.

IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-10-06 DOI: 10.1186/s13040-023-00343-z

Naya Nagy, Matthew Stuart-Edwards, Marius Nagy, Liam Mitchell, Athanasios Zovoilis

Squiggle data is the numerical output of DNA and RNA sequencing by the Nanopore next-generation sequencing platform. Nanopore sequencing offers expanded applications compared to previous sequencing techniques but produces a large amount of data in the form of current measurements over time. The analysis of these segments of current measurements requires more complex and computationally intensive algorithms than previous sequencing technologies. The purpose of this study is to investigate in principle the potential of using quantum computers to speed up Nanopore data analysis. Quantum circuits are designed to extract major features of squiggle current measurements. The circuits are analyzed theoretically in terms of size and performance. Practical experiments on IBM QX show the limitations of state-of-the-art quantum computers in tackling real-life squiggle data problems. Nevertheless, pre-processing of the squiggle data using the inverse wavelet transform, which is also tested and analyzed in this paper, reduces the dimensionality of the problem so that it could fit a reasonably sized quantum computer in what is hopefully the near future.
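To make the dimensionality-reduction idea concrete, here is a minimal sketch of one level of a Haar wavelet transform and its inverse on a short signal. The choice of the Haar wavelet, the function names, and the sample current values are all assumptions for illustration; the abstract does not say which wavelet is used or exactly how the inverse transform enters the pre-processing.

```python
import math

SCALE = 1 / math.sqrt(2)


def haar_step(signal):
    """One Haar level: split an even-length signal into coarse and detail halves."""
    approx = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Rebuild the original signal from its coarse and detail coefficients."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) * SCALE, (a - d) * SCALE])
    return signal


# Hypothetical current measurements (arbitrary units), not real Nanopore data.
squiggle = [80.0, 82.0, 81.0, 79.0, 110.0, 112.0, 111.0, 109.0]
approx, detail = haar_step(squiggle)
print(len(squiggle), "->", len(approx))  # 8 -> 4
restored = inverse_haar_step(approx, detail)
print(all(abs(x - y) < 1e-9 for x, y in zip(squiggle, restored)))  # True
```

Keeping only the coarse coefficients halves the number of values per level while preserving the step-like shape of the signal, which is the kind of size reduction a small quantum register would need.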

Biodata Mining 16(1): 27. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557310/pdf/

Citations: 0

Disclosing transcriptomics network-based signatures of glioma heterogeneity using sparse methods.

IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-09-26 DOI: 10.1186/s13040-023-00341-1

Sofia Martins, Roberta Coletti, Marta B Lopes

Gliomas are primary malignant brain tumors with poor survival and high resistance to available treatments. Improving the molecular understanding of glioma and disclosing novel biomarkers of tumor development and progression could help to find novel targeted therapies for this type of cancer. Public databases such as The Cancer Genome Atlas (TCGA) provide an invaluable source of molecular information on cancer tissues. Machine learning tools show promise in dealing with the high dimension of omics data and extracting relevant information from it. In this work, network inference and clustering methods, namely Joint Graphical lasso and Robust Sparse K-means Clustering, were applied to RNA-sequencing data from TCGA glioma patients to identify shared and distinct gene networks among different types of glioma (glioblastoma, astrocytoma, and oligodendroglioma) and to disclose new patient groups and the relevant genes behind the groups' separation. The results suggest that astrocytoma and oligodendroglioma are more similar to each other than to glioblastoma, highlighting the molecular differences between glioblastoma and the other glioma subtypes. After a comprehensive literature search on the relevant genes pointed out by our analysis, we identified potential candidates for biomarkers of glioma. Further molecular validation of these genes is encouraged to understand their potential role in diagnosis and in the design of novel therapies.
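Sparse methods of the kind named above typically select genes by shrinking small quantities exactly to zero. The sketch below shows the soft-thresholding operator that underlies lasso-type penalties; it is a generic illustration with hypothetical gene names, scores, and threshold, and it is not the Joint Graphical lasso or Robust Sparse K-means procedure itself.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


# Hypothetical per-gene scores (for example, partial correlations or weights).
scores = {"gene_a": 0.9, "gene_b": 0.15, "gene_c": -0.4, "gene_d": 0.05}
lam = 0.2  # hypothetical penalty strength

shrunk = {gene: soft_threshold(value, lam) for gene, value in scores.items()}
selected = [gene for gene, value in shrunk.items() if value != 0.0]
print(selected)  # ['gene_a', 'gene_c']
```

Raising the penalty strength leaves fewer non-zero entries, which is how such methods trade detail for a shorter, more interpretable gene list.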

Biodata Mining 16(1): 26. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10523751/pdf/

Citations: 0

STAR_outliers: a python package that separates univariate outliers from non-normal distributions.

IF 4.5, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-09-04 DOI: 10.1186/s13040-023-00342-0

John T Gregg, Jason H Moore

There are currently no univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, fewer model kurtosis, and none model bimodality or monotonicity. To overcome these limitations, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and it also models the outlier bounds of real data distributions with greater accuracy.

Background: Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow for univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.

Results: To model arbitrarily shaped univariate distributions effectively, we have combined several well-established algorithms into a new algorithm called STAR_outliers. STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR-based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Since the IF algorithm's default model fit the outliership scores poorly, we also compared the isolation forest algorithm with a model that removes as many datapoints as STAR_outliers does, in order of decreasing outliership score. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. We show that, on average, our STAR_outliers algorithm removes significantly closer to 0.7 percent of values from these features than other outlier removal methods.

Conclusions: STAR_outliers is an easily implemented python package for removing outliers that outperforms multiple commonly used methods of univariate outlier removal.
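To make the 99.3 percent threshold and the light-tail problem concrete, the following is a minimal sketch using only the Python standard library. It is not the STAR_outliers algorithm; it shows the classic Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) as a stand-in for a normality-style baseline. That the 99.3 percent figure was chosen to match these fences is an assumption, and the helper names (`tukey_fences`, `normal_coverage`, `lognormal_grid`) are hypothetical.

```python
import math
from statistics import NormalDist, quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def normal_coverage(k=1.5):
    """Fraction of a standard normal distribution lying inside the fences."""
    z = NormalDist()
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    iqr = q3 - q1
    return z.cdf(q3 + k * iqr) - z.cdf(q1 - k * iqr)


def lognormal_grid(n=1000):
    """Deterministic right-skewed sample: lognormal quantiles on an even grid."""
    z = NormalDist()
    return [math.exp(z.inv_cdf((i + 0.5) / n)) for i in range(n)]


sample = lognormal_grid()            # regular data only, no outliers added
lower, upper = tukey_fences(sample)
flagged_high = sum(v > upper for v in sample) / len(sample)

print(f"normal coverage of 1.5*IQR fences: {normal_coverage():.4f}")
print(f"lognormal fences: ({lower:.2f}, {upper:.2f})")
print(f"regular points flagged in the heavy tail: {flagged_high:.1%}")
```

Under normality the fences keep about 99.3 percent of the distribution, which is the coverage the abstract uses as its target. On the skewed sample the lower fence is negative, so no positive value can ever be flagged in the light tail, while several percent of regular points are removed from the heavy tail. These are the two failure modes the Background paragraph describes.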

Biodata Mining 16(1), 25. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10476292/pdf/

Citations: 0

Machine learning based study for the classification of Type 2 diabetes mellitus subtypes.

IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-08-22 DOI: 10.1186/s13040-023-00340-2

Nelson E Ordoñez-Guillen, Jose Luis Gonzalez-Compean, Ivan Lopez-Arevalo, Miguel Contreras-Murillo, Edwin Aldana-Bobadilla

Purpose: Data-driven diabetes research has shown increasing interest in exploring the heterogeneity of the disease, aiming to support the development of more specific prognoses and treatments within so-called precision medicine. Recently, one such study found five diabetes subgroups with varying risks of complications and treatment responses. Here, we develop and assess different models for classifying Type 2 Diabetes (T2DM) subtypes through machine learning approaches, with the aim of providing a performance comparison and new insights on the matter.

Methods: We developed a three-stage methodology. In the first stage, we preprocessed the public databases NHANES (USA) and ENSANUT (Mexico) to construct a dataset with N = 10,077 adult diabetes patient records. We used N = 2,768 records for training/validation of models and held out the remaining N = 7,309 for testing. In the second stage, we identified groups of observations, each one representing a T2DM subtype. We tested different clustering techniques and strategies and validated them using internal and external clustering indices, obtaining two annotated datasets, Dset A and Dset B. In the third stage, we developed different classification models, assaying four algorithms, seven input-data schemes, and two validation settings on each annotated dataset. We also tested the obtained models using a majority-vote approach to classify unseen patient records in the hold-out dataset.
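The majority-vote step for unseen records can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the three toy models, the single feature `x`, the labels `"A"` and `"B"`, and the tie-breaking rule (first label seen wins) are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; ties go to the label that appears first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def classify_records(models, records):
    """Ask every model for a label and keep the majority label per record."""
    return [majority_vote([model(record) for model in models]) for record in records]


# Hypothetical stand-ins for trained classifiers: each is a threshold on one feature.
toy_models = [
    lambda r: "A" if r["x"] > 1 else "B",
    lambda r: "A" if r["x"] > 2 else "B",
    lambda r: "A" if r["x"] > 3 else "B",
]
toy_records = [{"x": 0.5}, {"x": 2.5}, {"x": 1.5}]

predicted = classify_records(toy_models, toy_records)
print(predicted)  # ['B', 'A', 'B']
```

Using an odd number of voters avoids ties in the two-class case. With more than two subtypes a tie-breaking rule is still needed, which is why the sketch makes one explicit.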

Results: From the independently obtained bootstrap validation for Dset A and Dset B, mean accuracies across all seven data schemes were [Formula: see text] ([Formula: see text]) and [Formula: see text] ([Formula: see text]), respectively. Best accuracies were [Formula: see text] and [Formula: see text]. Results from both validation settings were consistent. For the hold-out dataset, results were consistent with most of those reported in the literature in terms of class proportions.
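The abstract does not spell out how the bootstrap validation was run. As one possible reading, the sketch below resamples records with replacement, fits on the resampled set, scores on the records left out of that resample (out-of-bag), and reports the mean and standard deviation of the accuracies. The out-of-bag scoring, the nearest-centroid toy classifier, and the synthetic one-feature data are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev


def fit_nearest_centroid(xs, ys):
    """Toy one-feature classifier: predict the label with the closest class mean."""
    centroids = {
        label: mean(x for x, y in zip(xs, ys) if y == label)
        for label in sorted(set(ys))
    }
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))


def bootstrap_accuracy(xs, ys, fit, rounds=100, seed=0):
    """Mean and standard deviation of out-of-bag accuracy over bootstrap rounds."""
    rng = random.Random(seed)
    n = len(xs)
    accuracies = []
    for _ in range(rounds):
        in_bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        chosen = set(in_bag)
        out_of_bag = [i for i in range(n) if i not in chosen]
        if not out_of_bag:
            continue  # nothing left to score on in this round
        model = fit([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        hits = sum(model(xs[i]) == ys[i] for i in out_of_bag)
        accuracies.append(hits / len(out_of_bag))
    return mean(accuracies), stdev(accuracies)


# Synthetic, partly overlapping classes (not patient data).
xs = [i * 0.1 for i in range(20)] + [1.0 + i * 0.1 for i in range(20)]
ys = ["A"] * 20 + ["B"] * 20

mean_acc, std_acc = bootstrap_accuracy(xs, ys, fit_nearest_centroid)
print(f"bootstrap accuracy: {mean_acc:.3f} (sd {std_acc:.3f})")
```

Reporting a mean with its spread matches the "value (value)" shape of the results line above, although which statistic the parenthesised figure represents is not recoverable from this listing.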

Conclusion: The development of machine learning systems for the classification of diabetes subtypes is an important step toward supporting physicians in fast and timely decision-making. We expect to deploy this methodology in a data analysis platform to conduct studies for identifying T2DM subtypes in patient records from hospitals.

Biodata Mining 16(1), 24. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10463725/pdf/

Citations: 0

Assessment of emerging pretraining strategies in interpretable multimodal deep learning for cancer prognostication.

IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2023-07-22 DOI: 10.1186/s13040-023-00338-w

Zarif L Azher, Anish Suvarna, Ji-Qing Chen, Ze Zhang, Brock C Christensen, Lucas A Salas, Louis J Vaickus, Joshua J Levy

Background: Deep learning models can infer cancer patient prognosis from molecular and anatomic pathology information. Recent studies that leveraged information from complementary multimodal data improved prognostication, further illustrating the potential utility of such methods. However, current approaches do not 1) comprehensively leverage biological and histomorphological relationships or 2) make use of emerging strategies to "pretrain" models (i.e., train models on a slightly orthogonal dataset/modeling objective), which may aid prognostication by reducing the amount of information required to achieve optimal performance. In addition, model interpretation is crucial for facilitating the clinical adoption of deep learning methods by fostering practitioner understanding of and trust in the technology.

Methods: Here, we develop an interpretable multimodal modeling framework that combines DNA methylation, gene expression, and histopathology (i.e., tissue slides) data, and we compare the performance of crossmodal pretraining, contrastive learning, and transfer learning against the standard procedure.
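The abstract names contrastive learning as one of the compared pretraining strategies but does not give the objective. As a rough sketch only, assuming an InfoNCE-style loss over paired embeddings from two modalities (for example, one vector per patient from each data type), the idea is that each patient's two embeddings should be more similar to each other than to any other patient's. The function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(embs_a, embs_b, temperature=0.1):
    """Average InfoNCE loss; row i of embs_a is paired with row i of embs_b."""
    n = len(embs_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(embs_a[i], embs_b[j]) / temperature for j in range(n)]
        top = max(logits)  # subtract the max before exponentiating for stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]  # negative log-softmax of the true pair
    return total / n


modality_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[1.0, 0.0], [0.0, 1.0]]      # each patient matches its own pair
mismatched_b = [[0.0, 1.0], [1.0, 0.0]]   # pairs swapped between patients

loss_aligned = info_nce(modality_a, aligned_b)
loss_mismatched = info_nce(modality_a, mismatched_b)
print(f"aligned: {loss_aligned:.4f}, mismatched: {loss_mismatched:.4f}")
```

The loss is near zero when paired embeddings agree and large when they are swapped, which is the signal a pretraining stage would minimise before the prognostic model is fine-tuned.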

Results: Our models outperform the existing state-of-the-art method (average 11.54% C-index increase) and baseline clinically driven models (average 11.7% C-index increase). Model interpretations show that biologically meaningful factors are considered in making prognosis predictions.
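The results above are reported in terms of the C-index (concordance index). For readers unfamiliar with it, here is a minimal sketch of Harrell's C-index for right-censored survival data: among comparable patient pairs, it is the fraction in which the patient who died earlier was assigned the higher predicted risk. The function name and the toy data are illustrative and are not taken from the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    times  : observed follow-up time per patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    risks  : predicted risk score, higher meaning worse prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(times, events, risks))  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. In the toy data, five pairs are comparable and four are ordered correctly, giving 0.8.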

Discussion: Our results demonstrate that the selection of pretraining strategies is crucial for obtaining highly accurate prognostication models, even more so than devising an innovative model architecture, and further emphasize the all-important role of the tumor microenvironment in disease progression.

Biodata Mining 16(1), 23. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10363299/pdf/

Citations: 2