
Biodata Mining: Latest Articles

Deep learning-based approaches for multi-omics data integration and analysis.
IF 4, Zone 3 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2024-10-02. DOI: 10.1186/s13040-024-00391-z
Jenna L Ballard, Zexuan Wang, Wenrui Li, Li Shen, Qi Long

Background: The rapid growth of deep learning, together with the vast and ever-growing amount of available data, has provided ample opportunity for advances in the fusion and analysis of complex and heterogeneous data types. Different data modalities provide complementary information that can be leveraged to gain a more complete understanding of each subject. In the biomedical domain, multi-omics data includes molecular (genomics, transcriptomics, proteomics, epigenomics, metabolomics, etc.) and imaging (radiomics, pathomics) modalities which, when combined, have the potential to improve performance on prediction, classification, clustering and other tasks. Deep learning encompasses a wide variety of methods, each of which has certain strengths and weaknesses for multi-omics integration.

Method: In this review, we categorize recent deep learning-based approaches by their basic architectures and discuss their unique capabilities in relation to one another. We also discuss some emerging themes advancing the field of multi-omics integration.

Results: Deep learning-based multi-omics integration methods were categorized broadly into non-generative (feedforward neural networks, graph convolutional neural networks, and autoencoders) and generative (variational methods, generative adversarial models, and a generative pretrained model). Generative methods have the advantage of being able to impose constraints on the shared representations to enforce certain properties or incorporate prior knowledge. They can also be used to generate or impute missing modalities. Recent advances achieved by these methods include handling incomplete data and going beyond the traditional molecular omics data types to integrate other modalities such as imaging data.

Conclusion: We expect to see further growth in methods that can handle missingness, as this is a common challenge in working with complex and heterogeneous data. Additionally, methods that integrate more data types are expected to improve performance on downstream tasks by capturing a comprehensive view of each sample.

Citations: 0
Assessing the limitations of relief-based algorithms in detecting higher-order interactions.
IF 4, Zone 3 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2024-10-01. DOI: 10.1186/s13040-024-00390-0
Philip J Freda, Suyu Ye, Robert Zhang, Jason H Moore, Ryan J Urbanowicz

Background: Epistasis, the interaction between genetic loci where the effect of one locus is influenced by one or more other loci, plays a crucial role in the genetic architecture of complex traits. However, as the number of loci considered increases, the investigation of epistasis becomes exponentially more complex, making the selection of key features vital for effective downstream analyses. Relief-Based Algorithms (RBAs) are often employed for this purpose due to their reputation as "interaction-sensitive" algorithms and uniquely non-exhaustive approach. However, the limitations of RBAs in detecting interactions, particularly those involving multiple loci, have not been thoroughly defined. This study seeks to address this gap by evaluating the efficiency of RBAs in detecting higher-order epistatic interactions. Motivated by previous findings that suggest some RBAs may rank predictive features involved in higher-order epistasis negatively, we explore the potential of absolute value ranking of RBA feature weights as an alternative approach for capturing complex interactions. In this study, we assess the performance of ReliefF, MultiSURF, and MultiSURFstar on simulated genetic datasets that model various patterns of genotype-phenotype associations, including 2-way to 5-way genetic interactions, and compare their performance to two control methods: a random shuffle and mutual information.

Results: Our findings indicate that while RBAs effectively identify lower-order (2-way to 3-way) interactions, their capability to detect higher-order interactions is significantly limited, primarily by large feature count but also by signal noise. Specifically, we observe that RBAs are successful in detecting fully penetrant 4-way XOR interactions using an absolute value ranking approach, but this is restricted to datasets with only 20 total features.

Conclusions: These results highlight the inherent limitations of current RBAs and underscore the need for the development of Relief-based approaches with enhanced detection capabilities for the investigation of epistasis, particularly in datasets with large feature counts and complex higher-order interactions.
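The absolute value ranking idea described in this abstract can be illustrated with a minimal sketch. The feature names and weights below are hypothetical and are not taken from the study; the function `rank_features` is an illustrative helper, not part of any Relief implementation. The sketch only shows why features that receive strongly negative weights fall to the bottom of a signed ranking but rise to the top when ranked by magnitude.

```python
def rank_features(weights, use_absolute=False):
    """Return feature names ordered from highest to lowest score.

    weights maps a feature name to its (signed) Relief-style weight.
    With use_absolute=True the ranking uses |weight| instead of weight.
    """
    def score(item):
        _, weight = item
        return abs(weight) if use_absolute else weight

    ordered = sorted(weights.items(), key=score, reverse=True)
    return [name for name, _ in ordered]


# Hypothetical weights: X1-X4 stand for interacting features that were
# scored negatively, N1-N3 stand for uninformative features near zero.
weights = {
    "X1": -0.42, "X2": -0.38, "X3": -0.35, "X4": -0.40,
    "N1": 0.05, "N2": -0.02, "N3": 0.01,
}

signed_ranking = rank_features(weights)
absolute_ranking = rank_features(weights, use_absolute=True)

print("signed:  ", signed_ranking)
print("absolute:", absolute_ranking)
```

In this toy case the signed ranking places the three noise features first, while the absolute ranking places the four interacting features first.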

Citations: 0
Identifying heterogeneous subgroups of systemic autoimmune diseases by applying a joint dimension reduction and clustering approach to immunomarkers
IF 4.5, Zone 3 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2024-09-16. DOI: 10.1186/s13040-024-00389-7
Chia-Wei Chang, Hsin-Yao Wang, Wan-Ying Lin, Yu-Chiang Wang, Wei-Lin Lo, Ting-Wei Lin, Jia-Ruei Yu, Yi-Ju Tseng
The high complexity of systemic autoimmune diseases (SADs) has hindered precise management. This study aims to investigate heterogeneity in SADs. We applied a joint cluster analysis, which combined multiple correspondence analysis and k-means, to immunomarkers and measured the heterogeneity of clusters by examining differences in immunomarkers and clinical manifestations. The electronic health records of patients who received an antinuclear antibody test and were diagnosed with SADs, namely systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), and Sjögren’s syndrome (SS), were retrieved between 2001 and 2016 from hospitals in Taiwan. With distinctive patterns of immunomarkers, a total of 11,923 patients with the three SADs were grouped into six clusters. None of the clusters was composed only of a single SAD, and these clusters demonstrated considerable differences in clinical manifestation. Patients with SLE and patients with SS were both more dispersed across the six clusters. Among patients with SLE, the occurrence of renal compromise was higher in Clusters 3 and 6 (52% and 51%) than in the other clusters (p < 0.001). Cluster 3 also had a higher proportion of patients with discoid lupus (60%) than did Cluster 6 (39%; p < 0.001). Patients with SS in Cluster 3 were the most distinctive because of the high occurrence of immunity disorders (63%) and other and unspecified benign neoplasm (58%), both significantly different from the other clusters (all p < 0.05). The immunomarker-driven clustering method could recognise more clinically relevant subgroups of SADs and could provide a more precise basis for diagnosis.
Citations: 0
Development, evaluation and comparison of machine learning algorithms for predicting in-hospital patient charges for congestive heart failure exacerbations, chronic obstructive pulmonary disease exacerbations and diabetic ketoacidosis
IF 4.5, Zone 3 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2024-09-12. DOI: 10.1186/s13040-024-00387-9
Monique Arnold, Lathan Liou, Mary Regina Boland
Hospitalizations for exacerbations of congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD) and diabetic ketoacidosis (DKA) are costly in the United States. The purpose of this study was to predict in-hospital charges for each condition using machine learning (ML) models. We conducted a retrospective cohort study on national discharge records of hospitalized adult patients from January 1st, 2016, to December 31st, 2019. We constructed six ML models (linear regression, ridge regression, support vector machine, random forest, gradient boosting and extreme gradient boosting) to predict total in-hospital cost for admission for each condition. Our models had good predictive performance, with testing R-squared values of 0.701-0.750 (mean 0.713) for CHF; 0.694-0.724 (mean 0.709) for COPD; and 0.615-0.729 (mean 0.694) for DKA. We identified key features driving costs, including patient age, length of stay, number of procedures, and elective/nonelective admission. ML methods may be used to accurately predict costs and identify drivers of high cost for COPD exacerbations, CHF exacerbations and DKA. Overall, our findings may inform future studies that seek to decrease the underlying high patient costs for these conditions.
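For readers unfamiliar with the metric reported in this abstract, the following minimal sketch shows how a testing R-squared value is computed from observed and predicted charges. The charge figures are hypothetical and are not drawn from the study; `r_squared` is an illustrative helper using the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    if ss_total == 0:
        raise ValueError("R-squared is undefined when all observed values are equal")
    return 1.0 - ss_residual / ss_total


# Hypothetical held-out charges (in dollars) and model predictions.
observed_charges = [10000, 20000, 30000, 40000]
predicted_charges = [12000, 18000, 33000, 37000]

score = r_squared(observed_charges, predicted_charges)
print(f"testing R-squared = {score:.3f}")
```

A value of 0.70 to 0.75, as reported above, means the model accounts for roughly 70 to 75 percent of the variance in charges on the test set.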
Citations: 0
Private pathological assessment via machine learning and homomorphic encryption
IF 4.5, Zone 3 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2024-09-10. DOI: 10.1186/s13040-024-00379-9
Ahmad Al Badawi, Mohd Faizal Bin Yusof
The objective of this research is to explore the applicability of machine learning and fully homomorphic encryption (FHE) in private pathological assessment, with a focus on the inference phase of support vector machines (SVM) for the classification of confidential medical data. A framework is introduced that utilizes the Cheon-Kim-Kim-Song (CKKS) FHE scheme, facilitating the execution of SVM inference on encrypted datasets. This framework ensures the privacy of patient data and removes the need for decryption during the analytical process. Additionally, an efficient feature extraction technique is presented for the transformation of medical imagery into vectorial representations. The system’s evaluation across various datasets substantiates its practicality and efficacy. The proposed method delivers classification accuracy and performance on par with traditional, non-encrypted SVM inference, while upholding a 128-bit security level against established cryptographic attacks targeting the CKKS scheme. The secure inference process completes within seconds. The findings of this study underscore the viability of FHE in enhancing the security and efficiency of bioinformatics analyses, potentially benefiting fields such as cardiology, oncology, and medical imagery. The implications of this research are significant for the future of privacy-preserving machine learning, promoting progress in diagnostic procedures, tailored medical treatments, and clinical investigations.
Citations: 0
Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data
IF 4.5, Zone 3 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2024-09-10. DOI: 10.1186/s13040-024-00388-8
Erika Cantor, Sandra Guauque-Olarte, Roberto León, Steren Chabert, Rodrigo Salas
The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomics data. Although random forest (RF) represents a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, exemplifying its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm to determine the relevance of each gene based on its connection and localization on a protein-protein interaction network. Then, each relevance is used to modify the selection probability to draw a gene as a candidate split-feature in the conventional RF. Experiments in simulated datasets with very small sample sizes $(n \le 30)$ comparing knowledge-slanted RF against conventional RF and logistic lasso regression suggest an improved precision in outcome prediction compared to the other methods. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case study identifying genes relevant to calcific aortic valve stenosis.
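The two stages described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the four-gene network, the seed gene, the restart probability of 0.3, and the helper names `random_walk_with_restart` and `draw_candidates` are all hypothetical. The first function computes a relevance score for each gene from a toy interaction graph; the second uses those scores as sampling weights when drawing candidate split features, in place of the uniform draw of a conventional random forest.

```python
import random


def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return the stationary visiting probability of each node.

    adjacency maps each node to a list of its neighbours (undirected graph).
    seeds is the set of nodes the walker jumps back to on a restart.
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    current = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed.
        updated = {n: restart * start[n] for n in nodes}
        for u in nodes:
            walk_mass = (1.0 - restart) * current[u]
            neighbours = adjacency[u]
            if neighbours:
                # Otherwise it moves to a uniformly chosen neighbour.
                for v in neighbours:
                    updated[v] += walk_mass / len(neighbours)
            else:
                # Isolated node: return its mass to the seeds so the total stays 1.
                for s in seeds:
                    updated[s] += walk_mass / len(seeds)
        change = sum(abs(updated[n] - current[n]) for n in nodes)
        current = updated
        if change < tol:
            break
    return current


def draw_candidates(probabilities, k, rng):
    """Draw k distinct features, each draw weighted by its relevance."""
    pool = dict(probabilities)
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


# Hypothetical chain-shaped interaction network with one seed gene.
network = {"G1": ["G2"], "G2": ["G1", "G3"], "G3": ["G2", "G4"], "G4": ["G3"]}
relevance = random_walk_with_restart(network, seeds={"G1"})
candidates = draw_candidates(relevance, k=2, rng=random.Random(0))

print({gene: round(score, 3) for gene, score in relevance.items()})
print("candidate split features:", candidates)
```

In this toy network the relevance decreases with distance from the seed gene, so genes close to the seed are more likely to be offered as split candidates.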
Citations: 0
Enhanced labor pain monitoring using machine learning and ECG waveform analysis for uterine contraction-induced pain.
IF 4, Zone 3 (Biology), Q1 Mathematical & Computational Biology. Pub Date: 2024-09-07. DOI: 10.1186/s13040-024-00383-z
Yuan-Chia Chu, Saint Shiou-Sheng Chen, Kuen-Bao Chen, Jui-Sheng Sun, Tzu-Kuei Shen, Li-Kuei Chen

Objectives: This study aims to develop an innovative approach for monitoring and assessing labor pain through ECG waveform analysis, utilizing machine learning techniques to monitor pain resulting from uterine contractions.

Methods: The study was conducted at National Taiwan University Hospital between January and July 2020. We collected a dataset of 6010 ECG samples from women preparing for natural spontaneous delivery (NSD). The ECG data was used to develop an ECG waveform-based Nociception Monitoring Index (NoM). The dataset was divided into training (80%) and validation (20%) sets. Multiple machine learning models, including LightGBM, XGBoost, SnapLogisticRegression, and SnapDecisionTree, were developed and evaluated. Hyperparameter optimization was performed using grid search and five-fold cross-validation to enhance model performance.
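The tuning procedure described in the Methods (an 80/20 split followed by grid search with five-fold cross-validation) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the data are synthetic 1-D points, and a tiny k-nearest-neighbour classifier stands in for LightGBM and the other models, whose hyperparameter grids the abstract does not give. All function names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to a 1-D query."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def cv_accuracy(data, k_neighbors, folds):
    """Mean held-out accuracy over the supplied folds."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        correct = sum(
            knn_predict(train, data[i][0], k_neighbors) == data[i][1]
            for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

def grid_search(data, grid, n_folds=5, seed=0):
    """Score every candidate hyperparameter with the same folds; keep the best."""
    folds = k_fold_indices(len(data), n_folds, seed)
    results = {k: cv_accuracy(data, k, folds) for k in grid}
    best = max(results, key=results.get)
    return best, results

# Synthetic stand-in data (assumption): two overlapping 1-D classes.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(50)]

# 80% training / 20% validation split, as in the abstract.
rng.shuffle(data)
cut = int(0.8 * len(data))
train_set, validation_set = data[:cut], data[cut:]

# Five-fold cross-validated grid search on the training portion only.
best_k, cv_scores = grid_search(train_set, grid=[1, 3, 5, 7])
```

Reusing the same folds for every grid point keeps the comparison between candidates fair, and the validation set stays untouched until a final model has been chosen.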

Results: The LightGBM model demonstrated superior performance with an AUC of 0.96 and an accuracy of 90%, making it the optimal model for monitoring labor pain based on ECG data. Other models, such as XGBoost and SnapLogisticRegression, also showed strong performance, with AUC values ranging from 0.88 to 0.95.

Conclusions: This study demonstrates that the integration of machine learning algorithms with ECG data significantly enhances the accuracy and reliability of labor pain monitoring. Specifically, the LightGBM model exhibits exceptional precision and robustness in continuous pain monitoring during labor, with potential applicability extending to broader healthcare settings.

Trial registration: ClinicalTrials.gov Identifier: NCT04461704.

Citations: 0
The goldmine of GWAS summary statistics: a systematic review of methods and tools.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-09-05 DOI: 10.1186/s13040-024-00385-x
Panagiota I Kontou, Pantelis G Bagos

Genome-wide association studies (GWAS) have revolutionized our understanding of the genetic architecture of complex traits and diseases. GWAS summary statistics have become essential tools for various genetic analyses, including meta-analysis, fine-mapping, and risk prediction. However, the increasing number of GWAS summary statistics and the diversity of software tools available for their analysis can make it challenging for researchers to select the most appropriate tools for their specific needs. This systematic review aims to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics analysis. We conducted a comprehensive literature search to identify relevant software tools and databases. We categorized the tools and databases by their functionality, including data management, quality control, single-trait analysis, and multiple-trait analysis. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a total of 305 functioning software tools and databases dedicated to GWAS summary statistics, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of GWAS summary statistics analysis.

Citations: 0
Processing imbalanced medical data at the data level with assisted-reproduction data as an example.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-09-04 DOI: 10.1186/s13040-024-00384-y
Junliang Zhu, Shaowei Pu, Jiaji He, Dongchao Su, Weijie Cai, Xueying Xu, Hongbo Liu

Objective: Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios.

Methods: We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Various datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. Metrics such as AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods (SMOTE, ADASYN, OSS, and CNN) were applied to datasets with low positive rates and small sample sizes to assess their effectiveness.
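The threshold-based metrics listed in the Methods can all be derived from one confusion matrix. The sketch below uses the standard definitions (G-mean as the geometric mean of recall and specificity, F1 as the harmonic mean of precision and recall). The labels are invented for illustration and the function names are hypothetical. AUC is omitted because it requires ranked scores rather than hard predictions.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def imbalance_metrics(y_true, y_pred):
    """Accuracy, recall, precision, F1 and G-mean; undefined ratios become 0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
    }

# Invented example: 10% positive rate, classifier that always predicts negative.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100
metrics = imbalance_metrics(y_true, always_negative)
```

In this toy case accuracy is 0.9 while recall, F1 and G-mean are all 0, which is the usual argument for reporting G-mean and F1 alongside accuracy on imbalanced data.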

Results: The logistic model's performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes.

Conclusions: The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
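The core idea behind SMOTE, one of the two oversampling methods recommended here, is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal sketch of that interpolation step only, not the implementation used in the study. The function name, parameters and toy coordinates are assumptions made for illustration.

```python
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority points by neighbour interpolation.

    minority: list of equal-length numeric tuples (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean).
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        # New point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Toy minority class (invented coordinates).
minority = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (3, 1)]
new_points = smote_sketch(minority, n_synthetic=20, k=3)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies. ADASYN follows the same interpolation idea but generates more points around minority samples that are harder to classify.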

Citations: 0
QIGTD: identifying critical genes in the evolution of lung adenocarcinoma with tensor decomposition.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-09-04 DOI: 10.1186/s13040-024-00386-w
Bolin Chen, Jinlei Zhang, Ci Shao, Jun Bian, Ruiming Kang, Xuequn Shang

Background: Identifying critical genes is important for understanding the pathogenesis of complex diseases. Traditional studies typically compare changes in biomolecules between normal and disease samples, or detect important vertices in a single static biomolecular network, and therefore often overlook the dynamic changes that occur between different disease stages. Investigating temporal changes in biomolecular networks and identifying critical genes are, however, essential for understanding the occurrence and development of diseases.

Methods: This study proposes a novel method called Quantifying Importance of Genes with Tensor Decomposition (QIGTD). It first constructs a time series network by integrating both intra- and inter-temporal network information, preserving connections between networks at adjacent stages according to their local similarities. A tensor describes the connections of this time series network, and a third-order tensor decomposition method captures both the topological information of each network snapshot and the time series characteristics of the whole network. QIGTD is also a learning-free and efficient method that can be applied to datasets with a small number of samples.
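To make the tensor idea concrete, the sketch below stacks per-stage adjacency matrices into a gene x gene x stage tensor and extracts a rank-1 approximation by alternating power iteration, reading the gene-mode factor as an importance score. This is a generic rank-1 decomposition chosen for illustration, not the QIGTD decomposition itself. The inter-temporal links described in the abstract are not modelled, and the toy network and all names are assumptions.

```python
import math

def stack_snapshots(snapshots):
    """Turn a list of n x n adjacency matrices into tensor[i][j][t]."""
    n = len(snapshots[0])
    return [[[snap[i][j] for snap in snapshots] for j in range(n)]
            for i in range(n)]

def normalise(vec):
    """Scale a vector to unit Euclidean length (unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def rank1_factors(tensor, n_iter=100):
    """Alternating power iteration for a rank-1 fit u (x) v (x) w."""
    n, m, steps = len(tensor), len(tensor[0]), len(tensor[0][0])
    u = normalise([1.0] * n)
    v = normalise([1.0] * m)
    w = normalise([1.0] * steps)
    for _ in range(n_iter):
        u = normalise([sum(tensor[i][j][t] * v[j] * w[t]
                           for j in range(m) for t in range(steps))
                       for i in range(n)])
        v = normalise([sum(tensor[i][j][t] * u[i] * w[t]
                           for i in range(n) for t in range(steps))
                       for j in range(m)])
        w = normalise([sum(tensor[i][j][t] * u[i] * v[j]
                           for i in range(n) for j in range(m))
                       for t in range(steps)])
    return u, v, w

def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix from an undirected edge list."""
    mat = [[0] * n for _ in range(n)]
    for a, b in edges:
        mat[a][b] = mat[b][a] = 1
    return mat

# Toy network (invented): 4 genes over 3 stages; gene 0 is a persistent hub.
stages = [
    adjacency(4, [(0, 1), (0, 2)]),
    adjacency(4, [(0, 1), (0, 2), (0, 3)]),
    adjacency(4, [(0, 1), (0, 2), (0, 3), (1, 2)]),
]
tensor = stack_snapshots(stages)
gene_scores, _, stage_weights = rank1_factors(tensor)
top_gene = max(range(len(gene_scores)), key=lambda i: gene_scores[i])
```

In this toy example the hub gene receives the largest score, and the stage factor gives more weight to the denser later snapshots. No training labels are involved, which is consistent with the learning-free property the abstract mentions.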

Results: The effectiveness of QIGTD was evaluated on lung adenocarcinoma (LUAD) datasets, with three state-of-the-art methods (T-degree, T-closeness, and T-betweenness) employed as benchmarks. Numerical experiments demonstrate that QIGTD outperforms these methods in terms of both precision and mAP. Notably, of the top 50 genes, 29 have been verified as highly related to LUAD according to the DisGeNET database, and 36 are significantly enriched in LUAD-related Gene Ontology (GO) terms, including nuclear division, mitotic nuclear division, chromosome segregation, organelle fission, and mitotic sister chromatid segregation.
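The two ranking metrics named in the Results can be sketched as follows. Precision at k is the fraction of the top k ranked genes that are in a reference set, so 29 verified genes among the top 50 gives a precision at 50 of 0.58. Average precision (AP) averages the precision at each rank where a relevant gene appears, and mAP is the mean of AP over several rankings. The abstract does not state which AP normalisation was used. This sketch divides by the number of relevant genes retrieved, which is one common convention, and the gene names are placeholders.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked items that are in the relevant set."""
    return sum(1 for gene in ranked[:k] if gene in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for rank, gene in enumerate(ranked, start=1):
        if gene in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(rankings, relevant):
    """mAP over several ranked lists scored against one reference set."""
    return sum(average_precision(r, relevant) for r in rankings) / len(rankings)

# Placeholder gene identifiers, invented for illustration.
ranked_genes = ["g1", "g2", "g3", "g4"]
reference = {"g1", "g3"}
p_at_2 = precision_at_k(ranked_genes, reference, 2)
ap = average_precision(ranked_genes, reference)
```

Unlike precision at a fixed cut-off, AP rewards rankings that place verified genes nearer the top, which is why the two metrics are usually reported together.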

Conclusion: QIGTD effectively captures the temporal changes in gene networks and identifies critical genes. It provides a valuable tool for studying temporal dynamics in biological networks and can aid in understanding the underlying mechanisms of diseases such as LUAD.

Citations: 0