Automated ontology annotation of scientific literature plays a critical role in knowledge management, particularly in fields like biology and biomedicine, where accurate concept tagging can enhance information retrieval, semantic search, and knowledge integration. Traditional models for ontology annotation, such as Recurrent Neural Networks (RNNs) and Bidirectional Gated Recurrent Units (Bi-GRUs), have been effective but are limited in handling complex biomedical terminology and semantic nuance. This study explores the potential of large language models (LLMs), including MPT-7B, Phi, BiomedLM, and Meditron, for improving ontology annotation, specifically with Gene Ontology (GO) concepts. We fine-tuned these models on the CRAFT dataset, assessing their performance in terms of F1 score, semantic similarity, memory usage, and inference speed. Our results show that while Bi-GRU baselines remain competitive in raw accuracy, LLMs offer complementary strengths: in some cases they exhibit qualitatively higher semantic consistency, particularly when handling complex or multi-word ontology terms, though these observations are exploratory and not statistically verified across all model types. At the same time, the resource requirements of LLMs are notably high, raising concerns about computational efficiency. Techniques such as parameter-efficient fine-tuning (PEFT) and advanced prompting were explored to address these challenges, demonstrating potential for reducing computational demands while maintaining performance. Our findings suggest that while LLMs offer advantages in annotation accuracy, practical deployment should balance these benefits against resource costs. This research highlights the need for further optimization and domain-specific training to make LLMs a feasible choice for real-world biomedical ontology annotation tasks.
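The F1 evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's actual scoring code: it assumes each document's annotations are represented as a set of GO identifiers, computes micro-averaged F1 over predicted versus gold sets, and uses hypothetical GO IDs purely as example data.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-document GO annotation sets.

    gold, pred: lists of sets of GO identifiers, one set per document.
    Counts true positives, false positives, and false negatives pooled
    across all documents, then derives precision, recall, and F1.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted IDs
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious predictions
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed gold IDs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations for two documents (illustrative GO IDs only).
gold = [{"GO:0006915", "GO:0008283"}, {"GO:0016020"}]
pred = [{"GO:0006915"}, {"GO:0016020", "GO:0005575"}]
print(round(micro_f1(gold, pred), 3))  # → 0.667
```

Micro-averaging pools counts across documents before computing the ratio, so frequently annotated documents weigh more; a macro-averaged variant would instead average per-document F1 scores.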
