Bioinformatics (Oxford, England): Latest Articles

Predicting Explainable Dementia Types with LLM-aided Feature Engineering.
Pub Date : 2025-04-08 DOI: 10.1093/bioinformatics/btaf156
Aditya M Kashyap, Delip Rao, Mary Regina Boland, Li Shen, Chris Callison-Burch

Motivation: The integration of Machine Learning (ML) and Artificial Intelligence (AI) into healthcare has immense potential due to the rapidly growing volume of clinical data. However, existing AI models, particularly Large Language Models (LLMs) like GPT-4, face significant challenges in terms of explainability and reliability, particularly in high-stakes domains like healthcare.

Results: This paper proposes a novel LLM-aided feature engineering approach that enhances interpretability by extracting clinically relevant features from the Oxford Textbook of Medicine. By converting clinical notes into concept vector representations and employing a linear classifier, our method achieved an accuracy of 0.72, outperforming a traditional n-gram Logistic Regression baseline (0.64) and the GPT-4 baseline (0.48), while focusing on high-level clinical features. We also explore using Text Embeddings to reduce the overall time and cost of our approach by 97%.

Availability: All code relevant to this paper is available at: https://github.com/AdityaKashyap423/Dementia_LLM_Feature_Engineering/tree/main.

Supplementary information: Supplementary PDF and other data files can be found at https://drive.google.com/drive/folders/1UqdpsKFnvGjUJgp58k3RYcJ8zN8zPmWR?usp=share_link.
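
The concept-vector idea in the Results can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the concept names, the weights, the bias, and the class labels are invented, and plain keyword matching stands in for the LLM-based extraction the abstract describes. It is not the authors' implementation; it only shows why a linear model over named concepts is easy to inspect.

```python
# Hypothetical concept list; the paper derives its features from the
# Oxford Textbook of Medicine with LLM assistance, which is not reproduced here.
CONCEPTS = ["memory loss", "visual hallucinations", "stepwise decline"]

# Hypothetical weights and bias of an already-fitted linear classifier.
WEIGHTS = {"memory loss": 1.2, "visual hallucinations": -0.8, "stepwise decline": -1.5}
BIAS = -0.1


def concept_vector(note):
    """Binary vector: 1 if the concept string occurs in the note, else 0."""
    text = note.lower()
    return [1 if concept in text else 0 for concept in CONCEPTS]


def score(vector):
    """Linear decision score: bias plus the weighted sum of active concepts."""
    return BIAS + sum(WEIGHTS[c] * v for c, v in zip(CONCEPTS, vector))


def explain(vector):
    """Per-concept contributions to the score, for the concepts that are present."""
    return {c: WEIGHTS[c] * v for c, v in zip(CONCEPTS, vector) if v}


note = "Patient reports progressive memory loss over two years."
vec = concept_vector(note)
label = "type_A" if score(vec) > 0 else "type_B"  # placeholder class names
contributions = explain(vec)
```

Because each dimension of the vector is a named clinical concept, the contributions returned by `explain` can be read directly, which is the interpretability argument made in the abstract. Keyword matching does not handle negation ("no memory loss"), one reason a real pipeline would need a stronger extractor.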

Citations: 0
Rethinking GWAS: how lessons from genetic screens and artificial intelligence could reveal biological mechanisms.
Pub Date : 2025-04-08 DOI: 10.1093/bioinformatics/btaf153
Dennis J Hazelett

Motivation: Modern single-cell omics data are key to unraveling the complex mechanisms underlying risk for complex diseases revealed by genome-wide association studies (GWAS). Phenotypic screens in model organisms have several important parallels to GWAS, which I explore in this essay.

Results: I provide the historical context of such screens, comparing and contrasting similarities to association studies, and how these screens in model organisms can teach us what to look for. Then I consider how the results of GWAS might be exhaustively interrogated to interpret the biological mechanisms underpinning disease processes. Finally, I propose a general framework for tackling this problem computationally, and explore the data, mechanisms, and technology (both existing and yet to be invented) that are necessary to complete the task.

Availability and implementation: There are no data or code associated with this article.

Supplementary information: Not applicable.

Citations: 0
Gradient matching accelerates mixed-effects inference for biochemical networks.
Pub Date : 2025-04-08 DOI: 10.1093/bioinformatics/btaf154
Yulan B van Oppen, Andreas Milias-Argeitis

Motivation: Single-cell time series data often exhibit significant variability within an isogenic cell population. When modeling intracellular processes, it is therefore more appropriate to infer parameter distributions that reflect this variability, rather than fitting the population average to obtain a single point estimate. The Global Two-Stage (GTS) approach for nonlinear mixed-effects (NLME) models is a simple and modular method commonly used for this purpose. However, this method is computationally intensive due to its repeated use of non-convex optimization and numerical integration of the underlying system.

Results: We propose the Gradient Matching GTS (GMGTS) method as an efficient alternative to GTS. Gradient matching offers an integration-free approach to parameter estimation that is particularly powerful for systems that are linear in the unknown parameters, such as biochemical networks modeled by mass action kinetics. By incorporating gradient matching into the GTS framework, we expand its capabilities through uncertainty propagation calculations and an iterative estimation scheme for partially observed systems. Comparisons between GMGTS and GTS across various inference setups show that our method significantly reduces computational demands, facilitating the application of complex NLME models in systems biology.

Availability and implementation: A Matlab implementation of GMGTS is provided at https://github.com/yulanvanoppen/GMGTS (DOI: http://doi.org/10.5281/zenodo.14884457).

Supplementary information: Supplemental Information is available online and contains Tables S1-S4, Figures S1-S21, methodology, mathematical derivations, and software implementation details.
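
To make the "integration-free" point in the Results concrete, here is a minimal sketch of gradient matching for a one-parameter decay model dx/dt = -k x, which is linear in the unknown k. The model, the noise-free simulated data, the use of central finite differences instead of a smoother, and the naive averaging of per-cell estimates are all assumptions chosen for brevity; this is not the GMGTS algorithm or its Matlab implementation.

```python
import math


def simulate(k, times):
    """Noise-free trajectory of dx/dt = -k * x with x(0) = 1 (toy data)."""
    return [math.exp(-k * t) for t in times]


def gradient_matching_estimate(times, xs):
    """Least-squares estimate of k without solving the ODE.

    The derivative is approximated from the data by central differences,
    then k minimises sum_i (dxdt_i + k * x_i)^2, which has the closed form
    k = -sum(x_i * dxdt_i) / sum(x_i^2).
    """
    num = 0.0
    den = 0.0
    for i in range(1, len(xs) - 1):
        dxdt = (xs[i + 1] - xs[i - 1]) / (times[i + 1] - times[i - 1])
        num += -xs[i] * dxdt
        den += xs[i] ** 2
    return num / den


times = [0.1 * i for i in range(51)]
true_rates = [0.4, 0.5, 0.6]  # one hypothetical rate constant per cell

# First stage: an independent estimate for each cell.
estimates = [gradient_matching_estimate(times, simulate(k, times)) for k in true_rates]

# Naive second stage: summarise the population by the mean of the estimates.
population_mean = sum(estimates) / len(estimates)
```

Since the model is linear in k, each per-cell fit reduces to one closed-form least-squares solve, with no numerical integration and no non-convex optimisation. That is the source of the computational saving the abstract describes; the actual method additionally propagates uncertainty and handles partially observed systems.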

Citations: 0
ROICellTrack: A deep learning framework for integrating cellular imaging modalities in subcellular spatial transcriptomic profiling of tumor tissues.
Pub Date : 2025-04-08 DOI: 10.1093/bioinformatics/btaf152
Xiaofei Song, Xiaoqing Yu, Carlos M Moran-Segura, Hongzhi Xu, Tingyi Li, Joshua T Davis, Aram Vosoughi, G Daniel Grass, Roger Li, Xuefeng Wang

Motivation: Spatial transcriptomics (ST) technologies, such as GeoMx Digital Spatial Profiler, are increasingly utilized to investigate the role of diverse tumor microenvironment components, particularly in relation to cancer progression, treatment response, and therapeutic resistance. However, in many ST studies, the spatial information obtained from immunofluorescence imaging is primarily used for identifying regions of interest rather than as an integral part of downstream transcriptomic data analysis and interpretation.

Results: We developed ROICellTrack, a deep learning-based framework that better integrates cellular imaging with spatial transcriptomic profiling. By analyzing 56 ROIs from urothelial carcinoma of the bladder (UCB) and upper tract urothelial carcinoma (UTUC), ROICellTrack identified distinct cancer-immune cell mixtures, characterized by specific transcriptomic and morphological signatures and receptor-ligand interactions linked to tumor content and immune infiltrations. Our findings demonstrate the value of integrating imaging with transcriptomics to analyze spatial omics data, improving our understanding of tumor heterogeneity and its relevance to personalized and targeted therapies.

Availability: ROICellTrack is publicly available at https://github.com/wanglab1/ROICellTrack.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
Realfreq: Real-time base modification analysis for nanopore sequencing.
Pub Date : 2025-04-07 DOI: 10.1093/bioinformatics/btaf151
Suneth Samarasinghe, Ira Deveson, Hasindu Gamaarachchi

Summary: Nanopore sequencers allow sequencing data to be accessed in real-time. This allows live analysis to be performed while the sequencing is running, reducing the turnaround time of the results. We introduce realfreq, a framework for obtaining real-time base modification frequencies while a nanopore sequencer is in operation. Realfreq calculates and allows access to the real-time base modification frequency results while the sequencer is running. We demonstrate that the data analysis rate with realfreq on a laptop computer can keep up with the output data rate of a nanopore MinION sequencer, while a desktop computer can keep up with a single PromethION 2 solo flowcell.

Availability and implementation: Realfreq is a free and open-source application implemented in the C programming language and shell scripts. The source code and the documentation for realfreq can be found at https://github.com/imsuneth/realfreq. The version used for the manuscript is also available at 10.5281/zenodo.15128668.

Supplementary information: Supplementary data are available at Bioinformatics online.
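
The core bookkeeping behind a running modification frequency can be sketched as follows. The class name, the input format (tuples of contig, position, and a boolean modification call), and the batching are assumptions for illustration; realfreq itself is written in C and its actual data structures and input handling are not shown in the abstract.

```python
from collections import defaultdict


class ModFreqTracker:
    """Keeps per-site counts so frequencies can be queried at any time."""

    def __init__(self):
        # (contig, position) -> [number of modified calls, number of calls]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, calls):
        """Fold a new batch of calls into the running counts."""
        for contig, pos, is_modified in calls:
            entry = self.counts[(contig, pos)]
            entry[0] += int(is_modified)
            entry[1] += 1

    def frequency(self, contig, pos):
        """Modified fraction at a site, or None if the site has no calls yet."""
        n_mod, n_called = self.counts.get((contig, pos), (0, 0))
        return n_mod / n_called if n_called else None


tracker = ModFreqTracker()

# First batch arrives while the run is still in progress.
tracker.update([("chr1", 100, True), ("chr1", 100, False)])
freq_after_batch_1 = tracker.frequency("chr1", 100)

# A later batch refines the estimate for the same site.
tracker.update([("chr1", 100, True), ("chr1", 100, True)])
freq_after_batch_2 = tracker.frequency("chr1", 100)
```

Each update costs time proportional to the batch size only, so the frequencies stay queryable throughout the run without reprocessing earlier reads.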

Citations: 0
CoverM: Read alignment statistics for metagenomics.
Pub Date : 2025-04-07 DOI: 10.1093/bioinformatics/btaf147
Samuel T N Aroney, Rhys J P Newell, Jakob N Nissen, Antonio Pedro Camargo, Gene W Tyson, Ben J Woodcroft

Summary: Genome-centric analysis of metagenomic samples is a powerful method for understanding the function of microbial communities. Calculating read coverage is a central part of analysis, enabling differential coverage binning for recovery of genomes and estimation of microbial community composition. Coverage is determined by processing read alignments to reference sequences of either contigs or genomes. Per-reference coverage is typically calculated in an ad-hoc manner, with each software package providing its own implementation and specific definition of coverage. Here we present CoverM, a unified software package that calculates several coverage statistics for contigs and genomes in an ergonomic and flexible manner. It uses 'Mosdepth arrays' for computational efficiency and avoids unnecessary I/O overhead by calculating coverage statistics from streamed read alignment results.

Availability and implementation: CoverM is free software available at https://github.com/wwood/coverm. CoverM is implemented in Rust, with Python (https://github.com/apcamargo/pycoverm) and Julia (https://github.com/JuliaBinaryWrappers/CoverM_jll.jl) interfaces.
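
As an illustration of the array-based coverage calculation mentioned in the Summary, the sketch below uses a difference array: each alignment increments its start and decrements its end, and a cumulative sum yields per-base depth. Reading 'Mosdepth arrays' as this technique is an assumption, as are the function names and the 0-based half-open interval convention; this is not CoverM's Rust code or its command-line interface.

```python
def per_base_depth(ref_len, alignments):
    """Per-base depth from (start, end) alignments, 0-based and half-open."""
    diff = [0] * (ref_len + 1)
    for start, end in alignments:
        diff[start] += 1  # depth rises by one where an alignment begins
        diff[end] -= 1    # and falls by one just past its last base
    depth = []
    running = 0
    for delta in diff[:ref_len]:
        running += delta
        depth.append(running)
    return depth


def mean_coverage(depth):
    """Average depth over all reference positions."""
    return sum(depth) / len(depth)


def covered_fraction(depth):
    """Fraction of reference positions with depth of at least one."""
    return sum(1 for d in depth if d > 0) / len(depth)


# Toy reference of 10 bases with two overlapping alignments.
depth = per_base_depth(10, [(0, 5), (3, 8)])
mean_cov = mean_coverage(depth)
cov_frac = covered_fraction(depth)
```

Each alignment touches only two array cells regardless of its length, so alignments can be consumed as a stream and the statistics computed in a single pass at the end.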

Citations: 0
Topology-Driven Negative Sampling Enhances Generalizability in Protein-Protein Interaction Prediction.
Pub Date : 2025-04-07 DOI: 10.1093/bioinformatics/btaf148
Ayan Chatterjee, Babak Ravandi, Parham Haddadi, Naomi H Philip, Mario Abdelmessih, William R Mowrey, Piero Ricchiuto, Yupu Liang, Wei Ding, Juan C Mobarec, Tina Eliassi-Rad

Motivation: Unraveling the human interactome to uncover disease-specific patterns and discover drug targets hinges on accurate protein-protein interaction (PPI) predictions. However, challenges persist in machine learning (ML) models due to a scarcity of quality hard negative samples, shortcut learning, and limited generalizability to novel proteins.

Results: In this study, we introduce a novel approach for strategic sampling of protein-protein non-interactions (PPNIs) by leveraging higher-order network characteristics that capture the inherent complementarity-driven mechanisms of PPIs. Next, we introduce UPNA-PPI (Unsupervised Pre-training of Node Attributes tuned for PPI), a high-throughput sequence-to-function ML pipeline, integrating unsupervised pre-training in protein representation learning with Topological PPNI (TPPNI) samples, capable of efficiently screening billions of interactions. By using our TPPNI in training the UPNA-PPI model, we improve PPI prediction generalizability and interpretability, particularly in identifying potential binding site locations on amino acid sequences, strengthening the prioritization of screening assays and facilitating the transferability of ML predictions across protein families and homodimers. UPNA-PPI establishes the foundation for a fundamental negative sampling methodology in graph machine learning by integrating insights from network topology.

Availability and implementation: Code and UPNA-PPI predictions are freely available at https://github.com/alxndgb/UPNA-PPI.

Supplementary information: Supplementary data are available at Bioinformatics online.
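
The abstract does not spell out which higher-order network characteristics are used, so the following is only one possible reading, sketched under stated assumptions: non-interacting pairs are proposed as negative candidates when no path of length three connects them in the known interaction graph. The criterion, the function names, and the toy graph are all illustrative choices and should not be taken as the TPPNI procedure.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) interactions."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def length3_path_count(adj, a, b):
    """Number of paths a - u - v - b for a non-adjacent pair (a, b)."""
    return sum(1 for u in adj[a] for v in adj[b] if v in adj[u])


def candidate_negatives(edges):
    """Non-adjacent pairs with no length-3 path between them (assumed criterion)."""
    adj = build_adjacency(edges)
    negatives = []
    for a, b in combinations(sorted(adj), 2):
        if b in adj[a]:
            continue  # known interaction, never a negative
        if length3_path_count(adj, a, b) == 0:
            negatives.append((a, b))
    return negatives


# Toy interaction network: a simple chain A - B - C - D - E.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
negatives = candidate_negatives(toy_edges)
```

In this toy chain, the pairs (A, D) and (B, E) are excluded because a length-3 path links them, while the remaining non-adjacent pairs are kept. Whether such a rule matches the sampling strategy in the paper would need to be checked against the published code.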

Citations: 0
miss-SNF: a multimodal patient similarity network integration approach to handle completely missing data sources.
Pub Date : 2025-04-04 DOI: 10.1093/bioinformatics/btaf150
Jessica Gliozzo, Mauricio A Soto Gomez, Arturo Bonometti, Alex Patak, Elena Casiraghi, Giorgio Valentini

Motivation: Precision medicine leverages patient-specific multimodal data to improve prevention, diagnosis, prognosis and treatment of diseases. Advancing precision medicine requires the non-trivial integration of complex, heterogeneous and potentially high-dimensional data sources, such as multi-omics and clinical data. In the literature, several approaches have been proposed to manage missing data, but these are usually limited to the recovery of subsets of features for a subset of patients. A largely overlooked problem is the integration of multiple sources of data when one or more of them are completely missing for a subset of patients, a relatively common condition in clinical practice.

Results: We propose miss-Similarity Network Fusion (miss-SNF), a novel general-purpose data integration approach designed to manage completely missing data in the context of patient similarity networks. Miss-SNF integrates incomplete unimodal patient similarity networks by leveraging a non-linear message-passing strategy borrowed from the SNF algorithm. Miss-SNF is able to recover missing patient similarities and is "task agnostic", in the sense that it can integrate partial data for both unsupervised and supervised prediction tasks. Experimental analyses on nine cancer datasets from The Cancer Genome Atlas (TCGA) demonstrate that miss-SNF achieves state-of-the-art results in recovering similarities and in identifying patient subgroups enriched in clinically relevant variables and having differential survival. Moreover, amputation experiments show that miss-SNF supervised prediction of cancer clinical outcomes and Alzheimer's disease diagnosis with completely missing data achieves results comparable to those obtained when all the data are available.

Availability and implementation: miss-SNF code, implemented in R, is available at https://github.com/AnacletoLAB/missSNF.

Supplementary information: Supplementary information is available at Bioinformatics online.

引用次数: 0
AlertGS: Determining alerts for gene sets.
Pub Date : 2025-04-03 DOI: 10.1093/bioinformatics/btaf133
Franziska Kappenberg, Jörg Rahnenführer

Motivation: A typical goal in gene expression studies is identifying gene sets enriched with significant genes. Measuring gene expression at several concentrations or time points allows the concentration/time-response relationship to be modeled for each gene, and a gene-wise alert to be estimated from it. In this work, an approach is proposed to transfer the concept of alerts from single genes to gene sets, yielding a global significance statement and the respective concentration or time at which the first enrichment of the gene set can be observed. The methodology is based on a Kolmogorov-Smirnov type test statistic for each gene set.

Results: Simulations show that a majority of the gene sets with a true signal can be identified, especially when the number of such sets is low. The false positive rate can be controlled by subsequent decorrelation approaches. Overall, the true gene set-wise alerts are rarely overestimated and tend instead to be underestimated.

Availability and implementation: The code needed to reproduce the simulations and apply the AlertGS methodology is available at the GitHub repository https://github.com/FKappenberg/AlertGS.

Supplementary information: Supplementary material is available online.
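
The abstract above names a Kolmogorov-Smirnov type test statistic per gene set but does not spell it out. As a rough illustration only, the sketch below computes a generic running-sum Kolmogorov-Smirnov enrichment statistic over a ranked gene list. It is not taken from the AlertGS code: the function name `ks_enrichment`, the ranking of genes, and the signed-deviation convention are all assumptions made for this example.

```python
def ks_enrichment(ranked_genes, gene_set):
    """Generic KS-type enrichment statistic (illustrative, not the AlertGS statistic).

    Walks down the ranked list and tracks the running fraction of in-set genes
    minus the running fraction of out-of-set genes. Returns the signed deviation
    with the largest absolute value.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(in_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    hit_frac = 0.0   # empirical CDF of in-set genes along the ranking
    miss_frac = 0.0  # empirical CDF of out-of-set genes along the ranking
    best = 0.0
    for is_hit in in_set:
        if is_hit:
            hit_frac += 1.0 / n_hit
        else:
            miss_frac += 1.0 / n_miss
        deviation = hit_frac - miss_frac
        if abs(deviation) > abs(best):
            best = deviation
    return best


# Hypothetical ranking, e.g. genes ordered by increasing gene-wise alert.
ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
early_set_score = ks_enrichment(ranking, {"g1", "g2"})
late_set_score = ks_enrichment(ranking, {"g5", "g6"})
```

A strongly positive value indicates that the set's genes cluster near the top of the ranking, a strongly negative value that they cluster near the bottom. Assessing significance (for example by permutation) and mapping the statistic to a concentration or time point are left out of this sketch.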

Demixer: A probabilistic generative model to delineate different strains of a microbial species in a mixed infection sample.
Pub Date : 2025-04-03 DOI: 10.1093/bioinformatics/btaf139
V P Brintha, Manikandan Narayanan

Motivation: Multi-drug resistant or hetero-resistant Tuberculosis (TB) hinders the successful treatment of TB. Hetero-resistant TB occurs when multiple strains of the TB-causing bacterium with varying degrees of drug susceptibility are present in an individual. Existing studies predicting the proportion and identity of strains in a mixed infection sample rely on a reference database of known strains. A main challenge then is to identify de novo strains not present in the reference database, while quantifying the proportion of known strains.

Results: We present Demixer, a probabilistic generative model that uses a combination of reference-based and reference-free techniques to delineate mixed infection strains in whole genome sequencing (WGS) data. Demixer extends a topic model widely used in text mining to represent known mutations and discover novel ones. Parallelization and other heuristics enabled Demixer to process large datasets like CRyPTIC (Comprehensive Resistance Prediction for Tuberculosis: an International Consortium). In both synthetic and experimental benchmark datasets, our proposed method precisely detected the identity (e.g., 91.67% accuracy on the experimental in vitro dataset) as well as the proportions of the mixed strains. In real-world applications, Demixer revealed novel high-confidence mixed infections (101 out of 1,963 Malawi samples analyzed), and new insights into the global frequency of mixed infection (2% at the most stringent threshold in the CRyPTIC dataset) and its significant association with drug resistance. Our approach is generalizable and hence applicable to any bacterial and viral WGS data.

Availability: All code relevant to Demixer is available at https://github.com/BIRDSgroup/Demixer.

Supplementary information: The Supplementary Information PDF file (containing Supplementary Methods/Algorithms/Tables/Figures) and other Supplementary Data Files are available at this link: https://drive.google.com/drive/folders/1P_OX_MbZ6QFN9Amyl2eGMBr1ySY6yNWu?usp=drive_link. The supplementary data, code and VCF files (of in vitro, synthetic and real-world datasets) have also been archived at Zenodo (doi: 10.5281/zenodo.15074330).
