Motivation: Identifying effective therapeutic targets poses a challenge in drug discovery, especially for uncharacterized diseases without known therapeutic targets (e.g. rare diseases, intractable diseases).
Results: This study presents a novel machine learning approach employing multimodal vector-quantized variational autoencoders (VQ-VAEs) for predicting therapeutic target molecules across diseases. To address the lack of known therapeutic target-disease associations, we incorporate the information on uncharacterized diseases without known targets or uncharacterized proteins without known indications (applicable diseases) in the semi-supervised learning (SSL) framework. The method integrates disease-specific and protein perturbation profiles with genetic perturbations (e.g., gene knockdowns and gene overexpressions) at the transcriptome level. Cross-cell representation learning, facilitated by VQ-VAEs, was performed to extract informative features from protein perturbation profiles across diverse human cell types. Concurrently, cross-disease representation learning was performed, leveraging VQ-VAE, to extract informative features reflecting disease states from disease-specific profiles. The model's applicability to uncharacterized diseases or proteins is enhanced by considering consistency between disease-specific and patient-specific signatures. The efficacy of the method is demonstrated across three practical scenarios for 79 diseases: target repositioning for target-disease pairs, new target prediction for uncharacterized diseases, and new indication prediction for uncharacterized proteins. This method is expected to be valuable for identifying therapeutic targets across various diseases.
Availability and implementation: Code: github.com/YamanishiLab/SSL-VQ & Data: 10.5281/zenodo.14644837.
Supplementary information: Supplementary data are available at Bioinformatics online.
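The vector-quantization step that gives VQ-VAEs their name can be illustrated with a minimal sketch. It assumes a tiny hand-written codebook and a two-dimensional encoder output; the names `quantize`, `codebook`, and `encoder_output` are hypothetical and are not taken from the SSL-VQ code. It shows only the nearest-codebook lookup, not the training losses or the multimodal architecture described in the abstract.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`.

    Closeness is squared Euclidean distance, the usual choice in VQ-VAEs.
    """
    best_index = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )
    return best_index, codebook[best_index]


# Toy values chosen for illustration only (assumption, not data from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_output = [0.9, 1.2]

index, code = quantize(encoder_output, codebook)
print(index, code)
```

In a trained model the codebook entries are learned, and the discrete index is what downstream components consume.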
Title: SSL-VQ: Vector-quantized variational autoencoders for semi-supervised prediction of therapeutic targets across diverse diseases.
Authors: Satoko Namba, Chen Li, Noriko Otani, Yoshihiro Yamanishi
Pub Date: 2025-01-28 DOI: 10.1093/bioinformatics/btaf039
Journal: Bioinformatics (Oxford, England)
Pub Date: 2025-01-28 DOI: 10.1093/bioinformatics/btaf022
Meng Wang, Wei Fan, Tianrui Wu, Min Li
Motivation: T-cell receptors (TCRs) elicit and mediate the adaptive immune response by recognizing antigenic peptides, a process pivotal for cancer immunotherapy, vaccine design, and autoimmune disease management. Understanding the intricate binding patterns between TCRs and peptides is critical for advancing these clinical applications. While several computational tools have been developed, they neglect the directional semantics inherent in sequence data, which are essential for accurately characterizing TCR-peptide interactions.
Results: To address this gap, we develop TPepRet, an innovative model that integrates subsequence mining with semantic integration capabilities. TPepRet combines the strengths of the Bidirectional Gated Recurrent Unit (BiGRU) network for capturing bidirectional sequence dependencies with the Large Language Model framework to analyze subsequences and global sequences comprehensively, which enables TPepRet to accurately decipher the semantic binding relationship between TCRs and peptides. We have evaluated TPepRet on a range of challenging scenarios, including performance benchmarking against other tools using diverse datasets, analysis of peptide binding preferences, characterization of T-cell clonal expansion, identification of true binders in complex environments, assessment of key binding sites through alanine scanning, validation against expression rates from large-scale datasets, and the ability to screen SARS-CoV-2 TCRs. The comprehensive results suggest that TPepRet outperforms existing tools. We believe TPepRet will become an effective tool for understanding TCR-peptide binding in clinical treatment.
Availability and implementation: The source code can be obtained from https://github.com/CSUBioGroup/TPepRet.git.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: TPepRet: a deep learning model for characterizing T cell receptors-antigen binding patterns.
Journal: Bioinformatics (Oxford, England)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11784750/pdf/
Pub Date: 2025-01-25 DOI: 10.1093/bioinformatics/btaf045
Yan Yan, Beatriz Jiménez, Michael T Judge, Toby Athersuch, Maria De Iorio, Timothy M D Ebbels
Metabolomics extensively utilizes Nuclear Magnetic Resonance (NMR) spectroscopy due to its excellent reproducibility and high throughput. Both one-dimensional (1D) and two-dimensional (2D) NMR spectra provide crucial information for metabolite annotation and quantification, yet present complex overlapping patterns which may require sophisticated machine learning algorithms to decipher. Unfortunately, the limited availability of labeled spectra can hamper application of machine learning, especially deep learning algorithms which require large amounts of labeled data. In this context, simulation of spectral data becomes a tractable solution for algorithm development. Here, we introduce MetAssimulo 2.0, a comprehensive upgrade of the MetAssimulo 1.0 metabolomic 1H NMR simulation tool, reimplemented as a Python-based web application. Where MetAssimulo 1.0 only simulated 1D 1H spectra of human urine, MetAssimulo 2.0 expands functionality to urine, blood, and cerebrospinal fluid (CSF), enhancing the realism of blood spectra by incorporating a broad protein background. This enhancement enables a closer approximation to real blood spectra, achieving a Pearson correlation of approximately 0.82. Moreover, this tool now includes simulation capabilities for 2D J-resolved (J-Res) and Correlation Spectroscopy (COSY) spectra, significantly broadening its utility in complex mixture analysis. MetAssimulo 2.0 simulates both single spectra and groups of spectra with discrete (case-control, e.g. heart transplant vs healthy) and continuous (e.g. BMI) outcomes, and includes inter-metabolite correlations. It thus supports a range of experimental designs and can demonstrate associations between metabolite profiles and biomedical responses. By enhancing NMR spectral simulations, MetAssimulo 2.0 is well positioned to support and enhance research at the intersection of deep learning and metabolomics.
Availability and implementation: The code and the detailed instructions/tutorial for MetAssimulo 2.0 are available at https://github.com/yanyan5420/MetAssimulo_2.git. The relevant NMR spectra for metabolites are deposited in MetaboLights with accession number MTBLS12081.
Supplementary information: Supplementary data are available at Bioinformatics online.
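The Pearson correlation quoted above as a measure of agreement between simulated and real spectra can be computed as in the following sketch. It assumes the two spectra have already been sampled on the same chemical-shift grid; the abstract does not say how spectra were aligned or binned, and the intensity values below are invented for illustration. The function name `pearson` is hypothetical and is not part of MetAssimulo.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy intensities on a shared grid (assumed values, not real NMR data).
simulated = [0.0, 0.2, 1.0, 0.3, 0.1, 0.6, 0.1]
measured = [0.1, 0.3, 0.9, 0.4, 0.2, 0.5, 0.2]

r = pearson(simulated, measured)
print(round(r, 3))
```

Because Pearson correlation is invariant to scaling and offset, it compares spectral shape rather than absolute intensity.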
Title: MetAssimulo 2.0: a web app for simulating realistic 1D & 2D Metabolomic 1H NMR spectra.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2025-01-24 DOI: 10.1093/bioinformatics/btaf037
Tatiana A Gurbich, Martin Beracochea, Nishadi H De Silva, Robert D Finn
Summary: In recent years there has been a surge in prokaryotic genome assemblies, coming from both isolated organisms and environmental samples. These assemblies often include novel species that are poorly represented in reference databases, creating a need for a tool that can annotate both well-described and novel taxa and can run at scale. Here, we present mettannotator, a comprehensive, scalable Nextflow pipeline for prokaryotic genome annotation that identifies coding and non-coding regions, predicts protein functions, including antimicrobial resistance, and delineates gene clusters. The pipeline summarises the results of these tools in a GFF (General Feature Format) file that can be easily utilised in downstream analysis or visualised using common genome browsers. We show how it works on 200 genomes from 29 prokaryotic phyla, including isolate genomes and known and novel metagenome-assembled genomes, and present metrics on its performance in comparison to other tools.
Availability and implementation: The pipeline is written in Nextflow and Python and published under an open source Apache 2.0 licence. Instructions and source code can be accessed at https://github.com/EBI-Metagenomics/mettannotator. The pipeline is also available on WorkflowHub: https://workflowhub.eu/workflows/1069.
Supplementary information: Supplementary data are available at Bioinformatics online.
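To illustrate how a GFF file of the kind mentioned above can be consumed downstream, here is a minimal sketch of a parser for a single GFF3 feature line. It relies only on the standard nine tab-separated GFF3 columns; the example line, the source name `example_tool`, and the attribute values are made up and do not come from mettannotator output. Comment and directive lines are skipped, and an embedded FASTA section is not handled.

```python
from urllib.parse import unquote


def parse_gff_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments or blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    columns = line.split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, raw_attrs = columns
    attributes = {}
    for item in raw_attrs.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[unquote(key)] = unquote(value)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


# Hypothetical feature line for illustration only.
example = "contig_1\texample_tool\tCDS\t100\t400\t.\t+\t0\tID=gene_0001;product=hypothetical protein"
feature = parse_gff_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["product"])
```

For production use, an established GFF library is likely a better choice than a hand-written parser.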
Title: mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2025-01-12 DOI: 10.1093/bioinformatics/btaf014
Yifan Jiang, Disen Liao, Qiyun Zhu, Yang Young Lu
Motivation: Understanding the associations between traits and microbial composition is a fundamental objective in microbiome research. Recently, researchers have turned to machine learning (ML) models to achieve this goal with promising results. However, the effectiveness of advanced ML models is often limited by the unique characteristics of microbiome data, which are typically high-dimensional, compositional, and imbalanced. These characteristics can hinder the models' ability to fully explore the relationships among taxa in predictive analyses. To address this challenge, data augmentation has become crucial. It involves generating synthetic samples with artificial labels based on existing data and incorporating these samples into the training set to improve ML model performance.
Results: Here we propose PhyloMix, a novel data augmentation method specifically designed for microbiome data to enhance predictive analyses. PhyloMix leverages the phylogenetic relationships among microbiome taxa as an informative prior to guide the generation of synthetic microbial samples. Leveraging phylogeny, PhyloMix creates new samples by removing a subtree from one sample and combining it with the corresponding subtree from another sample. Notably, PhyloMix is designed to address the compositional nature of microbiome data, effectively handling both raw counts and relative abundances. This approach introduces sufficient diversity into the augmented samples, leading to improved predictive performance. We empirically evaluated PhyloMix on six real microbiome datasets across five commonly used ML models. PhyloMix significantly outperforms distinct baseline methods including sample-mixing-based data augmentation techniques like vanilla mixup and compositional cutmix, as well as the phylogeny-based method TADA. We also demonstrated the wide applicability of PhyloMix in both supervised learning and contrastive representation learning.
Availability: The Apache licensed source code is available at (https://github.com/batmen-lab/phylomix).
Supplementary information: Supplementary data are available at Bioinformatics.
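The subtree-exchange idea described in the Results can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: a subtree is represented simply as the set of its leaf taxa, both samples are dictionaries over the same taxa, and the mixed sample is renormalised to relative abundances. How PhyloMix selects subtrees, assigns labels to synthetic samples, and preserves compositionality is not specified in the abstract. The function name `mix_subtree` and all values are hypothetical.

```python
def mix_subtree(sample_a, sample_b, subtree_taxa):
    """Replace sample_a's abundances for taxa in `subtree_taxa` with sample_b's.

    The result is rescaled so that abundances sum to 1 (relative abundance).
    Both samples are assumed to be defined over the same set of taxa.
    """
    mixed = {
        taxon: (sample_b[taxon] if taxon in subtree_taxa else value)
        for taxon, value in sample_a.items()
    }
    total = sum(mixed.values())
    if total == 0:
        raise ValueError("mixed sample has zero total abundance")
    return {taxon: value / total for taxon, value in mixed.items()}


# Toy relative abundances over three taxa (assumed values).
sample_a = {"taxon_1": 0.5, "taxon_2": 0.3, "taxon_3": 0.2}
sample_b = {"taxon_1": 0.1, "taxon_2": 0.1, "taxon_3": 0.8}

# Pretend taxon_3 is the only leaf under the chosen subtree.
augmented = mix_subtree(sample_a, sample_b, {"taxon_3"})
print(augmented)
```

The renormalisation step is one simple way to keep the output compositional; other choices are possible.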
Title: PhyloMix: Enhancing microbiome-trait association prediction through phylogeny-mixing augmentation.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2025-01-11 DOI: 10.1093/bioinformatics/btaf011
Sizhe Liu, Yuchen Liu, Haofeng Xu, Jun Xia, Stan Z Li
Motivation: Drug-target interaction (DTI) prediction is crucial for drug discovery, significantly reducing costs and time in experimental searches across vast drug compound spaces. While deep learning has advanced DTI prediction accuracy, challenges remain: (i) existing methods often lack generalizability, with performance dropping significantly on unseen proteins and cross-domain settings; (ii) current molecular relational learning often overlooks subpocket-level interactions, which are vital for a detailed understanding of binding sites.
Results: We introduce SP-DTI, a subpocket-informed transformer model designed to address these challenges through: (i) detailed subpocket analysis using the Cavity Identification and Analysis Routine (CAVIAR) for interaction modeling at both global and local levels, and (ii) integration of pre-trained language models into graph neural networks to encode drugs and proteins, enhancing generalizability to unlabeled data. Benchmark evaluations show that SP-DTI consistently outperforms state-of-the-art models, achieving a ROC-AUC of 0.873 in unseen protein settings, an 11% improvement over the best baseline.
Availability and implementation: The model scripts are available at https://github.com/Steven51516/SP-DTI.
Contact and supplementary information: For correspondence, please contact xiajun@westlake.edu.cn. Supplementary data are available online at Bioinformatics.
Title: SP-DTI: Subpocket-Informed Transformer for Drug-Target Interaction Prediction.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2025-01-09 DOI: 10.1093/bioinformatics/btaf010
Jin Sub Lee, Philip M Kim
Motivation: Accurate prediction of protein side-chain conformations is necessary to understand protein folding and protein-protein interactions, and to facilitate de novo protein design.
Results: Here we apply torsional flow matching and equivariant graph attention to develop FlowPacker, a fast and performant model to predict protein side-chain conformations conditioned on the protein sequence and backbone. We show that FlowPacker outperforms previous state-of-the-art baselines across most metrics with improved runtime. We further show that FlowPacker can be used to inpaint missing side-chain coordinates and also for multimeric targets, and exhibits strong performance on a test set of antibody-antigen complexes.
Availability: Code is available at https://gitlab.com/mjslee0921/flowpacker.
Supplementary information: Supplementary data are available at Bioinformatics online.
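Side-chain torsion angles are periodic, so methods that operate in torsion space generally need to move between two angles along the shorter arc of the circle. The sketch below shows that building block only: wrapping an angle into [-pi, pi) and interpolating between two torsions along the shortest path. It is an assumption-laden illustration of working on the torus, not FlowPacker's actual model or training objective; the function names `wrap` and `interpolate_torsion` are hypothetical.

```python
import math


def wrap(angle):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def interpolate_torsion(chi_start, chi_end, t):
    """Point at fraction t (0..1) along the shortest arc from chi_start to chi_end."""
    shortest_difference = wrap(chi_end - chi_start)
    return wrap(chi_start + t * shortest_difference)


# 170 degrees and -170 degrees are only 20 degrees apart on the circle.
chi_start = math.radians(170.0)
chi_end = math.radians(-170.0)
midpoint = interpolate_torsion(chi_start, chi_end, 0.5)
print(math.degrees(wrap(chi_end - chi_start)))
```

A naive linear interpolation between 170 and -170 degrees would pass through 0 degrees, which is the long way round; the wrapped version passes through 180 degrees instead.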
Title: FlowPacker: Protein side-chain packing with torsional flow matching.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2024-12-26 DOI: 10.1093/bioinformatics/btae719
Tam C Tran, David J Schlueter, Chenjie Zeng, Huan Mo, Robert J Carroll, Joshua C Denny
Summary: With the rapid growth of genetic data linked to electronic health record (EHR) data in huge cohorts, large-scale phenome-wide association studies (PheWAS) have become powerful discovery tools in biomedical research. PheWAS is an analysis method for studying phenotype associations using longitudinal EHR data. Previous PheWAS packages were developed mostly with smaller datasets and with earlier PheWAS approaches. PheTK was designed to simplify analysis and efficiently handle biobank-scale data. PheTK uses multithreading and supports a full PheWAS workflow, including extraction of data from OMOP databases and Hail matrix tables as well as PheWAS analysis for both phecode version 1.2 and phecodeX. Benchmarking results showed that PheTK took 64% less time than the R PheWAS package to complete the same workflow. PheTK can be run locally or on cloud platforms such as the All of Us Researcher Workbench (All of Us) or the UK Biobank (UKB) Research Analysis Platform (RAP).
Availability and implementation: The PheTK package is freely available on the Python Package Index, on GitHub under the GNU General Public License (GPL-3) at https://github.com/nhgritctran/PheTK, and on Zenodo, DOI 10.5281/zenodo.14217954, at https://doi.org/10.5281/zenodo.14217954. PheTK is implemented in Python and is platform independent.
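The core PheWAS loop described in the summary (one genotype tested against many phecodes, spread across threads) can be sketched as follows. This is a minimal illustration under stated assumptions, not PheTK's API: the function names `assoc_test` and `phewas`, the phecode labels `code_A` and `code_B`, and the toy data are all hypothetical. It also swaps in a plain chi-square test on a 2x2 table, whereas PheWAS analyses typically fit a regression model with covariates such as age and sex.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def assoc_test(carrier, case):
    """Chi-square test (1 df) on the 2x2 table of genotype vs. phenotype.

    Simplification: no covariates, unlike a regression-based PheWAS.
    """
    a = b = c = d = 0
    for g, y in zip(carrier, case):
        if g and y:
            a += 1  # carrier, case
        elif g:
            b += 1  # carrier, control
        elif y:
            c += 1  # non-carrier, case
        else:
            d += 1  # non-carrier, control
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0, 1.0  # no variation in genotype or phenotype
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / margins
    p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p


def phewas(carrier, phenotypes, workers=4):
    """Test every phecode against one genotype, then apply Bonferroni."""
    codes = list(phenotypes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stats = list(pool.map(lambda code: assoc_test(carrier, phenotypes[code]), codes))
    threshold = 0.05 / max(len(codes), 1)
    return {
        code: {"chi2": chi2, "p": p, "significant": p < threshold}
        for code, (chi2, p) in zip(codes, stats)
    }


# Toy cohort of eight participants (hypothetical data).
carrier = [1, 1, 1, 1, 0, 0, 0, 0]
phenotypes = {
    "code_A": [1, 1, 1, 1, 0, 0, 0, 0],  # tracks the genotype exactly
    "code_B": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the genotype
}
results = phewas(carrier, phenotypes)
```

The thread pool here only shows the shape of a per-phecode parallel loop; in standard CPython, threads do not speed up pure-Python arithmetic like this, so a real implementation would rely on compiled numerical routines inside each task.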
Title: PheWAS analysis on large-scale biobank data with PheTK. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11709244/pdf/
Pub Date: 2024-12-26 DOI: 10.1093/bioinformatics/btae755
Sebastian Vorbrugg, Ilja Bezrukov, Zhigui Bao, Detlef Weigel
Motivation: Genome graphs are powerful data structures for representing the genetic diversity within populations and can help identify genomic variations that traditional linear references miss, but their complexity and size make them challenging to analyze. We sought to develop a genome graph analysis tool that makes these analyses more accessible by addressing the limitations of existing tools. Specifically, we improve scalability and user-friendliness, and we provide many new statistics tailored to variation graphs for graph evaluation, including sample-specific features.
Results: We developed an efficient, comprehensive, and integrated tool, gretl, to analyze genome graphs and gain insights into their structure and composition by providing a wide range of statistics. gretl can be utilized to evaluate different graphs, compare the output of graph construction pipelines with different parameters, as well as perform an in-depth analysis of individual graphs, including sample-specific analysis. With the assistance of gretl, novel patterns of genetic variation and potential regions of interest can be identified, for later, more detailed inspection. We demonstrate that gretl outperforms other tools in terms of speed, particularly for larger genome graphs.
Availability and implementation: Commented Rust source code and documentation are available under the MIT license at https://github.com/MoinSebi/gretl, together with Python scripts and step-by-step usage examples. The package is available on Bioconda for easy installation.
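To make "statistics tailored to variation graphs" concrete, the sketch below computes a few whole-graph and per-path (sample-specific) numbers from a small GFA 1 text. It is an illustration only and does not reproduce gretl's statistics, code, or command line (gretl itself is written in Rust); the function name `graph_stats`, the chosen statistics, and the toy graph are assumptions. It also assumes every segment line carries its sequence inline rather than `*`.

```python
from collections import Counter


def graph_stats(gfa_text):
    """Basic statistics for a GFA 1 graph given as a string.

    Assumes tab-separated S (segment), L (link) and P (path) lines,
    with sequences stored inline on the S lines.
    """
    seg_len = {}          # segment name -> sequence length
    degree = Counter()    # segment name -> number of link endpoints
    n_links = 0
    path_nodes = {}       # path name -> list of segment names
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            seg_len[fields[1]] = len(fields[2])
        elif fields[0] == "L":
            n_links += 1
            degree[fields[1]] += 1
            degree[fields[3]] += 1
        elif fields[0] == "P":
            # Path steps look like "1+,2-"; drop the trailing orientation sign.
            path_nodes[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    n = len(seg_len)
    return {
        "nodes": n,
        "edges": n_links,
        "total_bp": sum(seg_len.values()),
        "mean_degree": sum(degree.values()) / n if n else 0.0,
        # Sample-specific statistic: bases traversed by each path.
        "path_bp": {
            name: sum(seg_len[s] for s in nodes)
            for name, nodes in path_nodes.items()
        },
    }


# A four-node bubble: two samples diverge at nodes 2 and 3 (toy data).
example = "\n".join([
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tT",
    "S\t4\tCC",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tsampleA\t1+,2+,4+\t*",
    "P\tsampleB\t1+,3+,4+\t*",
])
stats = graph_stats(example)
```

Per-path totals such as `path_bp` are one simple way to compare how much of the graph each sample traverses; the line-by-line parse keeps memory proportional to the number of segments and paths rather than to the file size.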
Title: Gretl-variation GRaph Evaluation TooLkit. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11729725/pdf/
Pub Date: 2024-12-26 DOI: 10.1093/bioinformatics/btae684
Gang Wen, Limin Li
Motivation: High-throughput techniques have produced a large amount of high-dimensional multi-omics data, which makes it promising to predict patient survival outcomes more accurately. Recent work has shown the superiority of multi-omics data in survival analysis. However, it remains challenging to integrate multi-omics data to solve the few-shot survival prediction problem, in which only a few training samples are available, especially for rare cancers.
Results: In this work, we propose a meta-learning framework for multi-omics few-shot survival analysis, namely MMOSurv, which enables learning an effective multi-omics survival prediction model from very few training samples of a specific cancer type, using meta-knowledge across tasks from relevant cancer types. By assuming a deep Cox survival model with multiple omics, MMOSurv first learns an adaptable parameter initialization for the multi-omics survival model from abundant data of relevant cancers, and then adapts the parameters quickly and efficiently for the target cancer task with very few training samples. Our experiments on eleven cancer types in The Cancer Genome Atlas datasets show that, compared to single-omics meta-learning methods, MMOSurv can better utilize the meta-information of similarities and relationships between different omics data from relevant cancer datasets to improve survival prediction of the target cancer with very few multi-omics training samples. Furthermore, MMOSurv achieves better prediction performance than other state-of-the-art strategies such as multitask learning and pretraining.
Availability and implementation: MMOSurv is freely available at https://github.com/LiminLi-xjtu/MMOSurv.
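A Cox survival model, deep or linear, is usually trained by minimizing the negative partial log-likelihood of its predicted risk scores. The sketch below shows that loss in NumPy as a generic illustration; it is not taken from the MMOSurv code, the function name `cox_neg_log_partial_likelihood` is hypothetical, and the meta-learning steps (learning the initialization, then adapting on the target cancer) are not shown.

```python
import numpy as np


def cox_neg_log_partial_likelihood(risk, time, event):
    """Mean negative Cox partial log-likelihood over observed events.

    risk  : predicted log-risk score per patient (e.g. a network output)
    time  : follow-up time per patient
    event : 1 if the event was observed, 0 if the patient was censored
    Tied times are handled Breslow-style (each tie sees the full risk set).
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    # at_risk[i, j] is True when patient j is still under observation
    # at patient i's time, i.e. j belongs to i's risk set.
    at_risk = time[None, :] >= time[:, None]
    # Numerically stable log-sum-exp of the risk scores over each risk set.
    m = risk.max()
    log_denominator = m + np.log((at_risk * np.exp(risk - m)).sum(axis=1))
    # Only patients with an observed event contribute a likelihood term.
    log_lik = event * (risk - log_denominator)
    return -log_lik.sum() / max(event.sum(), 1.0)


# Three patients who die at times 1, 2 and 3 (toy data).
time = [1.0, 2.0, 3.0]
event = [1, 1, 1]
# Higher risk assigned to earlier deaths should give the lower loss.
concordant = cox_neg_log_partial_likelihood([2.0, 1.0, 0.0], time, event)
discordant = cox_neg_log_partial_likelihood([0.0, 1.0, 2.0], time, event)
uninformative = cox_neg_log_partial_likelihood([0.0, 0.0, 0.0], time, event)
```

The pairwise risk-set mask costs O(n^2) memory, which is acceptable in a few-shot setting with a handful of patients; larger cohorts are usually handled by sorting on time and using a cumulative sum instead.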
Title: MMOSurv: meta-learning for few-shot survival analysis with multi-omics data. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11673192/pdf/