Pub Date: 2024-04-22 DOI: 10.1093/bioinformatics/btae271
Zhecheng Zhou, Qingquan Liao, Jinhang Wei, Linlin Zhuo, Xiaonan Wu, Xiangzheng Fu, Quan Zou
MOTIVATION Accurate inference of potential drug-protein interactions (DPIs) aids in understanding drug mechanisms and developing novel treatments. Existing deep learning models, however, struggle to represent nodes accurately in DPI prediction, which limits their performance. RESULTS We propose a new computational framework that integrates global and local features of nodes in the drug-protein bipartite graph for efficient DPI inference. First, we employ pre-trained models to acquire fundamental knowledge of drugs and proteins and to determine their initial features. Next, the MinHash and HyperLogLog algorithms are used to estimate the similarity and set cardinality between drug and protein subgraphs, which serve as their local features. Then, an energy-constrained diffusion mechanism is integrated into the transformer architecture to capture interdependencies between nodes in the drug-protein bipartite graph and extract their global features. Finally, we fuse the local and global features of nodes and employ multi-layer perceptrons (MLPs) to predict the likelihood of potential DPIs. A comprehensive and precise node representation guarantees efficient prediction of unknown DPIs by the model. Various experiments validate the accuracy and reliability of our model, with molecular docking results revealing its capability to identify potential DPIs not present in existing databases. This approach is expected to offer valuable insights for furthering drug repurposing and personalized medicine research. AVAILABILITY AND IMPLEMENTATION Our code and data are accessible at: https://github.com/ZZCrazy00/DPI. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Title: Revisiting Drug-Protein Interaction Prediction: A Novel Global-Local Perspective. Journal: Bioinformatics
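The MinHash step named in the abstract above (btae271) can be illustrated with a small sketch. It only shows how MinHash estimates the Jaccard similarity of two neighbour sets; it is not the authors' implementation, it leaves out HyperLogLog, and the hash function, the number of hashes, and the node identifiers are all assumptions made for the example.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed (assumed scheme)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: set, num_hashes: int = 256) -> list:
    """One minimum per seeded hash function; the list is the MinHash signature."""
    if not items:
        raise ValueError("MinHash is undefined for an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree estimates the Jaccard index."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


# Hypothetical neighbour sets of two subgraphs (identifiers are made up).
neighbours_a = {f"P{i}" for i in range(0, 60)}
neighbours_b = {f"P{i}" for i in range(30, 90)}

estimate = estimate_jaccard(minhash_signature(neighbours_a),
                            minhash_signature(neighbours_b))
print("exact:", exact_jaccard(neighbours_a, neighbours_b), "estimate:", estimate)
```

The estimate tightens as `num_hashes` grows, at the cost of a longer signature per node.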
Pub Date: 2024-04-18 DOI: 10.1093/bioinformatics/btae269
Hongzhen Ding, Xue Li, Peifu Han, Xu Tian, Fengrui Jing, Shuang Wang, Tao Song, Hanjiao Fu, Na Kang
MOTIVATION Protein-protein interaction sites (PPIS) are crucial for deciphering protein action mechanisms and for related medical research, making their identification a key issue in protein action research. Recent studies have shown that graph neural networks achieve outstanding performance in predicting PPIS. However, these studies often neglect the modeling of information at different scales in the graph and the symmetry of protein molecules in three-dimensional space. RESULTS To address this gap, this paper proposes MEG-PPIS, a PPIS prediction method based on multi-scale graph information and an E(n) equivariant graph neural network (EGNN). MEG-PPIS has two channels: the original graph and the subgraph obtained by graph pooling. The model iteratively updates the features of the original graph and the subgraph through a weight-sharing EGNN. A max-pooling operation then aggregates the updated features of the original graph and subgraph. Finally, the model feeds node features into the prediction layer to obtain prediction results. Comparative assessments against other methods on benchmark datasets show that MEG-PPIS achieves the best performance across all evaluation metrics and has the fastest runtime. Furthermore, specific case studies demonstrate that our method predicts more true positive and true negative sites than the current best method, indicating that our model achieves better performance in the PPIS prediction task. AVAILABILITY AND IMPLEMENTATION The data and code are available at https://github.com/dhz234/MEG-PPIS.git.
Title: MEG-PPIS: a fast protein-protein interaction site prediction method based on multi-scale graph information and equivariant graph neural network. Journal: Bioinformatics
Pub Date: 2024-04-18 DOI: 10.1093/bioinformatics/btae273
Hubert Sokołowski, M. Czajkowski, Anna Czajkowska, K. Jurczuk, M. Kretowski
MOTIVATION ITree is an intuitive web tool for the manual, semi-automatic, and automatic induction of decision trees. It enables interactive modifications of tree structures and incorporates Relative Expression Analysis for detecting complex patterns in high-throughput molecular data. This makes ITree a versatile tool for both research and education in biomedical data analysis. RESULTS The tool allows users to instantly see the effects of modifications on decision trees, with updates to predictions and statistics displayed in real time, facilitating a deeper understanding of data classification processes. AVAILABILITY AND IMPLEMENTATION Available online at https://itree.wi.pb.edu.pl. Source code and documentation are hosted on GitHub at https://github.com/hsokolowski/iTree. SUPPLEMENTARY INFORMATION Additional resources are provided to enhance user experience and support.
Title: ITree: a user-driven tool for interactive decision-making with classification trees. Journal: Bioinformatics
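The Relative Expression Analysis mentioned in the ITree abstract above can be pictured as a split that compares two features with each other instead of comparing one feature with a threshold. The sketch below assumes the simplest form of that idea (a single gene pair, two classes); the function name, the data, and the scoring rule are hypothetical and are not taken from ITree.

```python
from itertools import combinations


def best_pair_split(samples, labels):
    """Return (accuracy, gene_a, gene_b) for the pair whose ordering
    gene_a < gene_b best separates two classes (labels are 0 or 1)."""
    genes = sorted(samples[0])
    best = None
    for a, b in combinations(genes, 2):
        predictions = [s[a] < s[b] for s in samples]
        hits = sum(p == bool(y) for p, y in zip(predictions, labels))
        # The flipped rule (a >= b predicts class 1) scores len(labels) - hits.
        accuracy = max(hits, len(labels) - hits) / len(labels)
        if best is None or accuracy > best[0]:
            best = (accuracy, a, b)
    return best


# Made-up expression values; the second sample in each class is on a
# roughly tenfold larger scale, yet the g1-versus-g2 ordering is unchanged.
samples = [
    {"g1": 5.0, "g2": 9.0, "g3": 2.0},
    {"g1": 50.0, "g2": 80.0, "g3": 95.0},
    {"g1": 7.0, "g2": 3.0, "g3": 1.0},
    {"g1": 70.0, "g2": 40.0, "g3": 90.0},
]
labels = [1, 1, 0, 0]

print(best_pair_split(samples, labels))
```

Because the rule depends only on the ordering within a sample, it is unaffected by per-sample scaling, which is one reason such pairwise rules are used on high-throughput molecular data.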
Pub Date: 2024-04-17 DOI: 10.1093/bioinformatics/btae270
Junmin Wang, Steven Novick
MOTIVATION The clinical translation of mass spectrometry-based proteomics has been challenging due to limited statistical power caused by large technical variability and inter-patient heterogeneity. Bottom-up proteomics provides an indirect measurement of proteins through digested peptides. This raises the question whether peptide measurements can be used directly to better distinguish differentially expressed proteins. RESULTS We present a novel method called the peptide set test, which detects coordinated changes in the expression of peptides originating from the same protein and compares them to the rest of the peptidome. Applying our method to data from a published spike-in experiment and simulations demonstrates improved sensitivity without compromising precision, compared to aggregation-based approaches. Additionally, applying the peptide set test to compare the tumor proteomes of tamoxifen-sensitive and tamoxifen-resistant breast cancer patients reveals significant alterations in peptide levels of collagen XII, suggesting an association between collagen XII-mediated matrix reassembly and tamoxifen resistance. Our study establishes the peptide set test as a powerful peptide-centric strategy to infer differential expression in proteomics studies. AVAILABILITY Peptide Set Test (PepSetTest) is publicly available at https://github.com/JmWangBio/PepSetTest. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Title: Peptide Set Test: a Peptide-Centric Strategy to Infer Differentially Expressed Proteins. Journal: Bioinformatics
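The core idea of the peptide set test described above, comparing the peptides of one protein against the rest of the peptidome, can be sketched as a competitive two-sample comparison of peptide-level statistics. This is a simplified illustration and not the PepSetTest implementation: it assumes independent peptides, uses a normal approximation for the p-value, and all names and numbers are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def competitive_set_test(peptide_stats, protein_of, protein):
    """Compare the mean statistic of one protein's peptides with the mean
    of all other peptides. Returns (z, two-sided p-value)."""
    in_set = [t for pep, t in peptide_stats.items() if protein_of[pep] == protein]
    rest = [t for pep, t in peptide_stats.items() if protein_of[pep] != protein]
    if len(in_set) < 2 or len(rest) < 2:
        raise ValueError("need at least two peptides inside and outside the set")
    # Welch-style standard error; ignores correlation between peptides (assumption).
    se = sqrt(variance(in_set) / len(in_set) + variance(rest) / len(rest))
    z = (mean(in_set) - mean(rest)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical peptide-level t-statistics from a differential analysis.
peptide_stats = {
    "A1": 3.1, "A2": 2.8, "A3": 3.5, "A4": 2.9,
    "B1": -0.4, "B2": 0.3, "B3": 0.1, "B4": -0.2,
    "C1": 0.5, "C2": -0.1, "C3": 0.2, "C4": -0.3,
}
protein_of = {pep: pep[0] for pep in peptide_stats}

z_a, p_a = competitive_set_test(peptide_stats, protein_of, "A")
print(f"protein A: z = {z_a:.2f}, p = {p_a:.3g}")
```

Working on peptide-level statistics avoids collapsing peptides into a single protein value first, which is the contrast with aggregation-based approaches that the abstract draws.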
Pub Date: 2024-04-17 DOI: 10.1093/bioinformatics/btae203
Donghyung Lee, S. Bacanu
MOTIVATION As larger and more ethnically diverse reference panels become available, demand is increasing for ancestry-informed imputation of genome-wide association studies (GWAS) and for other downstream analyses, e.g., fine-mapping. Performing such analyses at the genotype level is computationally challenging and necessitates, at best, a laborious process to access individual-level genotype and phenotype data. Summary-statistics-based tools, which do not require individual-level data, provide an efficient alternative that streamlines computational requirements and promotes open science by simplifying the re-analysis and downstream analysis of existing GWAS summary data. However, existing tools perform only disparate parts of the needed analysis, have only command-line interfaces, and are difficult for applied researchers to extend or link. RESULTS To address these challenges, we present GAUSS, a comprehensive and user-friendly R package designed to facilitate the re-analysis and downstream analysis of GWAS summary statistics. GAUSS offers an integrated toolkit for a range of functionalities, including i) estimating ancestry proportion of study cohorts, ii) calculating ancestry-informed linkage disequilibrium, iii) imputing summary statistics of unobserved variants, iv) conducting transcriptome-wide association studies, and v) correcting for "Winner's Curse" biases. Notably, GAUSS utilizes an expansive, multi-ethnic reference panel consisting of 32,953 genomes from 29 ethnic groups. This panel enhances the range and accuracy of imputable variants, including the ability to impute summary statistics of rarer variants. As a result, GAUSS elevates the quality and applicability of existing GWAS analyses without requiring access to subject-level genotypic and phenotypic information. AVAILABILITY AND IMPLEMENTATION The GAUSS R package, complete with its source code, is publicly accessible via our GitHub repository at https://github.com/statsleelab/gauss. To further assist users, we provide illustrative use-case scenarios at https://statsleelab.github.io/gauss/, along with a comprehensive user guide in Supplementary Text 1 of the Supplementary Data. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Title: GAUSS: a summary-statistics-based R package for accurate estimation of linkage disequilibrium for variants, gaussian imputation and TWAS analysis of cosmopolitan cohorts. Journal: Bioinformatics
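Imputing a summary statistic for an unobserved variant, as listed in the GAUSS abstract above, is commonly done with the conditional mean of a multivariate normal: the imputed z-score is Σ_uo Σ_oo⁻¹ z_o, where Σ_oo is the LD (correlation) matrix among observed variants and Σ_uo the LD between the unobserved variant and the observed ones. The sketch below shows that calculation in plain Python; it is not the GAUSS R API, the function names are hypothetical, and the optional `ridge` term is an assumed stabiliser, not a documented GAUSS parameter.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def impute_z(ld_oo, ld_uo, z_obs, ridge=0.0):
    """Conditional-mean z-score of one unobserved variant.

    ld_oo: LD matrix among observed variants (list of lists)
    ld_uo: LD between the unobserved variant and each observed variant
    z_obs: observed z-scores
    ridge: value added to the diagonal of ld_oo (assumed regularisation)
    """
    n = len(z_obs)
    reg = [[ld_oo[i][j] + (ridge if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    solved = solve(reg, z_obs)  # equals inverse(ld_oo) applied to z_obs
    return sum(r * s for r, s in zip(ld_uo, solved))


# Made-up LD values: two observed variants correlated at 0.5,
# each correlated at 0.6 with the variant to impute.
z_imputed = impute_z([[1.0, 0.5], [0.5, 1.0]], [0.6, 0.6], [3.0, 3.0])
print(round(z_imputed, 6))
```

In practice the LD matrices would come from a reference panel matched to the cohort's ancestry, which is where the ancestry-informed LD estimation in the abstract enters.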
Pub Date: 2024-04-17 DOI: 10.1093/bioinformatics/btae179
Artuur Couckuyt, Benjamin Rombaut, Yvan Saeys, S. van Gassen
MOTIVATION We describe a new Python implementation of FlowSOM, a clustering method for cytometry data. RESULTS This implementation is faster than the original version in R, is better adapted to single-cell omics data, including integration with current single-cell data structures, and includes all the original visualizations, such as the star and pie plots. AVAILABILITY The FlowSOM Python implementation is freely available on GitHub: https://github.com/saeyslab/FlowSOM_Python. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Title: Efficient cytometry analysis with FlowSOM in python boosts interoperability with other single-cell tools. Journal: Bioinformatics
MOTIVATION It is difficult to generate new molecules with desirable bioactivity through ligand-based de novo drug design, and receptor-based de novo drug design is constrained by the availability of disease target information. Combining artificial intelligence with phenotype-based de novo drug design can generate new bioactive molecules independently of disease target information. Gene expression profiles can be used to characterize biological phenotypes. The Transformer model can capture the associations between gene expression profiles and molecular structures because of its strong ability to process contextual information. RESULTS We propose TransGEM (Transformer-based model from gene expression to molecules), a phenotype-based de novo drug design model. A specialized gene expression encoder embeds the gene expression difference values between diseased cell lines and their corresponding normal tissue cells into the TransGEM model. The results demonstrate that the TransGEM model can generate molecules with desirable evaluation metrics and property distributions. Case studies illustrate that the TransGEM model can generate structurally novel molecules with good binding affinity to disease target proteins. The majority of genes with high attention scores obtained from the TransGEM model are associated with the onset of the disease, indicating the potential of these genes as disease targets. This study therefore provides a new paradigm for de novo drug design and will promote phenotype-based drug discovery. AVAILABILITY The code is available at https://github.com/hzauzqy/TransGEM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Title: TransGEM: a molecule generation model based on transformer with gene expression data. Journal: Bioinformatics
Pub Date: 2024-04-17 DOI: 10.1093/bioinformatics/btae189
Yanguang Liu, Hailong Yu, Xinya Duan, Xiaomin Zhang, Ting Cheng, Feng Jiang, Hao Tang, Yao Ruan, Miao Zhang, Hongyu Zhang, Qingye Zhang
Pub Date: 2024-04-17 DOI: 10.1093/bioinformatics/btae272
Hen-Huang Chen, A. Zwaenepoel, Yves Van de Peer
MOTIVATION Major improvements in sequencing technologies and genome sequence assembly have led to a huge increase in the number of available genome sequences. In turn, these genome sequences form an invaluable source for evolutionary, ecological, and comparative studies. One kind of analysis that has become routine is the search for traces of ancient polyploidy, particularly in plant genomes, where whole-genome duplication (WGD) is rampant. RESULTS Here, we present wgd v2, a major update of the previously developed tool wgd, to look for remnants of ancient polyploidy, or WGD. We implemented novel tools and improved previously developed ones to a) construct KS age distributions for the whole paranome (the collection of all duplicated genes in a genome), b) unravel intra- and inter-genomic collinearity resulting from WGDs, c) fit mixture models to age distributions of gene duplicates, d) correct substitution rate variation for phylogenetic placement of WGDs, and e) date ancient WGDs via phylogenetic dating of WGD-retained gene duplicates. The applicability and feasibility of wgd v2 for the identification and the relative and absolute dating of ancient WGDs are demonstrated using different plant genomes. AVAILABILITY wgd v2 is open source and available at https://github.com/heche-psb/wgd. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication.","authors":"Hen-Huang Chen, A. Zwaenepoel, Yves Van de Peer","doi":"10.1093/bioinformatics/btae272","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae272","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-17","publicationTypes":"Journal Article"}
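The mixture-model step in (c) of the wgd v2 abstract can be illustrated with a small, self-contained sketch. It is not the wgd v2 implementation: it assumes log-normal components (so a Gaussian mixture is fitted to log-transformed KS values), uses a plain expectation-maximization loop, and runs on simulated data. The function names `fit_gmm_1d` and `simulate_log_ks`, the component count, and all simulation parameters are hypothetical choices made for illustration only.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, k=2, iterations=100, min_var=1e-6):
    """Fit a k-component 1D Gaussian mixture with plain EM.

    Returns (weight, mean, std) tuples sorted by mean.
    """
    n = len(values)
    ordered = sorted(values)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    grand_mean = sum(values) / n
    shared_var = max(sum((v - grand_mean) ** 2 for v in values) / n, min_var)
    variances = [shared_var] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for v in values:
            dens = [w * normal_pdf(v, m, s2) for w, m, s2 in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, values)) / nj
            var_j = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, values)) / nj
            variances[j] = max(var_j, min_var)

    stds = [math.sqrt(s2) for s2 in variances]
    return sorted(zip(weights, means, stds), key=lambda c: c[1])


def simulate_log_ks(seed=1):
    """Hypothetical toy data: log(KS) of 600 recent duplicates and 400 WGD-derived pairs."""
    rng = random.Random(seed)
    recent = [rng.gauss(math.log(0.1), 0.4) for _ in range(600)]
    wgd_peak = [rng.gauss(math.log(1.0), 0.2) for _ in range(400)]
    return recent + wgd_peak


if __name__ == "__main__":
    for weight, mean, std in fit_gmm_1d(simulate_log_ks()):
        # exp(mean) is the median KS of a log-normal component.
        print(f"weight={weight:.2f}  median KS={math.exp(mean):.2f}  sd(log KS)={std:.2f}")
```

In this toy setting the component with the larger mean plays the role of a putative WGD peak. On real data the number of components would itself have to be chosen, for example with an information criterion, and a peak in a KS distribution is only suggestive of a WGD without supporting collinearity evidence.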
Pub Date : 2024-04-16 DOI: 10.1093/bioinformatics/btae268
Zachary A Rollins, Talal Widatalla, Andrew Waight, Alan C Cheng, Essam Metwally
MOTIVATION Pre-trained protein language and/or structural models are often fine-tuned on drug development properties (i.e., developability properties) to accelerate drug discovery initiatives. However, these models generally rely on a single structural conformation and/or a single sequence as a molecular representation. We present AbLEF (Antibody Language Ensemble Fusion), a physics-based model in which 3D conformational ensemble representations are fused by a transformer-based architecture and concatenated to a language representation to predict antibody protein properties. AbLEF infuses thermodynamic information directly into the latent space, which enhances property prediction by explicitly capturing the dynamic molecular behavior that occurs during experimental measurement. RESULTS We showcase the AbLEF model on two developability properties: hydrophobic interaction chromatography retention time (HIC-RT) and temperature of aggregation (Tagg). We find that (1) 3D conformational ensembles generated from molecular simulation can further improve antibody property prediction for small datasets, (2) the performance benefit from 3D conformational ensembles matches shallow machine learning methods in the small data regime, and (3) fine-tuned large protein language models can match smaller antibody-specific language models at predicting antibody properties. AVAILABILITY AND IMPLEMENTATION The AbLEF codebase is available at https://github.com/merck/AbLEF. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"AbLEF: Antibody Language Ensemble Fusion for thermodynamically empowered property predictions.","authors":"Zachary A Rollins, Talal Widatalla, Andrew Waight, Alan C Cheng, Essam Metwally","doi":"10.1093/bioinformatics/btae268","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae268","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-16","publicationTypes":"Journal Article"}
MOTIVATION With the rapid advancement of single-cell sequencing technology, it is becoming possible to study cellular responses to various external perturbations at the gene expression level. However, obtaining perturbed samples can be considerably challenging in certain scenarios, and the substantial cost of sequencing also limits the feasibility of large-scale experimentation. A range of methods has been used to predict perturbation responses in single-cell gene expression. However, existing methods primarily focus on the average response of a specific cell type to perturbation, overlooking the single-cell specificity of perturbation responses and falling short of a comprehensive prediction of the entire perturbation response distribution. RESULTS Here we present scPRAM, a method for predicting Perturbation Responses in single-cell gene expression based on Attention Mechanisms. Leveraging variational autoencoders and optimal transport, scPRAM aligns cell states before and after perturbation, and then uses attention mechanisms to predict gene expression responses to perturbations for unseen cell types. Experiments on multiple real perturbation datasets involving drug treatments and bacterial infections demonstrate that scPRAM predicts perturbation responses across cell types, species, and individuals more accurately than existing methods. Furthermore, scPRAM performs well in identifying differentially expressed genes under perturbation, capturing heterogeneity in perturbation responses across species, and remaining stable in the presence of data noise and sample size variations. AVAILABILITY AND IMPLEMENTATION https://github.com/jiang-q19/scPRAM and https://doi.org/10.5281/zenodo.10935038. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
{"title":"scPRAM accurately predicts single-cell gene expression perturbation response based on attention mechanism.","authors":"Qun Jiang, Shengquan Chen, Xiaoyang Chen, Rui Jiang","doi":"10.1093/bioinformatics/btae265","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae265","journal":{"name":"Bioinformatics"},"PeriodicalIF":5.8,"publicationDate":"2024-04-15","publicationTypes":"Journal Article"}
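The scPRAM abstract describes aligning cell states before and after perturbation with optimal transport and then predicting responses through attention. The following toy sketch illustrates that general idea only and is not the scPRAM code: it skips the variational autoencoder entirely and treats raw 2D vectors as stand-ins for latent codes, uses entropy-regularised optimal transport (Sinkhorn iterations) with uniform marginals, and applies generic scaled dot-product attention. All function names, the regularisation strength `epsilon`, and the toy data are assumptions made for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sinkhorn_plan(source, target, epsilon=1.0, iterations=200):
    """Entropy-regularised OT plan with uniform marginals (Sinkhorn iterations)."""
    n, m = len(source), len(target)
    kernel = [[math.exp(-sq_dist(s, t) / epsilon) for t in target] for s in source]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def response_vectors(source, target, plan):
    """Per-cell response: barycentric image under the plan minus the control state."""
    deltas = []
    for s, row in zip(source, plan):
        mass = sum(row)
        image = [sum(p * t[d] for p, t in zip(row, target)) / mass for d in range(len(s))]
        deltas.append([image[d] - s[d] for d in range(len(s))])
    return deltas


def attention_weights(query, keys):
    """Scaled dot-product attention weights (softmax over the keys)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_perturbed(query, source, deltas):
    """Shift a query cell by the attention-weighted average of known responses."""
    weights = attention_weights(query, source)
    shift = [sum(w * d[k] for w, d in zip(weights, deltas)) for k in range(len(query))]
    return [q + s for q, s in zip(query, shift)]


# Hypothetical toy data: four control cells and their perturbed counterparts,
# where the perturbation is a constant shift of (+2, +1).
CONTROL = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
PERTURBED = [[x + 2.0, y + 1.0] for x, y in CONTROL]

if __name__ == "__main__":
    plan = sinkhorn_plan(CONTROL, PERTURBED)
    deltas = response_vectors(CONTROL, PERTURBED, plan)
    print(predict_perturbed([0.5, 0.2], CONTROL, deltas))
```

Because the entropic regularisation blurs the transport plan, individual response vectors are pulled toward the population mean; only their average recovers the true shift exactly in this toy case. Smaller `epsilon` values sharpen the matching but make the plain Sinkhorn loop converge more slowly.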