Pub Date: 2024-09-16 DOI: 10.1101/2024.09.12.612510
Huangqingbo Sun, Shiqiu Yu, Anna Martinez Casals, Anna Bäckström, Yuxin Lu, Cecilia Lindskog, Emma Lundberg, Robert F. Murphy
Identifying cell types in highly multiplexed images is essential for understanding tissue spatial organization. Current cell type annotation methods often rely on extensive reference images and manual adjustments. In this work, we present a tool, Robust Image-Based Cell Annotator (RIBCA), that enables accurate, automated, unbiased, and fine-grained cell type annotation for images with a wide range of antibody panels, without requiring additional model training or human intervention. Our tool has successfully annotated over 1 million cells, revealing the spatial organization of various cell types across more than 40 different human tissues. It is open-source and features a modular design, allowing for easy extension to additional cell types.
Title: Flexible and robust cell type annotation for highly multiplexed tissue images
Journal: bioRxiv - Bioinformatics
Pub Date: 2024-09-15 DOI: 10.1101/2024.09.12.612771
Pei-Dong Zhang, Jianzhu Ma, Ting Chen
Given the high cost of determining reaction affinities through in-vitro experiments, virtual screening of vast compound libraries for potential drugs that bind specific protein pockets is critical in AI-assisted drug discovery. Deep-learning approaches have been proposed for Drug-Target Interaction (DTI) prediction. However, they have shown overestimated accuracy because of the drug-bias trap, a challenge that results from excessive reliance on the drug branch in the traditional drug-protein dual-branch network approach. This casts doubt on the interpretability and generalizability of existing DTI models. Therefore, we introduce UdanDTI, an innovative deep-learning architecture designed specifically for predicting drug-protein interactions. UdanDTI applies an unbalanced dual-branch system and an attentive aggregation module to enhance interpretability from a biological perspective. Across various public datasets, UdanDTI demonstrates outstanding performance, outperforming state-of-the-art models under in-domain, cross-domain, and structural interpretability settings. Notably, it demonstrates exceptional accuracy in predicting drug responses of two crucial subgroups of Epidermal Growth Factor Receptor (EGFR) mutations associated with non-small cell lung cancer, consistent with experimental results. Meanwhile, UdanDTI could complement the advanced molecular docking software DiffDock. The code and datasets of UdanDTI are available at https://github.com/CQ-zhang-2016/UdanDTI.
Title: Escaping the drug-bias trap: using debiasing design to improve interpretability and generalization of drug-target interaction prediction
Pub Date: 2024-09-15 DOI: 10.1101/2024.09.13.612805
Badhan Das, Lenwood S. Heath
The SARS-CoV-2 virus has undergone mutations over time, leading to genetic diversity among circulating viral strains. This genetic diversity can affect the characteristics of the virus, including its transmissibility and the severity of symptoms in infected individuals. During the pandemic, this frequent mutation creates an enormous cloud of variants known as viral quasispecies. Most variation is lost due to the tight bottlenecks imposed by transmission and survival. Advancements in next-generation sequencing have facilitated the rapid and cost-effective production of complete viral genomes, enabling the ongoing monitoring of the evolution of the SARS-CoV-2 genome. However, inferring a reliable phylogeny from GISAID (the Global Initiative on Sharing All Influenza Data) is daunting due to the vast number of sequences. In the face of this complexity, this research proposes a new method, inspired by quasispecies theory, of representing the evolutionary and epidemiological relationships among the SARS-CoV-2 variants. We aim to build a Variant Evolution Graph (VEG), a novel way to model viral evolution in a local pandemic region based on the mutational distance between the genotypes of the variants. VEG is a directed acyclic graph and not necessarily a tree because a variant can evolve from more than one variant; here, the vertices represent the genotypes of the variants associated with their human hosts, and the edges represent the evolutionary relationships among these variants. A disease transmission network, DTN, which represents the transmission relationships among the hosts, is also proposed and derived from the VEG. From GISAID, we downloaded the variant genotypes that are complete, have high coverage, and have a complete collection date for five countries: Somalia (22), Bhutan (102), Hungary (581), Iran (1334), and Nepal (1719). We ran our algorithm on these datasets to get the evolution history of the variants, build the variant evolution graph represented by an adjacency matrix, and infer the disease transmission network. Our research represents a novel and unprecedented contribution to the field of viral evolution, offering new insights and approaches not explored in prior studies.
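The graph idea in this abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes that genotypes are aligned strings of equal length, that mutational distance is the Hamming distance, and that each variant is linked to every strictly earlier-collected variant at the minimum distance. The function names, the tie rule and the toy sequences are all hypothetical.

```python
from datetime import date


def hamming(a: str, b: str) -> int:
    """Count the positions at which two aligned genotypes differ."""
    if len(a) != len(b):
        raise ValueError("genotypes must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def build_veg(variants):
    """Build a toy variant evolution graph.

    variants: dict mapping name -> (collection_date, genotype).
    Returns a dict mapping each variant to a sorted list of its parents.
    Edges only run from strictly earlier to later collection dates
    (an assumption), so the result cannot contain a cycle.
    """
    parents = {}
    for child, (c_date, c_geno) in variants.items():
        earlier = [
            (hamming(p_geno, c_geno), parent)
            for parent, (p_date, p_geno) in variants.items()
            if p_date < c_date
        ]
        if not earlier:
            parents[child] = []  # earliest variant: a root of the graph
            continue
        best = min(d for d, _ in earlier)
        # Keep every earlier variant at the minimum distance, so a
        # variant may have more than one parent (a DAG, not a tree).
        parents[child] = sorted(p for d, p in earlier if d == best)
    return parents


# Hypothetical toy data, not taken from GISAID.
toy = {
    "v1": (date(2021, 1, 1), "AAAA"),
    "v2": (date(2021, 1, 5), "AAAT"),
    "v3": (date(2021, 1, 5), "AACA"),
    "v4": (date(2021, 1, 9), "AACT"),
}
veg = build_veg(toy)
print(veg)
```

In this toy input, `v4` is one mutation away from both `v2` and `v3`, so it receives two parents, which is the situation in which the graph stops being a tree.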
Title: Variant Evolution Graph: Can We Infer How SARS-CoV-2 Variants are Evolving?
Pub Date: 2024-09-15 DOI: 10.1101/2024.09.11.612490
Gareth John Morgan, Allison N Nau, Sherry Wong, Brian H Spencer, Yun Shen, Axin Hua, Matthew J Bullard, Vaishali Sanchorawala, Tatiana Prokaeva
Background: Each monoclonal antibody light chain associated with AL amyloidosis has a unique sequence. Defining how these sequences lead to amyloid deposition could facilitate faster diagnosis and lead to new treatments. Methods: Light chain sequences are collected in the Boston University AL-Base repository. Monoclonal sequences from AL amyloidosis, multiple myeloma and the healthy polyclonal immune repertoire were compared to identify differences in precursor gene use, mutation frequency and physicochemical properties. Results: AL-Base now contains 2,193 monoclonal light chain sequences from plasma cell dyscrasias. Sixteen germline precursor genes were enriched in AL amyloidosis, relative to multiple myeloma and the polyclonal repertoire. Two genes, IGKV1-16 and IGLV1-36, were infrequently observed but highly enriched in AL amyloidosis. The number of mutations varied widely between light chains. AL-associated κ light chains harbored significantly more mutations compared to multiple myeloma and polyclonal sequences, whereas AL-associated λ light chains had fewer mutations. Machine learning tools designed to predict amyloid propensity were less accurate for new sequences than their original training data. Conclusions: Rarely-observed light chain variable genes may carry a high risk of AL amyloidosis. New approaches are needed to define sequence-associated risk factors for AL amyloidosis. AL-Base is a foundational resource for such studies.
Title: An updated AL-Base reveals ranked enrichment of immunoglobulin light chain variable genes in AL amyloidosis
Pub Date: 2024-09-14 DOI: 10.1101/2024.09.10.612089
Carlos Cruz-Castillo, Luca Fumis, Chintan Mehta, Ricardo Esteban Martinez-Osorio, Juan Maria Roldan-Romero, Helena Cornu, Prashant Uniyal, Antonio Solano-Roman, Miguel Carmona, David Ochoa, Ellen M McDonagh, Annalisa Buniello
The Open Targets Platform (https://platform.opentargets.org) is a unique, comprehensive, open-source resource supporting systematic identification and prioritisation of targets for drug discovery. The Platform combines, harmonises and integrates data from >20 diverse sources to provide target-disease associations, covering evidence derived from genetic associations, somatic mutations, known drugs, differential expression, animal models, pathways and systems biology. An in-house target identification scoring framework weighs the evidence from each data source and type, contributing to an overall score for each of the 7.8M target-disease associations. However, the previous infrastructure did not allow user-led dynamic adjustments in the contribution of different evidence types for target prioritisation, a limitation frequently raised by our user community. Furthermore, the previous Platform user interface did not support navigation and exploration of the underlying target-disease evidence on the same page, occasionally making the user journey counterintuitive. Here, we describe Associations on the Fly (AOTF), a new Platform feature - developed as part of a wider product refactoring project - to enable formulation of more flexible and impactful therapeutic hypotheses through dynamic adjustment of the weight of contributing evidence from each source, altering the prioritisation of targets.
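To make the effect of adjustable evidence weights concrete, here is a minimal sketch. The abstract does not give the scoring formula, so the weighted harmonic-sum aggregation below is an assumption chosen for illustration, and the function names, source names, scores and weights are all hypothetical.

```python
def overall_score(source_scores, weights):
    """Combine per-source scores (each in 0..1) into one association score.

    Assumed scheme: multiply each score by its source weight, sort the
    results in descending order, divide the i-th largest by i**2, sum,
    and normalise by the maximum attainable with that many sources.
    """
    weighted = sorted(
        (score * weights.get(source, 1.0) for source, score in source_scores.items()),
        reverse=True,
    )
    raw = sum(value / (i ** 2) for i, value in enumerate(weighted, start=1))
    max_raw = sum(1 / (i ** 2) for i in range(1, len(weighted) + 1))
    return raw / max_raw if max_raw else 0.0


def rank_targets(evidence, weights):
    """Return target names ordered from highest to lowest overall score."""
    return sorted(
        evidence,
        key=lambda target: overall_score(evidence[target], weights),
        reverse=True,
    )


# Hypothetical evidence for two targets of one disease.
evidence = {
    "TARGET_A": {"genetic_association": 0.9, "known_drug": 0.1},
    "TARGET_B": {"genetic_association": 0.2, "known_drug": 0.8},
}
default_weights = {"genetic_association": 1.0, "known_drug": 1.0}
drug_only_weights = {"genetic_association": 0.0, "known_drug": 1.0}

print(rank_targets(evidence, default_weights))
print(rank_targets(evidence, drug_only_weights))
```

With equal weights the genetically supported target ranks first. Setting the genetic weight to zero reverses the order, which is the kind of re-prioritisation the abstract describes.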
Title: Associations on the Fly, a new feature aiming to facilitate exploration of the Open Targets Platform evidence
Pub Date: 2024-09-14 DOI: 10.1101/2024.09.11.612399
Artyom A. Egorov, Gemma C. Atkinson
Summary: Comparative genomic analysis often involves visualisation of alignments of genomic loci. While several software tools are available for this task, ranging from Python and R libraries to standalone graphical user interfaces, there is a lack of tools that offer fast, automated usage and produce publication-ready vector images. Here we present LoVis4u, a command-line tool and Python API designed for highly customisable and fast visualisation of multiple genomic loci. LoVis4u generates vector images in PDF format based on annotation data from GenBank or GFF files. It is capable of visualising entire genomes of bacteriophages as well as plasmids and user-defined regions of longer prokaryotic genomes. Additionally, LoVis4u offers optional data processing steps to identify and highlight accessory and core genes in input sequences. Availability and Implementation: LoVis4u is implemented in Python3 and runs on Linux and MacOS. The command-line interface covers most practical use cases, while the provided Python API allows usage within a Python program, integration into external tools, and additional customisation. Source code is available at the GitHub page: github.com/art-egorov/lovis4u. Detailed documentation that includes an example-driven guide is available from the software home page: art-egorov.github.io/lovis4u.
Title: LoVis4u: Locus Visualisation tool for comparative genomics
Pub Date: 2024-09-14 DOI: 10.1101/2024.09.11.612340
Gabriele Scalia, Steven T Rutherford, Ziqing Lu, Kerry R Buchholz, Nicholas Skelton, Kangway Chuang, Nathaniel Diamant, Jan-Christian Huetter, Jerome Luescher, Ahn Miu, Jeff Blaney, Leo Gendelev, Elizabeth Skippington, Greg Zynda, Nia Dickson, Michal Koziarski, Yoshua Bengio, Aviv Regev, Man-Wah Tan, Tommaso Biancalani
The proliferation of multi-drug-resistant bacteria underscores an urgent need for novel antibiotics. Traditional discovery methods face challenges due to limited chemical diversity, high costs, and difficulties in identifying structurally novel compounds. Here, we explore the integration of small molecule high-throughput screening with a deep learning-based virtual screening approach to uncover new antibacterial compounds. Leveraging a diverse library of nearly 2 million small molecules, we conducted comprehensive phenotypic screening against a sensitized Escherichia coli strain that, at a low hit rate, yielded thousands of hits. We trained a deep learning model, GNEprop, to predict antibacterial activity, ensuring robustness through out-of-distribution generalization techniques. Virtual screening of over 1.4 billion compounds identified potential candidates, of which 82 exhibited antibacterial activity, illustrating a 90X improved hit rate over the high-throughput screening experiment GNEprop was trained on. Importantly, a significant portion of these newly identified compounds exhibited high dissimilarity to known antibiotics, indicating promising avenues for further exploration in antibiotic discovery.
Title: A high-throughput phenotypic screen combined with an ultra-large-scale deep learning-based virtual screening reveals novel scaffolds of antibacterial compounds
The collision cross section (CCS) of peptide ions provides an important separation dimension in liquid chromatography/tandem mass spectrometry-based proteomics that incorporates ion mobility spectrometry (IMS), and its accurate prediction is the basis for advanced proteomics workflows. This paper describes novel experimental data and a novel prediction model for challenging CCS prediction tasks, including longer peptides that tend to have higher charge states. The proposed model is based on a pretrained deep protein language model. While the conventional prediction model requires training from scratch, the proposed model can be trained in less time because it uses the pretrained model as a feature extractor. Experiments with the novel data show that the proposed model drastically reduces the training time while matching or exceeding the prediction performance of the conventional method. Our approach opens the possibility of greener prediction of various peptide properties in proteomic liquid chromatography/tandem mass spectrometry experiments.
Title: Leveraging Pretrained Deep Protein Language Model to Predict Peptide Collision Cross Section
Pub Date: 2024-09-14 DOI: 10.1101/2024.09.11.612388
Ayano Nakai-Kasai, Kosuke Ogata, Yasushi Ishihama, Toshiyuki Tanaka
Pub Date : 2024-09-14 DOI: 10.1101/2024.09.10.612385
Xueli Xu, Yanran Liang, Miaoxiu Tang, Jiongliang Wang, Xi Wang, Yixue Li, Jie Wang
Single cells exhibit heterogeneous gene expression profiles and chromatin accessibility, measurable separately via single-cell RNA sequencing (scRNA-seq) and single-cell transposase chromatin accessibility sequencing (scATAC-seq). Consequently, each cell possesses a unique gene regulatory network. However, few methods exist for inferring cell-specific regulatory networks, particularly by integrating scRNA-seq and scATAC-seq data. Here, we develop a novel algorithm named single-cell regulatory network inference (ScReNI), which leverages k-nearest neighbors and random forest algorithms to integrate scRNA-seq and scATAC-seq data and infer gene regulatory networks at the single-cell level. ScReNI is built to analyze both paired and unpaired scRNA-seq and scATAC-seq datasets. Using these two types of single-cell sequencing datasets, we validate that a higher fraction of the regulatory relationships inferred by ScReNI is detected in chromatin immunoprecipitation sequencing (ChIP-seq) data. ScReNI shows superior performance in network-based cell clustering when compared to existing single-cell network inference methods. Importantly, ScReNI offers the unique function of identifying cell-enriched regulators based on each cell-specific network. In summary, ScReNI facilitates the inference of cell-specific regulatory networks and cell-enriched regulators.
{"title":"ScReNI: single-cell regulatory network inference through integrating scRNA-seq and scATAC-seq data","authors":"Xueli Xu, Yanran Liang, Miaoxiu Tang, Jiongliang Wang, Xi Wang, Yixue Li, Jie Wang","doi":"10.1101/2024.09.10.612385","DOIUrl":"https://doi.org/10.1101/2024.09.10.612385","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-14"}
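The neighbourhood idea behind cell-specific networks can be shown with a toy sketch. It is not the ScReNI implementation. It uses expression values only (the scATAC-seq integration is omitted), and absolute Pearson correlation is swapped in as a simple stand-in for random-forest feature importance. The function names and the eight-cell matrix are invented for illustration. Each cell gets its own network by scoring regulator-target pairs only within that cell's k nearest neighbours.

```python
import math


def nearest_neighbours(cells, idx, k):
    """Indices of the k cells closest to cells[idx] in Euclidean distance."""
    order = sorted(range(len(cells)), key=lambda j: math.dist(cells[idx], cells[j]))
    return order[:k]


def abs_pearson(x, y):
    """Absolute Pearson correlation; 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def cell_specific_network(cells, idx, k, regulators, targets):
    """Score regulator -> target edges using only the neighbourhood of one cell."""
    hood = nearest_neighbours(cells, idx, k)
    network = {}
    for r in regulators:
        for t in targets:
            if r != t:
                x = [cells[c][r] for c in hood]
                y = [cells[c][t] for c in hood]
                network[(r, t)] = abs_pearson(x, y)
    return network


# Toy matrix: rows are cells, columns are genes 0, 1, 2 (values invented).
# In the first four cells gene 1 tracks gene 0; in the last four it is flat.
cells = [
    [1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0],
    [1.0, 5.0, 100.0], [2.0, 5.0, 100.0], [3.0, 5.0, 100.0], [4.0, 5.0, 100.0],
]
net_a = cell_specific_network(cells, 0, 4, regulators=[0], targets=[1])
net_b = cell_specific_network(cells, 4, 4, regulators=[0], targets=[1])
```

The same regulator-target pair receives a different score in `net_a` and `net_b` because each network is estimated from a different neighbourhood, which is what makes the inferred networks cell-specific.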
Pub Date : 2024-09-14 DOI: 10.1101/2024.09.10.612211
Christoph Elfmann, Vincenz Dumann, Tim van den Berg, Jorg Stulke
Bacillus subtilis is a Gram-positive model bacterium and one of the most studied and best understood organisms. The complex information resulting from its investigation is compiled in the database SubtiWiki (https://subtiwiki.uni-goettingen.de/v5) in an integrated and intuitive manner. To enhance the utility of SubtiWiki, we have added novel features such as a viewer to interrogate conserved genomic organization, a widget that shows mutant fitness data for all non-essential genes, and a widget showing protein structures, structure predictions and complex structures. Moreover, we have integrated metabolites as new entities. The new framework also includes a documented API, enabling programmatic access to data for computational tasks. Here we present the recent developments of SubtiWiki and the current state of the data for this organism.
{"title":"A new framework for SubtiWiki, the database for the model organism Bacillus subtilis","authors":"Christoph Elfmann, Vincenz Dumann, Tim van den Berg, Jorg Stulke","doi":"10.1101/2024.09.10.612211","DOIUrl":"https://doi.org/10.1101/2024.09.10.612211","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-14"}