Pub Date : 2024-09-18 DOI: 10.1101/2024.09.02.610764
Lakhansing A. Pardeshi, Inge van Duivenbode, Michiel J.C. Pel, Eef M. Jonkheer, Anne Kupczok, Dick de Ridder, Sandra Smit, Theo van der Lee
Bacterial pathogens of the genus Pectobacterium are responsible for soft rot and blackleg disease in a wide range of crops and have a global impact on food production. The emergence of new lineages and their competitive succession is frequently observed in Pectobacterium species, in particular in P. brasiliense. With a focus on one such recently emerged P. brasiliense lineage in the Netherlands that causes blackleg in potatoes, we studied genome evolution in this genus using a reference-free graph-based pangenome approach. We clustered 1,977,865 proteins from 454 Pectobacterium spp. genomes into 30,156 homology groups. The Pectobacterium genus pangenome is open and its growth is mainly contributed by the accessory genome. Bacteriophage genes were enriched in the accessory genome and contributed 16% of the pangenome. Blackleg-causing P. brasiliense isolates had increased genome size with high levels of prophage integration. To study the diversity and dynamics of these prophages across the pangenome, we developed an approach to trace prophages across genomes using pangenome homology group signatures. We identified lineage-specific as well as generalist bacteriophages infecting Pectobacterium species. Our results capture the ongoing dynamics of mobile genetic elements, even in the clonal lineages. The observed lineage-specific prophage dynamics provide mechanistic insights into Pectobacterium pangenome growth and contribution to the radiating lineages of P. brasiliense.
Title: Pangenomics to understand prophage dynamics in the Pectobacterium genus and the radiating lineages of P. brasiliense. Journal: bioRxiv - Bioinformatics.
Deep learning techniques are increasingly utilized to analyze large-scale single-cell RNA sequencing (scRNA-seq) data, offering valuable insights from complex transcriptome datasets. Geneformer, a pre-trained model using a Transformer Encoder architecture and human scRNA-seq datasets, has demonstrated remarkable success in human transcriptome analysis. However, given the prominence of the mouse, Mus musculus, as a primary mammalian model in biological and medical research, there is an acute need for a mouse-specific version of Geneformer. In this study, we developed a mouse-specific Geneformer (mouse-Geneformer) by constructing a large transcriptome dataset consisting of 21 million mouse scRNA-seq profiles and pre-training Geneformer on this dataset. The mouse-Geneformer effectively models the mouse transcriptome and, upon fine-tuning for downstream tasks, enhances the accuracy of cell type classification. In silico perturbation experiments using mouse-Geneformer successfully identified disease-causing genes that have been validated in in vivo experiments. These results demonstrate the feasibility of analyzing mouse data with mouse-Geneformer and highlight the robustness of the Geneformer architecture, applicable to any species with large-scale transcriptome data available. Furthermore, we found that mouse-Geneformer can analyze human transcriptome data in a cross-species manner. After the ortholog-based gene name conversion, the analysis of human scRNA-seq data using mouse-Geneformer, followed by fine-tuning with human data, achieved cell type classification accuracy comparable to that obtained using the original human Geneformer. In in silico simulation experiments using human disease models, we obtained results similar to human-Geneformer for the myocardial infarction model but only partially consistent results for the COVID-19 model, a trait unique to humans (laboratory mice are not susceptible to SARS-CoV-2). 
These findings suggest the potential for cross-species application of the Geneformer model while emphasizing the importance of species-specific models for capturing the full complexity of disease mechanisms. Despite the existence of the original Geneformer tailored for humans, human research could benefit from mouse-Geneformer due to its inclusion of samples that are ethically or technically inaccessible for humans, such as embryonic tissues and certain disease models. Additionally, this cross-species approach indicates potential use for non-model organisms, where obtaining large-scale single-cell transcriptome data is challenging.
Title: Mouse-Geneformer: A Deep Learning Model for Mouse Single-Cell Transcriptome and Its Cross-Species Utility. Journal: bioRxiv - Bioinformatics.
Keita Ito, Tsubasa Hirakawa, Shuji Shigenobu, Hironobu Fujiyoshi, Takayoshi Yamashita
Pub Date : 2024-09-18 DOI: 10.1101/2024.09.09.611960
Pub Date : 2024-09-18 DOI: 10.1101/2024.09.17.613439
Arindam Ghosh, Vittorio Fortino
Drug combinations, although a key therapeutic strategy against cancer, have yet to reach their full potential because of the challenges involved in identifying effective and safe drug pairs. In vitro or in vivo screening would be the optimal approach were combinatorial explosion not an issue. In silico methods, on the other hand, enable rapid screening of drug pairs so that candidates can be prioritised for experimental validation. Here we present a novel network medicine approach that systematically models the proximity of drug targets to disease-associated genes and adverse effect-associated genes by combining a network propagation algorithm with gene set enrichment analysis. The proposed approach is applied to identifying effective drug combinations for cancer treatment, starting from a training set of drug combinations curated from the DrugComb and DrugBank databases. We observed that effective drug combinations usually enrich disease-related gene sets while adverse drug combinations enrich adverse-effect gene sets. We use this observation to systematically train classifiers that distinguish drug combinations with higher therapeutic effects and no known adverse reactions from combinations with lower therapeutic effects and potential adverse reactions in six cancer types. The approach is tested and validated using drug combinations curated from in vitro screening data and clinical reports. Trained classification models are also used to identify novel potential anti-cancer drug combinations for experimental validation. We believe our framework would be a key addition to the anti-cancer drug combination identification pipeline by enabling rapid yet robust estimation of therapeutic efficacy or adverse reaction potential.
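The abstract does not say which network propagation algorithm is used. As an illustration only, the sketch below uses random walk with restart, one common propagation scheme, on a toy undirected graph. The function name, the `restart` value, and the node labels are assumptions made for this example and are not taken from the paper; the graph is assumed to be connected so that probability mass is conserved.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Propagate scores from seed nodes over an undirected graph.

    adj maps each node to a list of its neighbours (every node is assumed
    to have at least one neighbour); seeds are e.g. drug-target genes.
    """
    nodes = list(adj)
    # Restart distribution: uniform over the seed nodes.
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Each step: restart with probability `restart`, otherwise move
        # to a uniformly chosen neighbour.
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        converged = sum(abs(nxt[v] - p[v]) for v in nodes) < tol
        p = nxt
        if converged:
            break
    return p


# Toy path graph A - B - C - D with a single seed at A.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_graph, seeds={"A"})
```

In this toy graph the score decays with distance from the seed, which is the sense in which propagation captures network proximity. Nodes ranked by such scores could then be passed to a gene set enrichment step.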
Title: Network-based estimation of therapeutic efficacy and adverse reaction potential for prioritisation of anti-cancer drug combinations. Journal: bioRxiv - Bioinformatics.
Pub Date : 2024-09-18 DOI: 10.1101/2024.09.13.612819
Shuhui Wang, Alexandre Allauzen, Philippe Nghe, Vaitea Opuu
Synergistic drug combination screening is a promising strategy in drug discovery, but it involves navigating a costly and complex search space. While AI, particularly deep learning, has advanced synergy prediction, its effectiveness is limited by the low occurrence of synergistic drug pairs. Active learning, which integrates experimental testing into the learning process, has been proposed to address this challenge. In this work, we explore the key components of active learning to provide recommendations for its implementation. We find that molecular encoding has a limited impact on performance, while cellular environment features significantly enhance predictions. Additionally, active learning can discover 60% of synergistic drug pairs by exploring only 10% of the combinatorial space. The synergy yield ratio is observed to be even higher with smaller batch sizes, where dynamic tuning of the exploration-exploitation strategy can further enhance performance.
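To make the batch-wise exploration-exploitation loop concrete, here is a minimal sketch under strong simplifying assumptions: each candidate drug pair is reduced to a single numeric feature, the surrogate model is a 1-nearest-neighbour lookup, and selection is epsilon-greedy. None of these choices comes from the paper, and names such as `active_learning`, `oracle`, and `epsilon` are hypothetical.

```python
import random


def active_learning(features, oracle, budget, batch_size, epsilon=0.2, seed=0):
    """Test `budget` candidates in batches; return {index: observed label}."""
    rng = random.Random(seed)
    labelled = {}
    unlabelled = set(range(len(features)))

    def score(i):
        # Surrogate model: predicted label is that of the closest tested item.
        if not labelled:
            return 0.0
        nearest = min(labelled, key=lambda j: abs(features[i] - features[j]))
        return labelled[nearest]

    while len(labelled) < budget and unlabelled:
        size = min(batch_size, budget - len(labelled), len(unlabelled))
        ranked = sorted(unlabelled, key=score, reverse=True)
        batch = []
        for _ in range(size):
            # Explore with probability epsilon, otherwise exploit the top score.
            pick = rng.choice(ranked) if rng.random() < epsilon else ranked[0]
            ranked.remove(pick)
            batch.append(pick)
        # "Run the experiment" for the whole batch, then update the model.
        for i in batch:
            labelled[i] = oracle(i)
            unlabelled.discard(i)
    return labelled


# Toy pool: about 10% of candidates are "synergistic" (feature above 0.9).
pool_rng = random.Random(1)
pool = [pool_rng.random() for _ in range(200)]
truth = [1 if x > 0.9 else 0 for x in pool]
tested = active_learning(pool, lambda i: truth[i], budget=20, batch_size=5)
```

The model is refreshed only between batches, so a smaller `batch_size` lets it adapt more often for the same budget. Raising or lowering `epsilon` over time is one simple way to tune exploration against exploitation dynamically.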
Title: A Guide for Active Learning in Synergistic Drug Discovery. Journal: bioRxiv - Bioinformatics.
Pub Date : 2024-09-17 DOI: 10.1101/2024.09.14.613048
Angela Dong, Ayana Meegol Rasteh, Hengrui Liu
Background: Mitochondrial DNA repair has gained attention for its potential impact on pan-cancer genetic analysis. This study investigates the clinical relevance of the mitochondrial DNA repair genes PARP1, DNA2, PRIMPOL, TP53, and MGME1. Methods: Using multi-omics profiling data and Gene Set Cancer Analysis (GSCA) with normalized SEM mRNA expression, this research analyzes differential expression, gene mutation, and drug correlation. Results: TP53 was the most commonly mutated mitochondrial-related gene in cancer, with UCS and OV having the highest mutation rates. CPG mutations were linked to the lowest survival rates. Breast cancer, with its various subtypes, was potentially influenced by mitochondrial DNA repair genes. ACC was shown to be high in gene survival analysis. BRCA, UCS, LUSC, COAD, and OV showed CNV levels impacting survival. A negative correlation between gene expression and methylation was observed and was weakest in KIRC. Mitochondrial DNA repair genes were linked to Cell cycle_A activation. A weak correlation was found between immune infiltration and mitochondrial genes. Few drug compounds were shown to be affected by mitochondrial-related genes. Conclusion: Understanding mitochondrial-related genes could redefine cancer diagnosis and prognosis and provide therapeutic biomarkers, potentially altering cancer cell behavior and treatment outcomes.
Title: Pan-Cancer Genetic Analysis of Mitochondrial DNA Repair Gene Set. Journal: bioRxiv - Bioinformatics.
Pub Date : 2024-09-17 DOI: 10.1101/2024.09.09.612081
LeAnn M Lindsey, Nicole L Pershing, Anisa Habib, W. Zac Stephens, Anne J Blaschke, Hari Sundar
Genomic language models have recently emerged as powerful tools to decode and interpret genetic sequences. Existing genomic language models have utilized various tokenization methods including character tokenization, overlapping and non-overlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic models have significant differences from natural language and protein language models because of their low character variability, complex and overlapping features, and inconsistent directionality. These differences make sub-word tokenization in genomic language models significantly different from traditional language models. This study explores the impact of tokenization in attention-based and state-space genomic language models by evaluating their downstream performance on various fine-tuning tasks. We propose new definitions for fertility, the token per word ratio, in the context of genomic language models, and introduce tokenization parity, which measures how consistently a tokenizer parses homologous sequences. We also perform an ablation study on the state-space model, Mamba, to evaluate the impact of character-based tokenization compared to byte-pair encoding. Our results indicate that the choice of tokenizer significantly impacts model performance and that when experiments control for input sequence length, character tokenization is the best choice in state-space models for all evaluated task categories except epigenetic mark prediction.
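The tokenization schemes named above can be illustrated with a few lines of code. The sketch below covers character and k-mer tokenization only (byte-pair encoding needs a trained vocabulary and is omitted). The helper `tokens_per_base` is a crude stand-in used for illustration and is not the fertility definition proposed in the paper; all function names are assumptions.

```python
def char_tokens(seq):
    """Character tokenization: one token per nucleotide."""
    return list(seq)


def kmer_tokens(seq, k, overlapping=True):
    """k-mer tokenization with stride 1 (overlapping) or stride k."""
    step = 1 if overlapping else k
    # A trailing fragment shorter than k is dropped.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def tokens_per_base(tokens, seq):
    """Number of tokens emitted per input nucleotide."""
    return len(tokens) / len(seq)


seq = "ACGTACGT"
shifted = "G" + seq  # the same sequence after a single-base insertion

original_tokens = kmer_tokens(seq, 4, overlapping=False)
shifted_tokens = kmer_tokens(shifted, 4, overlapping=False)
shared_tokens = set(original_tokens) & set(shifted_tokens)
```

Here the one-base shift leaves the non-overlapping 4-mer tokenizations with no tokens in common, while character tokens and overlapping k-mers are largely preserved. This kind of frame sensitivity is what a measure of how consistently a tokenizer parses homologous sequences has to capture.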
Title: A Comparison of Tokenization Impact in Attention Based and State Space Genomic Language Models. Journal: bioRxiv - Bioinformatics.
Pub Date : 2024-09-17 DOI: 10.1101/2024.09.12.612620
Arthur Comte, Maxence Lalis, Ludivine Brajon, Riccardo Moracci, Nicolas Montagné, Jérémie Topin, Emmanuelle Jacquin Joly, Sébastien Fiorucci
Odorant receptors (ORs) are key components of the insect peripheral olfactory system, making them prime targets for pest control through olfactory disruption. Traditional methods used in chemical ecology to identify OR ligands rely on analyzing compounds present in the insect's environment or on screening molecules with structures similar to known ligands. However, these approaches can be time-consuming and are constrained by the limited chemical space they explore. Recent advances in OR structural understanding, coupled with scientific breakthroughs in protein structure prediction, have facilitated the application of structure-based virtual screening (SBVS) techniques for accelerated ligand discovery. Here, we report the first successful application of SBVS to insect ORs. We developed a unique workflow that combines molecular docking predictions, in vivo validation and behavioral assays to identify new behaviorally active volatiles for non-pheromonal receptors. This work serves as a proof of concept, laying the groundwork for future studies and highlighting the need for improved computational approaches. Finally, we propose a simple model for predicting receptor response spectra based on the hypothesis that the binding pocket properties partially encode this information, as suggested by our results on Spodoptera littoralis ORs.
Title: Accelerating Ligand Discovery for Insect Odorant Receptors. Journal: bioRxiv - Bioinformatics.
Pub Date : 2024-09-17 DOI: 10.1101/2024.08.01.606148
Alper Karagöl, Taner Karagöl
We present the Evolutionary Statistics Toolkit, a user-friendly web-based platform designed for specialized analysis of genetic sequences, which integrates multiple evolutionary statistics. The toolkit focuses on a selection of specialized tools, including a Tajima's D calculator with Site Frequency Spectrum (SFS), Shannon's Entropy (H), alignment re-formatting, HGSV to FASTA conversion, pair-wise frequency analysis, FASTA to SEQRES, RNA 2D structure alignment, a Kyte-Doolittle hydrophilicity plot tool and a kurtosis coefficient calculator. Tajima's D is calculated using the reference formula $D = (\pi - \theta_W)/\sqrt{V_D}$, where $\pi$ is the average number of pairwise differences, $\theta_W$ is Watterson's estimator of $\theta$, and $V_D$ is the variance of $\pi - \theta_W$. Shannon's Entropy is defined as $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of occurrence of each unique character (nucleotide or amino acid) in the sequence. The toolkit facilitates streamlined workflows for early-career researchers in evolutionary biology, genomics, and related fields. Based on a comparison with existing code, we propose that it can also serve as an interactive educational website for beginners in evolutionary statistics. The source code for each tool in the toolkit is available through GitHub links provided on the website. This open-source approach allows users to inspect the code, suggest improvements, or further adapt the tools for their specific usage and research needs. This article describes the functionality and validation of each tool within the platform, along with a comparison with accessible existing statistical utilities. The toolkit is freely accessible at: https://www.alperkaragol.com/toolkit
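The two formulas quoted above can be worked through in a short sketch. This is not the toolkit's own code: it is an independent illustration that estimates the variance term with the standard constants from Tajima (1989), assumes equal-length aligned sequences, and treats every character (including gaps) as an ordinary symbol. The function names are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log2, sqrt


def shannon_entropy(seq):
    """H = -sum(p_i * log2(p_i)) over the distinct characters of seq."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def tajimas_d(seqs):
    """Tajima's D for aligned sequences; None where D is undefined."""
    n = len(seqs)
    # S: number of segregating (polymorphic) alignment columns.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if n < 4 or s == 0:
        return None  # the variance estimate vanishes for n < 4
    # pi: average number of differences over all sequence pairs.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants from Tajima (1989) used to estimate Var(pi - theta_W).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    theta_w = s / a1  # Watterson's estimator
    v_d = e1 * s + e2 * s * (s - 1)
    return (pi - theta_w) / sqrt(v_d)


alignment = ["AAAA", "AAAT", "AATT", "ATTT"]
d_value = tajimas_d(alignment)
entropy_uniform = shannon_entropy("ACGT")
```

For this toy alignment, $\pi = 10/6$ and $\theta_W = 3/(11/6) = 18/11$, so the numerator is $1/33$ and D is slightly positive. A sequence with four equally frequent nucleotides reaches the maximum entropy of 2 bits.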
Title: An Evolutionary Statistics Toolkit for Simplified Sequence Analysis on Web with Client-Side Processing. Journal: bioRxiv - Bioinformatics.
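The two statistics defined in the abstract above can be illustrated with a minimal Python sketch. This is not the toolkit's own code: the function names are hypothetical, and the sketch assumes an ungapped alignment of at least four equal-length sequences and uses Tajima's standard (1989) constants to estimate the variance term in the denominator.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """H = -sum(p_i * log2(p_i)) over the unique characters of one sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    total = len(sequence)
    counts = Counter(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def segregating_sites_and_pi(seqs: list[str]) -> tuple[int, float]:
    """Return (S, pi): segregating sites and mean pairwise differences."""
    n = len(seqs)
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("all sequences must have the same length")
    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    differing_pairs = 0
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Pairs that differ at this site: (n^2 - sum of squared counts) / 2
            differing_pairs += (n * n - sum(c * c for c in counts.values())) // 2
    return seg_sites, differing_pairs / n_pairs


def tajimas_d(seqs: list[str]) -> float:
    """Tajima's D = (pi - theta_W) / sqrt(V_D), assuming no gaps or missing data."""
    n = len(seqs)
    if n < 4:
        raise ValueError("this sketch assumes at least four sequences")
    s, pi = segregating_sites_and_pi(seqs)
    if s == 0:
        raise ValueError("no segregating sites: D is undefined")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    theta_w = s / a1  # Watterson's estimator
    variance = e1 * s + e2 * s * (s - 1)  # estimate of V_D
    return (pi - theta_w) / math.sqrt(variance)
```

For the toy alignment `["AAAA", "TAAA", "ATAA", "AATA"]`, every variable site is a singleton, so `tajimas_d` returns a negative value, the direction usually associated with an excess of rare variants.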
Pub Date: 2024-09-17 DOI: 10.1101/2024.09.11.612418
David Porubsky, Xavi Guitart, DongAhn Yoo, Philip C. Dishuck, William T. Harvey, Evan E. Eichler
Motivation: Highly contiguous (near telomere-to-telomere) genome assemblies of human and nonhuman species can now be generated routinely. Complex structural variation and regions of rapid evolutionary turnover are being discovered for the first time. Efficient and informative visualization tools are therefore needed to evaluate and directly observe structural differences between two or more genomes. Results: We developed SVbyEye, an open-source R package that visualizes and annotates sequence-to-sequence alignments and provides various functions for processing alignments in PAF format. The tool facilitates the characterization of complex structural variants in the context of sequence homology, helping to resolve the mechanisms underlying their formation. Availability and implementation: SVbyEye is available at https://github.com/daewoooo/SVbyEye.
Title: SVbyEye: A visual tool to characterize structural variation among whole-genome assemblies. Journal: bioRxiv - Bioinformatics.
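To make the PAF input mentioned in the abstract concrete, the sketch below parses a single PAF line into its twelve mandatory columns plus optional tags. It is written in Python for illustration only and is not part of SVbyEye, which is an R package; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PafRecord:
    """One alignment from a PAF file (12 mandatory columns plus optional tags)."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    n_matches: int
    block_length: int
    mapq: int
    tags: dict = field(default_factory=dict)

    @property
    def identity(self) -> float:
        """Matching bases divided by alignment block length."""
        return self.n_matches / self.block_length


def parse_paf_line(line: str) -> PafRecord:
    """Parse one tab-separated PAF line; raises ValueError if columns are missing."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    tags = {}
    for raw in fields[12:]:
        # Optional fields follow the SAM-like form KEY:TYPE:VALUE
        key, _type, value = raw.split(":", 2)
        tags[key] = value
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]), tags,
    )
```

A record with strand `-` marks an alignment in reverse orientation, which is the kind of signal a visualization tool can use to highlight candidate inversions.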
RNA velocity has recently emerged as a key tool in the analysis of single-cell transcriptomic data, yet connecting RNA velocity analyses to underlying regulatory processes has proved challenging. Here we propose CRAK-Velo, a semi-mechanistic model that integrates chromatin accessibility data into the estimation of RNA velocities. CRAK-Velo provides biologically consistent estimates of developmental flows and enables accurate cell-type deconvolution, while also shedding light on regulatory processes at the level of interactions between genes and chromatin regions.
Title: CRAK-Velo: Chromatin Accessibility Kinetics integration improves RNA Velocity estimation. Authors: Nour El Kazwini, Mingze Gao, Idris Kouadri Boudjelthia, Fangxin Cai, Yuanhua Huang, Guido Sanguinetti. Pub Date: 2024-09-17 DOI: 10.1101/2024.09.12.612736. Journal: bioRxiv - Bioinformatics.