Clinical classification of pathogenic versus benign genetic variants remains a challenge in clinical genetics. Recently, genomic foundation models have improved generic variant effect prediction (VEP) accuracy via weakly supervised or unsupervised training. However, these VEPs are not disease-specific, limiting their adoption at the point of care. To address this problem, we propose DYNA: disease-specific fine-tuning via a Siamese neural network, broadly applicable to all genomic foundation models, for more effective variant effect prediction in disease-specific contexts. We evaluate DYNA on two distinct disease-relevant tasks. For coding VEP, we focus on various cardiovascular diseases, where gene-disease relationships of loss-of-function vs. gain-of-function dictate disease-specific VEP. For non-coding VEP, we apply DYNA to an essential post-transcriptional regulatory axis, RNA splicing, the most common non-coding pathogenic mechanism in established clinical VEP guidelines. In both cases, DYNA fine-tunes various pre-trained genomic foundation models on small sets of rare variants. The DYNA fine-tuned models show superior performance on the held-out rare variant test set, a result further replicated on large, clinically relevant variant annotations in ClinVar. Thus, DYNA offers a potent disease-specific variant effect prediction method, excelling in intra-gene generalization and generalization to unseen genetic variants, making it particularly valuable for disease association and clinical applicability.
DYNA: Disease-Specific Language Model for Variant Pathogenicity. Huixin Zhan, Zijun Zhang. arXiv:2406.00164 (arXiv - QuanBio - Genomics), published 2024-05-31.
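The Siamese fine-tuning idea described above can be sketched as a contrastive loss over embeddings of a reference sequence and its variant counterpart. This is a minimal illustration, not DYNA's exact objective: the embedding vectors, the margin value, and the label convention (1 = pathogenic, 0 = benign) are all assumptions for the sketch.

```python
import numpy as np

def contrastive_loss(emb_ref, emb_alt, label, margin=1.0):
    """Siamese-style contrastive loss on a (reference, variant) embedding pair.

    label = 1 (pathogenic): the pair should be far apart, at least `margin`.
    label = 0 (benign): the pair should be close together.
    """
    d = np.linalg.norm(emb_ref - emb_alt)   # Euclidean distance between twins
    if label == 1:
        return max(0.0, margin - d) ** 2    # penalize pathogenic pairs that are too close
    return d ** 2                           # penalize benign pairs that drift apart
```

With embeddings produced by any genomic foundation model, minimizing this loss pulls benign reference/variant pairs together and pushes pathogenic pairs apart, which is the core of a Siamese fine-tuning scheme.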
Antibodies are crucial proteins produced by the immune system to eliminate harmful foreign substances, and they have become pivotal therapeutic agents for treating human diseases. To accelerate the discovery of antibody therapeutics, there is growing interest in constructing language models from antibody sequences. However, the applicability of pre-trained language models to antibody discovery has not been thoroughly evaluated, owing to the scarcity of labeled datasets. To overcome this limitation, we introduce AVIDa-SARS-CoV-2, a dataset of interactions between antigens and VHHs (the variable domain of the heavy chain of a heavy-chain antibody), obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and Omicron variants. Furthermore, we release VHHCorpus-2M, a pre-training dataset for antibody language models containing over two million VHH sequences. We report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT, pre-trained on VHHCorpus-2M, alongside existing general protein and antibody-specific pre-trained language models. These results confirm that AVIDa-SARS-CoV-2 provides a valuable benchmark for evaluating the representation capabilities of antibody language models for binding prediction, thereby facilitating the development of AI-driven antibody discovery. The datasets are available at https://datasets.cognanous.com.
A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models. Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura, Akihiro Imura. arXiv:2405.18749 (arXiv - QuanBio - Genomics), published 2024-05-29.
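Binding prediction on top of a pre-trained antibody language model is typically a small classification head over pooled per-residue embeddings. The sketch below is a hedged illustration of that pattern, not VHHBERT's actual head: the pooling choice (mean), weight vector, and bias are assumptions.

```python
import numpy as np

def binding_score(residue_embs, w, b):
    """Score a VHH-antigen pair for binding.

    residue_embs: (seq_len, dim) per-residue embeddings from a language model.
    A mean-pool collapses the sequence, then a logistic head yields a
    binding probability in (0, 1).
    """
    pooled = residue_embs.mean(axis=0)      # (dim,) sequence-level embedding
    logit = pooled @ w + b
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> binding probability
```

In practice `w` and `b` would be trained on the binary binding labels that AVIDa-SARS-CoV-2 provides, with the backbone either frozen or fine-tuned.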
Alexander Rakowski, Remo Monti, Viktoriia Huryn, Marta Lemanczyk, Uwe Ohler, Christoph Lippert
With the development of high-throughput technologies, genomics datasets are rapidly growing in size, including functional genomics data. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at the price of data consistency: they often aggregate results from many studies conducted under varying experimental conditions. While data from large-scale consortia are useful because they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD), an approach that disentangles biologically relevant features from potential technical biases. MFD incorporates target metadata into model training by conditioning the weights of the model's output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection, by connecting latent features to experimental factors, without compromising, and sometimes even improving, performance in downstream tasks such as enhancer prediction and genetic variant discovery. The code for our implementation is available at https://github.com/HealthML/MFD
Metadata-guided Feature Disentanglement for Functional Genomics. Alexander Rakowski, Remo Monti, Viktoriia Huryn, Marta Lemanczyk, Uwe Ohler, Christoph Lippert. arXiv:2405.19057 (arXiv - QuanBio - Genomics), published 2024-05-29.
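The two mechanisms described above can be sketched briefly. In the sketch, the output-layer weights are generated from a track's metadata vector, and the adversarially learned penalty is replaced by a much simpler stand-in, a cross-correlation penalty between the two feature subspaces; the shapes and the penalty form are illustrative assumptions, not MFD's implementation.

```python
import numpy as np

def conditioned_logit(features, meta_onehot, U, b):
    """Output-layer weights generated from the track's metadata encoding.

    U: (dim, n_factors) maps a metadata one-hot to a weight vector, so each
    experimental factor gets its own effective output weights.
    """
    w = U @ meta_onehot          # (dim,) weights conditioned on metadata
    return features @ w + b

def independence_penalty(Z_bio, Z_tech):
    """Penalty on cross-correlation between the 'biological' and 'technical'
    feature subspaces -- a simple stand-in for MFD's adversarial penalty."""
    Zb = Z_bio - Z_bio.mean(axis=0)
    Zt = Z_tech - Z_tech.mean(axis=0)
    C = Zb.T @ Zt / len(Zb)      # (d_bio, d_tech) cross-covariance
    return float((C ** 2).sum()) # zero when the subspaces are uncorrelated
```

Adding `independence_penalty` to the training loss discourages the biological subspace from carrying information about technical factors, which is the intuition behind the disentanglement.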
Ping-Han Hsieh, Ru-Xiu Hsiao, Katalin Ferenc, Anthony Mathelier, Rebekka Burkholz, Chien-Yu Chen, Geir Kjetil Sandve, Tatiana Belova, Marieke Lydia Kuijjer
Paired single-cell sequencing technologies enable the simultaneous measurement of complementary modalities of molecular data at single-cell resolution. Along with the advances in these technologies, many methods based on variational autoencoders have been developed to integrate these data. However, these methods do not explicitly incorporate prior biological relationships between the data modalities, which could significantly enhance modeling and interpretation. We propose a novel probabilistic learning framework that explicitly incorporates conditional independence relationships between multi-modal data as a directed acyclic graph using a generalized hierarchical variational autoencoder. We demonstrate the versatility of our framework across various applications pertinent to single-cell multi-omics data integration. These include the isolation of common and distinct information from different modalities, modality-specific differential analysis, and integrated cell clustering. We anticipate that the proposed framework can facilitate the construction of highly flexible graphical models that can capture the complexities of biological hypotheses and unravel the connections between different biological data types, such as different modalities of paired single-cell multi-omics data. The implementation of the proposed framework can be found in the repository https://github.com/kuijjerlab/CAVACHON.
CAVACHON: a hierarchical variational autoencoder to integrate multi-modal single-cell data. Ping-Han Hsieh, Ru-Xiu Hsiao, Katalin Ferenc, Anthony Mathelier, Rebekka Burkholz, Chien-Yu Chen, Geir Kjetil Sandve, Tatiana Belova, Marieke Lydia Kuijjer. arXiv:2405.18655 (arXiv - QuanBio - Genomics), published 2024-05-28.
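The core generative idea, latent variables arranged along a directed acyclic graph so that each modality's latent is conditioned on its parents, can be sketched as ancestral sampling. Everything here is a toy illustration of that structure, not CAVACHON's model: the node names, conditional-mean functions, and unit-variance Gaussians are assumptions.

```python
import numpy as np

def sample_dag_latents(dag, cond_fns, dim, rng):
    """Ancestral sampling of latent variables over a DAG.

    dag: dict mapping node -> list of parent nodes, iterated in topological
    order (Python dicts preserve insertion order).
    cond_fns: dict mapping node -> function(parent_samples) -> mean vector,
    encoding the conditional independence structure.
    """
    z = {}
    for node, parents in dag.items():
        parent_samples = [z[p] for p in parents]   # parents sampled already
        mu = cond_fns[node](parent_samples)
        z[node] = mu + rng.standard_normal(dim)    # z_node ~ N(mu, I)
    return z
```

For paired single-cell data, one node per modality (plus shared ancestors) lets common information flow through the shared parents while modality-specific latents stay separate.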
Heaps' (or Herdan's) law is a linguistic law describing the relationship between vocabulary or dictionary size (types) and word count (tokens) as a power-law function. Whether it holds in genomes, under a suitable definition of DNA words, is unclear, partly because the dictionary size of a genome can be much smaller than that of a human language. We define a DNA word in a genome as a DNA coding region that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within a limited range. Our definition of words in a genomic or proteomic context differs from that in large language models for DNA or protein sequences, where words are usually short. Although an approximate power-law distribution of protein domain sizes due to gene duplication, and the related Zipf's law, are well known, their translation into Heaps' law for DNA words is not automatic. Several other animal genomes are shown herein to also exhibit range-limited Heaps' law with our definition of DNA words, though with various exponents, partially depending on their level of complexity. Investigating Heaps' law and its exponent value could provide an alternative narrative of the reuse and redundancy of protein domains, as well as the creation of new protein domains, from a linguistic perspective.
Range-Limited Heaps' Law for Functional DNA Words in the Human Genome. Wentian Li, Yannis Almirantis, Astero Provata. arXiv:2405.13825 (arXiv - QuanBio - Genomics), published 2024-05-22.
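Heaps' law states V = k * N^beta for vocabulary size V (types) and text length N (tokens), so the exponent can be estimated by a linear fit in log-log space. The sketch below shows that standard estimation; the specific token/type counts in the usage are synthetic, not data from the study.

```python
import numpy as np

def heaps_exponent(tokens, types):
    """Fit Heaps' law V = k * N**beta by least squares in log-log space.

    tokens: array of cumulative word counts N (here: domain-coding regions).
    types: array of distinct-word counts V (here: distinct protein domains).
    Returns (k, beta).
    """
    logN, logV = np.log(tokens), np.log(types)
    beta, logk = np.polyfit(logN, logV, 1)   # slope = beta, intercept = log k
    return np.exp(logk), beta
```

A beta well below 1 over the fitted range indicates sub-linear vocabulary growth, i.e. increasing reuse of existing protein domains as more coding regions are counted.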
Jiayu Shang, Cheng Peng, Yongxin Ji, Jiaojiao Guan, Dehan Cai, Xubo Tang, Yanni Sun
Motivation: Protein embedding, which represents proteins as numerical vectors, is a crucial step in various learning-based protein annotation/classification problems, including gene ontology prediction, protein-protein interaction prediction, and protein structure prediction. However, existing protein embedding methods are often computationally expensive due to their large number of parameters, which can reach millions or even billions. The growing availability of large-scale protein datasets and the need for efficient analysis tools have created a pressing demand for efficient protein embedding methods. Results: We propose a novel protein embedding approach based on multi-teacher distillation learning, which leverages the knowledge of multiple pre-trained protein embedding models to learn a compact and informative representation of proteins. Our method achieves comparable performance to state-of-the-art methods while significantly reducing computational costs and resource requirements. Specifically, our approach reduces computational time by ~70% and maintains almost the same accuracy as the original large models. This makes our method well-suited for large-scale protein analysis and enables the bioinformatics community to perform protein embedding tasks more efficiently.
Accurate and efficient protein embedding using multi-teacher distillation learning. Jiayu Shang, Cheng Peng, Yongxin Ji, Jiaojiao Guan, Dehan Cai, Xubo Tang, Yanni Sun. arXiv:2405.11735 (arXiv - QuanBio - Genomics), published 2024-05-20.
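A multi-teacher distillation objective of the kind described above can be sketched as the average regression loss between the (projected) student embedding and each teacher's embedding. The per-teacher linear projections and the plain MSE are illustrative assumptions; the paper's actual loss and architecture may differ.

```python
import numpy as np

def distill_loss(student_emb, teacher_embs, projections):
    """Average MSE between the projected student embedding and each teacher.

    teacher_embs: list of embeddings, one per pre-trained teacher model.
    projections: list of matrices mapping the compact student space into
    each teacher's (possibly larger) embedding space.
    """
    losses = []
    for t_emb, P in zip(teacher_embs, projections):
        pred = P @ student_emb                    # student -> teacher space
        losses.append(np.mean((pred - t_emb) ** 2))
    return float(np.mean(losses))                 # consensus across teachers
```

Minimizing this over a protein corpus trains a small student to approximate the combined representation of several large teachers, which is what enables the reported compute savings at near-identical accuracy.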
Ibrahim Al-Hurani, Abedalrhman Alkhateeb, Salama Ikki
Amid relentless efforts to enhance medical diagnostics, the integration of state-of-the-art machine learning methodologies has emerged as a promising research area. In molecular biology, there has been an explosion of data generated from multi-omics sequencing: modern sequencing equipment can provide a large number of complex measurements per experiment. Traditional statistical methods therefore face challenging tasks when dealing with such high-dimensional data. However, most of the information contained in these datasets is redundant or unrelated and can be effectively reduced to significantly fewer variables without losing much information. Dimensionality reduction techniques are mathematical procedures that allow for this reduction; they have largely been developed within the statistics and machine learning disciplines. The other challenge in medical datasets is an imbalanced number of samples across classes, which leads to biased results in machine learning models. This study focuses on tackling these challenges with a neural network that incorporates an autoencoder to extract a latent space of the features and a Generative Adversarial Network (GAN) to generate synthetic samples. The latent space is the reduced-dimensional space that captures the meaningful features of the original data. Our model starts with feature selection to pick out discriminative features before feeding them to the neural network. The model then predicts the cancer outcome for different datasets. The proposed model outperformed other existing models, scoring an accuracy of 95.09% on the bladder cancer dataset and 88.82% on the breast cancer dataset.
An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification. Ibrahim Al-Hurani, Abedalrhman Alkhateeb, Salama Ikki. arXiv:2405.09756 (arXiv - QuanBio - Genomics), published 2024-05-16.
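The class-balancing step can be sketched as follows: determine the minority class, draw noise vectors, and push them through a generator to synthesize minority samples until the classes are even. In the paper the generator is a trained GAN; here it is just any noise-to-sample function passed in, and the deficit-filling policy is an assumption for the sketch.

```python
import numpy as np

def balance_with_generator(X, y, generator, rng):
    """Augment the minority class with synthetic samples from a generator.

    generator: function mapping (n, dim) noise to (n, dim) synthetic samples,
    e.g. a GAN generator trained on the minority class.
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()        # samples needed to even out
    if deficit == 0:
        return X, y
    noise = rng.standard_normal((deficit, X.shape[1]))
    X_syn = generator(noise)                     # synthetic minority samples
    return (np.vstack([X, X_syn]),
            np.concatenate([y, np.full(deficit, minority)]))
```

Training the downstream classifier on the balanced set is what counteracts the bias toward the majority class that the abstract describes.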
Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, Cheng Tan, Jiangbin Zheng, Yufei Huang, Stan Z. Li
Similar to natural language models, pre-trained genome language models have been proposed to capture the underlying intricacies within genomes via unsupervised sequence modeling, and they have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging a vector-quantized codebook as a learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To push its limits further, we propose Hierarchical Residual Quantization (HRQ), in which codebooks of varying scales are arranged in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared to existing genome language models. Notably, empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of the learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling. Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, Cheng Tan, Jiangbin Zheng, Yufei Huang, Stan Z. Li. arXiv:2405.10812 (arXiv - QuanBio - Genomics), published 2024-05-13.
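The core vector-quantization step, assigning each position's continuous embedding to its nearest learnable codebook vector, can be sketched in a few lines. This shows plain nearest-neighbor quantization only; VQDNA's end-to-end training and the hierarchical residual codebooks of HRQ are not reproduced here.

```python
import numpy as np

def vq_tokenize(embs, codebook):
    """Quantize per-position embeddings against a learnable codebook.

    embs: (seq_len, dim) continuous embeddings of a genome sequence.
    codebook: (vocab_size, dim) learnable code vectors.
    Returns discrete token ids and the quantized embeddings.
    """
    # squared Euclidean distance of every position to every code vector
    d = ((embs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d.argmin(axis=1)          # nearest code = discrete token id
    return ids, codebook[ids]       # ids and their quantized vectors
```

During training the codebook entries are updated (via straight-through gradients or EMA in typical VQ models), so the discrete vocabulary itself adapts to the genomic patterns, which is the "learnable vocabulary" idea.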
Background: Two strains of the endoparasitoid Cotesia typhae show differential parasitism success on their host, Sesamia nonagrioides. One is virulent on both permissive and resistant host populations; the other only on the permissive host. This interaction provides a very interesting framework for studying virulence factors. Here, we used a combination of comparative transcriptomic and proteomic analyses to unravel the molecular basis underlying the virulence differences between the strains.

Results: First, we report that virulence genes are mostly expressed during the nymphal stage of the parasitoid. In particular, proviral genes are broadly up-regulated at this stage, even though their expression is expected only in the host. Parasitoid gene expression in the host increases with time, indicating the production of more virulence factors. Second, comparison between strains reveals differences in venom composition, with 12 proteins showing differential abundance. Proviral expression in the host displays strong temporal variability, along with differential patterns between strains. Notably, a subset of proviral genes, including protein-tyrosine phosphatases, is specifically over-expressed in the resistant host parasitized by the less virulent strain 24 hours after parasitism. This result particularly hints at host modulation of proviral expression.

Conclusions: This study sheds light on the temporal expression of the virulence factors of Cotesia typhae, both in the host and in the parasitoid. It also identifies potential molecular candidates driving the differences in parasitism success between the two strains. Together, these findings provide a path for further exploration of virulence mechanisms in parasitoid wasps and offer insights into host-parasitoid coevolution.
{"title":"Characterizing virulence differences in a parasitoid wasp through comparative transcriptomic and proteomic","authors":"Samuel Gornard (EGCE), Pascaline Venon, Florian Lasfont, Thierry Balliau, Laure Marie-Paule Kaiser-Arnauld, Florence Mougel","doi":"arxiv-2405.07772","DOIUrl":"https://doi.org/arxiv-2405.07772","journal":"arXiv - QuanBio - Genomics","publicationDate":"2024-05-13"}
Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, Stephen Young
Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants, but they still lag behind experimental accuracy. Here, we present a novel fine-tuning approach that improves the performance of PLMs using experimental maps of variant effects from Deep Mutational Scanning (DMS) assays and a Normalised Log-odds Ratio (NLR) head. We find consistent improvements on a held-out protein test set, and on independent DMS and clinical variant annotation benchmarks from ProteinGym and ClinVar. These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.
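The abstract does not spell out the NLR head, but it builds on a standard way of scoring variants with a masked protein language model: the log-odds ratio between the mutant and wild-type amino acid probabilities at the mutated position, followed by a normalisation so scores from different assays are comparable. Below is a minimal NumPy sketch of that general idea, not the paper's implementation; the function names, the z-score normalisation, and the toy logits are all illustrative assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def log_odds_scores(logits, wt_aa, mut_aas):
    """Score substitutions at one position from a language model's logits.

    logits: shape-(20,) array of unnormalised scores over amino acids at
    the mutated position. Returns log p(mutant) - log p(wild-type) for
    each candidate mutant amino acid.
    """
    log_probs = logits - np.logaddexp.reduce(logits)  # log-softmax
    wt = log_probs[AA_INDEX[wt_aa]]
    return np.array([log_probs[AA_INDEX[m]] - wt for m in mut_aas])

def normalise(scores):
    """Z-score scores within one assay so that variant effects from
    different proteins/assays live on a comparable scale (illustrative
    stand-in for the paper's normalisation)."""
    return (scores - scores.mean()) / (scores.std() + 1e-8)

# Toy example: logits that strongly favour the wild-type residue 'A',
# so every substitution gets a negative (deleterious-leaning) log-odds.
rng = np.random.default_rng(0)
logits = rng.normal(size=20)
logits[AA_INDEX["A"]] += 3.0
raw = log_odds_scores(logits, wt_aa="A", mut_aas=["C", "D", "W"])
print(normalise(raw))
```

In the supervised setting the abstract describes, scores like these would be trained against the measured DMS effect values rather than used zero-shot.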
{"title":"Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction","authors":"Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, Stephen Young","doi":"arxiv-2405.06729","DOIUrl":"https://doi.org/arxiv-2405.06729","journal":"arXiv - QuanBio - Genomics","publicationDate":"2024-05-10"}