Clinical classification of pathogenic versus benign genetic variants remains a challenge in clinical genetics. Recently, genomic foundation models have improved generic variant effect prediction (VEP) accuracy via weakly supervised or unsupervised training. However, these VEPs are not disease-specific, limiting their adoption at the point of care. To address this problem, we propose DYNA: disease-specific fine-tuning via a Siamese neural network, broadly applicable to all genomic foundation models, for more effective variant effect prediction in disease-specific contexts. We evaluate DYNA on two distinct disease-relevant tasks. For coding VEP, we focus on various cardiovascular diseases, where gene-disease relationships of loss-of-function vs. gain-of-function dictate disease-specific VEP. For non-coding VEP, we apply DYNA to an essential post-transcriptional regulatory axis, RNA splicing, the most common non-coding pathogenic mechanism in established clinical VEP guidelines. In both cases, DYNA fine-tunes various pre-trained genomic foundation models on small sets of rare variants. The DYNA fine-tuned models show superior performance on the held-out rare variant test set, a result further replicated on large, clinically relevant variant annotations in ClinVar. Thus, DYNA offers a potent disease-specific variant effect prediction method, excelling in intra-gene generalization and generalization to unseen genetic variants, making it particularly valuable for disease association and clinical applicability.
DYNA: Disease-Specific Language Model for Variant Pathogenicity. Huixin Zhan, Zijun Zhang. arXiv:2406.00164 (arXiv - QuanBio - Genomics), published 2024-05-31.
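The Siamese fine-tuning idea described above can be sketched as a contrastive loss over embeddings of a reference sequence and its variant counterpart. This is a minimal illustration, not DYNA's exact objective: the embedding vectors, the margin value, and the label convention (1 = pathogenic, 0 = benign) are all assumptions for the sketch.

```python
import numpy as np

def contrastive_loss(emb_ref, emb_alt, label, margin=1.0):
    """Siamese-style contrastive loss on a (reference, variant) embedding pair.

    label = 1 (pathogenic): the pair should be far apart, at least `margin`.
    label = 0 (benign): the pair should be close together.
    """
    d = np.linalg.norm(emb_ref - emb_alt)   # Euclidean distance between twins
    if label == 1:
        return max(0.0, margin - d) ** 2    # penalize pathogenic pairs that are too close
    return d ** 2                           # penalize benign pairs that drift apart
```

With embeddings produced by any genomic foundation model, minimizing this loss pulls benign reference/variant pairs together and pushes pathogenic pairs apart, which is the core of a Siamese fine-tuning scheme.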
Antibodies are crucial proteins produced by the immune system to eliminate harmful foreign substances, and they have become pivotal therapeutic agents for treating human diseases. To accelerate the discovery of antibody therapeutics, there is growing interest in constructing language models from antibody sequences. However, the applicability of pre-trained language models to antibody discovery has not been thoroughly evaluated, owing to the scarcity of labeled datasets. To overcome this limitation, we introduce AVIDa-SARS-CoV-2, a dataset of interactions between antigens and VHHs (the variable domain of the heavy chain of a heavy-chain antibody), obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and Omicron variants. Furthermore, we release VHHCorpus-2M, a pre-training dataset for antibody language models containing over two million VHH sequences. We report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT, pre-trained on VHHCorpus-2M, alongside existing general protein and antibody-specific pre-trained language models. These results confirm that AVIDa-SARS-CoV-2 provides a valuable benchmark for evaluating the representation capabilities of antibody language models for binding prediction, thereby facilitating the development of AI-driven antibody discovery. The datasets are available at https://datasets.cognanous.com.
A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models. Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura, Akihiro Imura. arXiv:2405.18749 (arXiv - QuanBio - Genomics), published 2024-05-29.
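Binding prediction on top of a pre-trained antibody language model is typically a small classification head over pooled per-residue embeddings. The sketch below is a hedged illustration of that pattern, not VHHBERT's actual head: the pooling choice (mean), weight vector, and bias are assumptions.

```python
import numpy as np

def binding_score(residue_embs, w, b):
    """Score a VHH-antigen pair for binding.

    residue_embs: (seq_len, dim) per-residue embeddings from a language model.
    A mean-pool collapses the sequence, then a logistic head yields a
    binding probability in (0, 1).
    """
    pooled = residue_embs.mean(axis=0)      # (dim,) sequence-level embedding
    logit = pooled @ w + b
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> binding probability
```

In practice `w` and `b` would be trained on the binary binding labels that AVIDa-SARS-CoV-2 provides, with the backbone either frozen or fine-tuned.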
Alexander Rakowski, Remo Monti, Viktoriia Huryn, Marta Lemanczyk, Uwe Ohler, Christoph Lippert
With the development of high-throughput technologies, genomics datasets are rapidly growing in size, including functional genomics data. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at the price of data consistency: they often aggregate results from many studies conducted under varying experimental conditions. While data from large-scale consortia are useful because they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD), an approach that disentangles biologically relevant features from potential technical biases. MFD incorporates target metadata into model training by conditioning the weights of the model's output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection, by connecting latent features to experimental factors, without compromising, and sometimes even improving, performance in downstream tasks such as enhancer prediction and genetic variant discovery. The code for our implementation is available at https://github.com/HealthML/MFD
Metadata-guided Feature Disentanglement for Functional Genomics. Alexander Rakowski, Remo Monti, Viktoriia Huryn, Marta Lemanczyk, Uwe Ohler, Christoph Lippert. arXiv:2405.19057 (arXiv - QuanBio - Genomics), published 2024-05-29.
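The two mechanisms described above can be sketched briefly. In the sketch, the output-layer weights are generated from a track's metadata vector, and the adversarially learned penalty is replaced by a much simpler stand-in, a cross-correlation penalty between the two feature subspaces; the shapes and the penalty form are illustrative assumptions, not MFD's implementation.

```python
import numpy as np

def conditioned_logit(features, meta_onehot, U, b):
    """Output-layer weights generated from the track's metadata encoding.

    U: (dim, n_factors) maps a metadata one-hot to a weight vector, so each
    experimental factor gets its own effective output weights.
    """
    w = U @ meta_onehot          # (dim,) weights conditioned on metadata
    return features @ w + b

def independence_penalty(Z_bio, Z_tech):
    """Penalty on cross-correlation between the 'biological' and 'technical'
    feature subspaces -- a simple stand-in for MFD's adversarial penalty."""
    Zb = Z_bio - Z_bio.mean(axis=0)
    Zt = Z_tech - Z_tech.mean(axis=0)
    C = Zb.T @ Zt / len(Zb)      # (d_bio, d_tech) cross-covariance
    return float((C ** 2).sum()) # zero when the subspaces are uncorrelated
```

Adding `independence_penalty` to the training loss discourages the biological subspace from carrying information about technical factors, which is the intuition behind the disentanglement.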
Ping-Han Hsieh, Ru-Xiu Hsiao, Katalin Ferenc, Anthony Mathelier, Rebekka Burkholz, Chien-Yu Chen, Geir Kjetil Sandve, Tatiana Belova, Marieke Lydia Kuijjer
Paired single-cell sequencing technologies enable the simultaneous measurement of complementary modalities of molecular data at single-cell resolution. Along with the advances in these technologies, many methods based on variational autoencoders have been developed to integrate these data. However, these methods do not explicitly incorporate prior biological relationships between the data modalities, which could significantly enhance modeling and interpretation. We propose a novel probabilistic learning framework that explicitly incorporates conditional independence relationships between multi-modal data as a directed acyclic graph using a generalized hierarchical variational autoencoder. We demonstrate the versatility of our framework across various applications pertinent to single-cell multi-omics data integration. These include the isolation of common and distinct information from different modalities, modality-specific differential analysis, and integrated cell clustering. We anticipate that the proposed framework can facilitate the construction of highly flexible graphical models that can capture the complexities of biological hypotheses and unravel the connections between different biological data types, such as different modalities of paired single-cell multi-omics data. The implementation of the proposed framework can be found in the repository https://github.com/kuijjerlab/CAVACHON.
CAVACHON: a hierarchical variational autoencoder to integrate multi-modal single-cell data. Ping-Han Hsieh, Ru-Xiu Hsiao, Katalin Ferenc, Anthony Mathelier, Rebekka Burkholz, Chien-Yu Chen, Geir Kjetil Sandve, Tatiana Belova, Marieke Lydia Kuijjer. arXiv:2405.18655 (arXiv - QuanBio - Genomics), published 2024-05-28.
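The core generative idea, latent variables arranged along a directed acyclic graph so that each modality's latent is conditioned on its parents, can be sketched as ancestral sampling. Everything here is a toy illustration of that structure, not CAVACHON's model: the node names, conditional-mean functions, and unit-variance Gaussians are assumptions.

```python
import numpy as np

def sample_dag_latents(dag, cond_fns, dim, rng):
    """Ancestral sampling of latent variables over a DAG.

    dag: dict mapping node -> list of parent nodes, iterated in topological
    order (Python dicts preserve insertion order).
    cond_fns: dict mapping node -> function(parent_samples) -> mean vector,
    encoding the conditional independence structure.
    """
    z = {}
    for node, parents in dag.items():
        parent_samples = [z[p] for p in parents]   # parents sampled already
        mu = cond_fns[node](parent_samples)
        z[node] = mu + rng.standard_normal(dim)    # z_node ~ N(mu, I)
    return z
```

For paired single-cell data, one node per modality (plus shared ancestors) lets common information flow through the shared parents while modality-specific latents stay separate.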
Heaps' (or Herdan's) law is a linguistic law describing the relationship between vocabulary or dictionary size (types) and word count (tokens) as a power-law function. Whether it holds in genomes, under a suitable definition of DNA words, is unclear, partly because the dictionary size of a genome can be much smaller than that of a human language. We define a DNA word in a genome as a DNA coding region that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within a limited range. Our definition of words in a genomic or proteomic context differs from that in large language models for DNA or protein sequences, where words are usually short. Although an approximate power-law distribution of protein domain sizes due to gene duplication, and the related Zipf's law, are well known, their translation into Heaps' law for DNA words is not automatic. Several other animal genomes are shown herein to also exhibit range-limited Heaps' law with our definition of DNA words, though with various exponents, partially depending on their level of complexity. Investigating Heaps' law and its exponent value could provide an alternative narrative of the reuse and redundancy of protein domains, as well as the creation of new protein domains, from a linguistic perspective.
Range-Limited Heaps' Law for Functional DNA Words in the Human Genome. Wentian Li, Yannis Almirantis, Astero Provata. arXiv:2405.13825 (arXiv - QuanBio - Genomics), published 2024-05-22.
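Heaps' law states V = k * N^beta for vocabulary size V (types) and text length N (tokens), so the exponent can be estimated by a linear fit in log-log space. The sketch below shows that standard estimation; the specific token/type counts in the usage are synthetic, not data from the study.

```python
import numpy as np

def heaps_exponent(tokens, types):
    """Fit Heaps' law V = k * N**beta by least squares in log-log space.

    tokens: array of cumulative word counts N (here: domain-coding regions).
    types: array of distinct-word counts V (here: distinct protein domains).
    Returns (k, beta).
    """
    logN, logV = np.log(tokens), np.log(types)
    beta, logk = np.polyfit(logN, logV, 1)   # slope = beta, intercept = log k
    return np.exp(logk), beta
```

A beta well below 1 over the fitted range indicates sub-linear vocabulary growth, i.e. increasing reuse of existing protein domains as more coding regions are counted.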
Jiayu Shang, Cheng Peng, Yongxin Ji, Jiaojiao Guan, Dehan Cai, Xubo Tang, Yanni Sun
Motivation: Protein embedding, which represents proteins as numerical vectors, is a crucial step in various learning-based protein annotation/classification problems, including gene ontology prediction, protein-protein interaction prediction, and protein structure prediction. However, existing protein embedding methods are often computationally expensive due to their large number of parameters, which can reach millions or even billions. The growing availability of large-scale protein datasets and the need for efficient analysis tools have created a pressing demand for efficient protein embedding methods. Results: We propose a novel protein embedding approach based on multi-teacher distillation learning, which leverages the knowledge of multiple pre-trained protein embedding models to learn a compact and informative representation of proteins. Our method achieves comparable performance to state-of-the-art methods while significantly reducing computational costs and resource requirements. Specifically, our approach reduces computational time by ~70% and maintains almost the same accuracy as the original large models. This makes our method well-suited for large-scale protein analysis and enables the bioinformatics community to perform protein embedding tasks more efficiently.
Accurate and efficient protein embedding using multi-teacher distillation learning. Jiayu Shang, Cheng Peng, Yongxin Ji, Jiaojiao Guan, Dehan Cai, Xubo Tang, Yanni Sun. arXiv:2405.11735 (arXiv - QuanBio - Genomics), published 2024-05-20.
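A multi-teacher distillation objective of the kind described above can be sketched as the average regression loss between the (projected) student embedding and each teacher's embedding. The per-teacher linear projections and the plain MSE are illustrative assumptions; the paper's actual loss and architecture may differ.

```python
import numpy as np

def distill_loss(student_emb, teacher_embs, projections):
    """Average MSE between the projected student embedding and each teacher.

    teacher_embs: list of embeddings, one per pre-trained teacher model.
    projections: list of matrices mapping the compact student space into
    each teacher's (possibly larger) embedding space.
    """
    losses = []
    for t_emb, P in zip(teacher_embs, projections):
        pred = P @ student_emb                    # student -> teacher space
        losses.append(np.mean((pred - t_emb) ** 2))
    return float(np.mean(losses))                 # consensus across teachers
```

Minimizing this over a protein corpus trains a small student to approximate the combined representation of several large teachers, which is what enables the reported compute savings at near-identical accuracy.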
Ibrahim Al-Hurani, Abedalrhman Alkhateeb, Salama Ikki
Amid relentless efforts to enhance medical diagnostics, the integration of state-of-the-art machine learning methodologies has emerged as a promising research area. In molecular biology, there has been an explosion of data generated from multi-omics sequencing: modern sequencing equipment can provide a large number of complex measurements per experiment. Traditional statistical methods therefore face challenging tasks when dealing with such high-dimensional data. However, most of the information contained in these datasets is redundant or unrelated and can be effectively reduced to significantly fewer variables without losing much information. Dimensionality reduction techniques are mathematical procedures that allow for this reduction; they have largely been developed within the statistics and machine learning disciplines. The other challenge in medical datasets is an imbalanced number of samples across classes, which leads to biased results in machine learning models. This study focuses on tackling these challenges with a neural network that incorporates an autoencoder to extract a latent space of the features and a Generative Adversarial Network (GAN) to generate synthetic samples. The latent space is the reduced-dimensional space that captures the meaningful features of the original data. Our model starts with feature selection to pick out discriminative features before feeding them to the neural network. The model then predicts the cancer outcome for different datasets. The proposed model outperformed other existing models, scoring an accuracy of 95.09% on the bladder cancer dataset and 88.82% on the breast cancer dataset.
An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification. Ibrahim Al-Hurani, Abedalrhman Alkhateeb, Salama Ikki. arXiv:2405.09756 (arXiv - QuanBio - Genomics), published 2024-05-16.
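The class-balancing step can be sketched as follows: determine the minority class, draw noise vectors, and push them through a generator to synthesize minority samples until the classes are even. In the paper the generator is a trained GAN; here it is just any noise-to-sample function passed in, and the deficit-filling policy is an assumption for the sketch.

```python
import numpy as np

def balance_with_generator(X, y, generator, rng):
    """Augment the minority class with synthetic samples from a generator.

    generator: function mapping (n, dim) noise to (n, dim) synthetic samples,
    e.g. a GAN generator trained on the minority class.
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()        # samples needed to even out
    if deficit == 0:
        return X, y
    noise = rng.standard_normal((deficit, X.shape[1]))
    X_syn = generator(noise)                     # synthetic minority samples
    return (np.vstack([X, X_syn]),
            np.concatenate([y, np.full(deficit, minority)]))
```

Training the downstream classifier on the balanced set is what counteracts the bias toward the majority class that the abstract describes.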
Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, Cheng Tan, Jiangbin Zheng, Yufei Huang, Stan Z. Li
Similar to natural language models, pre-trained genome language models have been proposed to capture the underlying intricacies within genomes via unsupervised sequence modeling, and they have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging a vector-quantized codebook as a learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To push its limits further, we propose Hierarchical Residual Quantization (HRQ), in which codebooks of varying scales are arranged in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared to existing genome language models. Notably, empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of the learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling. Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, Cheng Tan, Jiangbin Zheng, Yufei Huang, Stan Z. Li. arXiv:2405.10812 (arXiv - QuanBio - Genomics), published 2024-05-13.
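The core vector-quantization step, assigning each position's continuous embedding to its nearest learnable codebook vector, can be sketched in a few lines. This shows plain nearest-neighbor quantization only; VQDNA's end-to-end training and the hierarchical residual codebooks of HRQ are not reproduced here.

```python
import numpy as np

def vq_tokenize(embs, codebook):
    """Quantize per-position embeddings against a learnable codebook.

    embs: (seq_len, dim) continuous embeddings of a genome sequence.
    codebook: (vocab_size, dim) learnable code vectors.
    Returns discrete token ids and the quantized embeddings.
    """
    # squared Euclidean distance of every position to every code vector
    d = ((embs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d.argmin(axis=1)          # nearest code = discrete token id
    return ids, codebook[ids]       # ids and their quantized vectors
```

During training the codebook entries are updated (via straight-through gradients or EMA in typical VQ models), so the discrete vocabulary itself adapts to the genomic patterns, which is the "learnable vocabulary" idea.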
Background: Two strains of the endoparasitoid Cotesia typhae show differential parasitism success on their host, Sesamia nonagrioides. One is virulent on both permissive and resistant host populations; the other only on the permissive host. This interaction provides a very interesting framework for studying virulence factors. Here, we used a combination of comparative transcriptomic and proteomic analyses to unravel the molecular basis underlying the virulence differences between the strains.

Results: First, we report that virulence genes are mostly expressed during the nymphal stage of the parasitoid. In particular, proviral genes are broadly up-regulated at this stage, even though their expression is expected only in the host. Parasitoid gene expression in the host increases with time, indicating the production of more virulence factors. Second, comparison between strains reveals differences in venom composition, with 12 proteins showing differential abundance. Proviral expression in the host displays strong temporal variability, along with differential patterns between strains. Notably, a subset of proviral genes, including protein-tyrosine phosphatases, is specifically over-expressed in the resistant host parasitized by the less virulent strain 24 hours after parasitism. This result particularly hints at host modulation of proviral expression.

Conclusions: This study sheds light on the temporal expression of the virulence factors of Cotesia typhae, both in the host and in the parasitoid. It also identifies potential molecular candidates driving the differences in parasitism success between the two strains. Together, these findings provide a path for further exploration of virulence mechanisms in parasitoid wasps and offer insights into host-parasitoid coevolution.
{"title":"Characterizing virulence differences in a parasitoid wasp through comparative transcriptomic and proteomic","authors":"Samuel Gornard (EGCE), Pascaline Venon, Florian Lasfont, Thierry Balliau, Laure Marie-Paule Kaiser-Arnauld, Florence Mougel","doi":"arxiv-2405.07772","DOIUrl":"https://doi.org/arxiv-2405.07772","journal":"arXiv - QuanBio - Genomics","publicationDate":"2024-05-13"}
Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, Stephen Young
Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants, but they still lag behind experimental accuracy. Here, we present a novel fine-tuning approach that improves the performance of PLMs using experimental maps of variant effects from Deep Mutational Scanning (DMS) assays and a Normalised Log-odds Ratio (NLR) head. We find consistent improvements on a held-out protein test set, and on independent DMS and clinical variant annotation benchmarks from ProteinGym and ClinVar. These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.
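The abstract does not spell out the NLR head, but it builds on a standard way of scoring variants with a masked protein language model: the log-odds ratio between the mutant and wild-type amino acid probabilities at the mutated position, followed by a normalisation so scores from different assays are comparable. Below is a minimal NumPy sketch of that general idea, not the paper's implementation; the function names, the z-score normalisation, and the toy logits are all illustrative assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def log_odds_scores(logits, wt_aa, mut_aas):
    """Score substitutions at one position from a language model's logits.

    logits: shape-(20,) array of unnormalised scores over amino acids at
    the mutated position. Returns log p(mutant) - log p(wild-type) for
    each candidate mutant amino acid.
    """
    log_probs = logits - np.logaddexp.reduce(logits)  # log-softmax
    wt = log_probs[AA_INDEX[wt_aa]]
    return np.array([log_probs[AA_INDEX[m]] - wt for m in mut_aas])

def normalise(scores):
    """Z-score scores within one assay so that variant effects from
    different proteins/assays live on a comparable scale (illustrative
    stand-in for the paper's normalisation)."""
    return (scores - scores.mean()) / (scores.std() + 1e-8)

# Toy example: logits that strongly favour the wild-type residue 'A',
# so every substitution gets a negative (deleterious-leaning) log-odds.
rng = np.random.default_rng(0)
logits = rng.normal(size=20)
logits[AA_INDEX["A"]] += 3.0
raw = log_odds_scores(logits, wt_aa="A", mut_aas=["C", "D", "W"])
print(normalise(raw))
```

In the supervised setting the abstract describes, scores like these would be trained against the measured DMS effect values rather than used zero-shot.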
{"title":"Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction","authors":"Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, Stephen Young","doi":"arxiv-2405.06729","DOIUrl":"https://doi.org/arxiv-2405.06729","journal":"arXiv - QuanBio - Genomics","publicationDate":"2024-05-10"}