A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

arXiv - QuanBio - Genomics | Pub Date: 2024-05-29 | DOI: arxiv-2405.18749

Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura, Akihiro Imura


Citations: 0

Abstract

Antibodies are crucial proteins produced by the immune system to eliminate harmful foreign substances and have become pivotal therapeutic agents for treating human diseases. To accelerate the discovery of antibody therapeutics, there is growing interest in constructing language models using antibody sequences. However, the applicability of pre-trained language models for antibody discovery has not been thoroughly evaluated due to the scarcity of labeled datasets. To overcome these limitations, we introduce AVIDa-SARS-CoV-2, a dataset featuring the antigen-variable domain of heavy chain of heavy chain antibody (VHH) interactions obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and Omicron variants. Furthermore, we release VHHCorpus-2M, a pre-training dataset for antibody language models, containing over two million VHH sequences. We report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT pre-trained on VHHCorpus-2M and existing general protein and antibody-specific pre-trained language models. These results confirm that AVIDa-SARS-CoV-2 provides valuable benchmarks for evaluating the representation capabilities of antibody language models for binding prediction, thereby facilitating the development of AI-driven antibody discovery. The datasets are available at https://datasets.cognanous.com.
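As a rough illustration of what binary binding labels over VHH-antigen pairs could look like in practice, the sketch below models one labeled interaction and tallies the binder fraction per antigen. The field names (`vhh_sequence`, `antigen`, `binds`), the helper `binder_fraction`, and the toy records are all assumptions made for illustration; they are not the published schema or data of AVIDa-SARS-CoV-2.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One labeled VHH-antigen pair (hypothetical field names)."""
    vhh_sequence: str  # amino-acid sequence of the VHH (toy placeholder below)
    antigen: str       # name of the SARS-CoV-2 variant tested
    binds: bool        # binary label: True = binder, False = non-binder


def binder_fraction(records):
    """Return {antigen: fraction of its records labeled as binders}."""
    total = Counter(r.antigen for r in records)
    binders = Counter(r.antigen for r in records if r.binds)
    # Counter returns 0 for antigens that have no binders.
    return {antigen: binders[antigen] / n for antigen, n in total.items()}


# Made-up toy records, not taken from AVIDa-SARS-CoV-2.
toy = [
    Interaction("QVQLVESGG", "Delta", True),
    Interaction("EVQLQESGG", "Delta", False),
    Interaction("QVQLVESGG", "Omicron", False),
    Interaction("EVQLQESGG", "Omicron", False),
]

print(binder_fraction(toy))
```

Checking label balance per antigen like this is a common first step before benchmarking a binary classifier, since a heavily skewed split affects which metrics are informative.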
