A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models
Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura, Akihiro Imura
arXiv - QuanBio - Genomics, 2024-05-29, arXiv:2405.18749
Antibodies are crucial proteins produced by the immune system to eliminate
harmful foreign substances and have become pivotal therapeutic agents for
treating human diseases. To accelerate the discovery of antibody therapeutics,
there is growing interest in constructing language models using antibody
sequences. However, the applicability of pre-trained language models for
antibody discovery has not been thoroughly evaluated due to the scarcity of
labeled datasets. To overcome this limitation, we introduce AVIDa-SARS-CoV-2,
a dataset of interactions between antigens and the variable domain of the heavy
chain of heavy-chain antibodies (VHHs), obtained from two alpacas immunized with
severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins.
AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding
of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and
Omicron variants. Furthermore, we release VHHCorpus-2M, a pre-training dataset
for antibody language models, containing over two million VHH sequences. We
report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT
pre-trained on VHHCorpus-2M and existing general protein and antibody-specific
pre-trained language models. These results confirm that AVIDa-SARS-CoV-2
provides valuable benchmarks for evaluating the representation capabilities of
antibody language models for binding prediction, thereby facilitating the
development of AI-driven antibody discovery. The datasets are available at
https://datasets.cognanous.com.
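
To make the binary-label structure described above concrete, the following is a minimal Python sketch of how such interaction records might be represented and summarized per antigen. The field names (`vhh_sequence`, `antigen`, `label`), the antigen names, and the toy sequences are illustrative assumptions, not the actual schema or contents of AVIDa-SARS-CoV-2.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One antigen-VHH pair with a binary binding label (hypothetical layout)."""
    vhh_sequence: str  # amino-acid sequence of the VHH (toy strings below)
    antigen: str       # name of the spike protein variant (assumed naming)
    label: int         # 1 = binding, 0 = non-binding


def binder_fraction_by_antigen(records):
    """Return {antigen: fraction of its records labeled as binders}."""
    counts = defaultdict(lambda: [0, 0])  # antigen -> [binders, total]
    for record in records:
        counts[record.antigen][0] += record.label
        counts[record.antigen][1] += 1
    return {antigen: binders / total for antigen, (binders, total) in counts.items()}


# Toy records for illustration only; these are not real dataset entries.
toy_records = [
    Interaction("QVQLVESGGG", "Delta", 1),
    Interaction("EVQLVESGGA", "Delta", 0),
    Interaction("QVQLQESGGG", "Omicron", 0),
    Interaction("QVQLVESGGG", "Omicron", 0),
]

fractions = binder_fraction_by_antigen(toy_records)
```

A per-antigen summary of this kind is one simple way to check label balance before training a binding classifier on top of a pre-trained language model.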