Learning the Protein Language Model of SARS-CoV-2 Spike Proteins

Paul Vincent Llanes, Geoffrey A. Solano, Marc Jermaine Pontiveros
Published in: 2023 International Conference on Artificial Intelligence in Information and Communication (ICAIIC)
Publication date: 2023-02-20
DOI: 10.1109/ICAIIC57133.2023.10067040 (https://doi.org/10.1109/ICAIIC57133.2023.10067040)

Abstract

The SARS-CoV-2 virus has continued to evolve, posing an increased risk in terms of infectivity and transmissibility and causing a greater impact on communities worldwide. With the surge of collected SARS-CoV-2 sequences, studies have found that most of the emerging variants are linked to increased mutations in the spike (S) protein, as observed in the Alpha, Beta, Gamma, and Delta variants. Multiple approaches to genomic surveillance have been used to monitor the mutational status and spread of the virus; however, most depend heavily on labels attributed to these sequences. Hence, this study features a system that learns the protein language model of SARS-CoV-2 spike proteins, based on a bidirectional long short-term memory (BiLSTM) recurrent neural network, using sequence data alone. After the sequence embeddings are obtained from the model, clusters are generated using the Leiden clustering algorithm and visualized to monitor similarities between variants in terms of grammatical probability and semantic change. Additionally, the system measures the validity of a user-generated next-generation sequence, capturing potential sequence mutations indicative of viral escape, particularly substitution mutations. Further studies on methods for uncovering the semantic rules that govern spike proteins are recommended to learn more about other viral characteristics that bear on the future course of the COVID-19 pandemic.
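To make the two quantities named in the abstract concrete, the following is a minimal sketch of how a single substitution might be scored for grammatical probability and semantic change. It is not the authors' implementation: the per-position frequency table stands in for the BiLSTM's output distribution, the amino-acid composition vector stands in for the learned sequence embedding, and the names `position_probs`, `embed`, and `score_substitution` are hypothetical. Using the L1 distance between embeddings as the semantic-change measure is also an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_probs(sequences, pseudocount=1.0):
    """Per-position residue probabilities from equal-length sequences.

    Assumption: a smoothed frequency table is a toy stand-in for the
    probability a trained language model would assign to each residue.
    """
    length = len(sequences[0])
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    probs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        probs.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return probs


def embed(seq):
    """Toy embedding: amino-acid composition of the sequence.

    Assumption: a real system would use hidden states of the trained
    model here; composition is only a placeholder.
    """
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def score_substitution(wild_type, pos, new_aa, probs):
    """Return (grammaticality, semantic_change) for one substitution."""
    mutant = wild_type[:pos] + new_aa + wild_type[pos + 1:]
    # Grammaticality: probability of the new residue at this position.
    grammaticality = probs[pos][new_aa]
    # Semantic change: L1 distance between wild-type and mutant embeddings.
    semantic_change = sum(
        abs(a - b) for a, b in zip(embed(wild_type), embed(mutant))
    )
    return grammaticality, semantic_change


# Hypothetical four-residue toy sequences, not real spike data.
toy_sequences = ["MFVF", "MFVF", "MFLF"]
toy_probs = position_probs(toy_sequences)
grammaticality, semantic_change = score_substitution("MFVF", 2, "L", toy_probs)
```

In this framing, a substitution that keeps a high grammatical probability while producing a large semantic change would be the kind of candidate the abstract describes as indicative of viral escape. The composition embedding is position-independent, so it cannot distinguish between substitutions the way a trained model's embeddings would.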