Paul Vincent Llanes, Geoffrey A. Solano, Marc Jermaine Pontiveros
2023 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), published 2023-02-20. DOI: 10.1109/ICAIIC57133.2023.10067040 (https://doi.org/10.1109/ICAIIC57133.2023.10067040)
Learning the Protein Language Model of SARS-CoV-2 Spike Proteins
Abstract-The SARS-CoV-2 virus has continued to evolve, posing an increased risk in terms of infectivity and transmissibility and causing greater impact on communities worldwide. With the surge of collected SARS-CoV-2 sequences, studies have found that most emerging variants are linked to increased mutations in the spike (S) protein, as observed in the Alpha, Beta, Gamma, and Delta variants. Multiple genomic surveillance approaches have been used to monitor the mutational status and spread of the virus; however, most depend heavily on labels attributed to these sequences. Hence, this study presents a system that learns the protein language model of SARS-CoV-2 spike proteins, based on a bidirectional long short-term memory (BiLSTM) recurrent neural network, using sequence data alone. Once sequence embeddings are obtained from the model, clusters are generated using the Leiden clustering algorithm and visualized to monitor similarities between variants in terms of grammatical probability and semantic change. Additionally, the system measures the validity of a user-generated next-generation sequence, capturing potential sequence mutations indicative of viral escape, particularly substitution mutations. Further studies on methods for uncovering the semantic rules that govern spike proteins are recommended to learn more about other viral characteristics that may indicate the future course of the COVID-19 pandemic.
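
The abstract does not give the scoring formulas, so the following is only a minimal sketch of how substitution mutations might be scored for grammatical probability and semantic change. Everything in it is an assumption made for illustration: a per-position residue frequency profile stands in for the trained BiLSTM's output distribution, k-mer counts stand in for the learned sequence embedding, semantic change is taken as the L1 distance between embeddings, and the function names (position_profile, embed, score_substitutions) are hypothetical. Sequences are assumed to be equal-length and to use only the 20 standard amino acid letters.

```python
from collections import Counter
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)


def position_profile(aligned: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-position residue probabilities from equal-length sequences.

    Toy stand-in for the language model's predicted distribution;
    NOT the BiLSTM described in the abstract.
    """
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for i in range(len(aligned[0])):
        counts = Counter(seq[i] for seq in aligned)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


def embed(seq: str, k: int = 2) -> Counter:
    """Toy embedding: counts of overlapping k-mers (assumption, not a learned embedding)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def l1_distance(a: Counter, b: Counter) -> int:
    """L1 distance between two sparse count vectors."""
    return sum(abs(a[key] - b[key]) for key in set(a) | set(b))


def score_substitutions(
    wild_type: str, profile: List[Dict[str, float]]
) -> List[Tuple[str, int, float]]:
    """Score every single-residue substitution of wild_type.

    Returns (mutation name, semantic change, grammaticality) tuples, where
    semantic change is the embedding distance to the wild type and
    grammaticality is the profile probability of the substituted residue.
    """
    base = embed(wild_type)
    results = []
    for pos, wt_res in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa == wt_res:
                continue
            mutant = wild_type[:pos] + aa + wild_type[pos + 1:]
            name = f"{wt_res}{pos + 1}{aa}"  # e.g. V3L: V at position 3 replaced by L
            results.append((name, l1_distance(base, embed(mutant)), profile[pos][aa]))
    return results


# Usage on made-up five-residue fragments (illustrative data, not real spike sequences).
sequences = ["MFVFL", "MFVFL", "MFLFL", "MFVYL"]
profile = position_profile(sequences)
scores = score_substitutions("MFVFL", profile)
# Assumed ranking rule: high semantic change first, ties broken by grammaticality.
ranked = sorted(scores, key=lambda r: (r[1], r[2]), reverse=True)
for name, change, grammaticality in ranked[:3]:
    print(name, change, round(grammaticality, 3))
```

In this sketch, a substitution that both changes the embedding substantially and remains probable under the profile would be a candidate worth inspecting; how the two scores should actually be combined is a design choice the abstract leaves open.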