{"title":"Forecasting dominance of SARS-CoV-2 lineages by anomaly detection using deep AutoEncoders.","authors":"Simone Rancati, Giovanna Nicora, Mattia Prosperi, Riccardo Bellazzi, Marco Salemi, Simone Marini","doi":"10.1101/2023.10.24.563721","DOIUrl":null,"url":null,"abstract":"<p><p>The coronavirus disease of 2019 (COVID-19) pandemic is characterized by sequential emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants, lineages, and sublineages, outcompeting previously circulating ones because of, among other factors, increased transmissibility and immune escape. We propose DeepAutoCoV, an unsupervised deep learning anomaly detection system to predict future dominant lineages (FDLs). We define FDLs as viral (sub)lineages that will constitute more than 10% of all the viral sequences added to the GISAID database on a given week. DeepAutoCoV is trained and validated by assembling global and country-specific data sets from over 16 million Spike protein sequences sampled over a period of about 4 years. DeepAutoCoV successfully flags FDLs at very low frequencies (0.01% - 3%), with median lead times of 4-17 weeks, and predicts FDLs ~5 and ~25 times better than a baseline approach For example, the B.1.617.2 vaccine reference strain was flagged as FDL when its frequency was only 0.01%, more than a year before it was considered for an updated COVID-19 vaccine. Furthermore, DeepAutoCoV outputs interpretable results by pinpointing specific mutations potentially linked to increased fitness, and may provide significant insights for the optimization of public health <i>pre-emptive</i> intervention strategies.</p>","PeriodicalId":72407,"journal":{"name":"bioRxiv : the preprint server for biology","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10634784/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"bioRxiv : the preprint server for biology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2023.10.24.563721","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
The coronavirus disease of 2019 (COVID-19) pandemic is characterized by sequential emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants, lineages, and sublineages, outcompeting previously circulating ones because of, among other factors, increased transmissibility and immune escape. We propose DeepAutoCoV, an unsupervised deep learning anomaly detection system to predict future dominant lineages (FDLs). We define FDLs as viral (sub)lineages that will constitute more than 10% of all the viral sequences added to the GISAID database on a given week. DeepAutoCoV is trained and validated by assembling global and country-specific data sets from over 16 million Spike protein sequences sampled over a period of about 4 years. DeepAutoCoV successfully flags FDLs at very low frequencies (0.01% - 3%), with median lead times of 4-17 weeks, and predicts FDLs ~5 and ~25 times better than a baseline approach For example, the B.1.617.2 vaccine reference strain was flagged as FDL when its frequency was only 0.01%, more than a year before it was considered for an updated COVID-19 vaccine. Furthermore, DeepAutoCoV outputs interpretable results by pinpointing specific mutations potentially linked to increased fitness, and may provide significant insights for the optimization of public health pre-emptive intervention strategies.