Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2.

Virus Evolution (IF 5.5; Zone 2, Medicine; JCR Q1, Virology). Pub Date: 2024-11-14. eCollection Date: 2024-01-01. DOI: 10.1093/ve/veae087
Sravani Nanduri, Allison Black, Trevor Bedford, John Huddleston
Virus Evolution, 10(1), veae087. Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11604119/pdf/

Abstract

Public health researchers and practitioners commonly infer phylogenies from viral genome sequences to understand transmission dynamics and identify clusters of genetically related samples. However, viruses that reassort or recombine violate phylogenetic assumptions and require more sophisticated methods. Even when phylogenies are appropriate, they can be unnecessary or difficult to interpret without specialist knowledge. For example, pairwise distances between sequences can be enough to identify clusters of related samples or assign new samples to existing phylogenetic clusters. In this work, we tested whether dimensionality reduction methods could capture known genetic groups within two human pathogenic viruses that cause substantial human morbidity and mortality and frequently reassort or recombine, respectively: seasonal influenza A/H3N2 and SARS-CoV-2. We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2). For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic distances and pairwise Euclidean distances in the embedding, and applied a hierarchical clustering method to identify clusters in the embedding. We measured the accuracy of clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages. We found that MDS embeddings accurately represented pairwise genetic distances, including the intermediate placement of recombinant SARS-CoV-2 lineages between parental lineages. Clusters from t-SNE embeddings accurately recapitulated known phylogenetic clades, H3N2 reassortment groups, and SARS-CoV-2 recombinant lineages. We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses. Our open source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate.
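
The evaluation described in the abstract (embed sequences, correlate genetic with Euclidean distances, then cluster the embedding) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the toy alignment `SEQS` is hypothetical, classical (Torgerson) MDS computed by power iteration stands in for the MDS method used in the paper, and a single-linkage cut at an arbitrary `threshold` stands in for the hierarchical clustering method the abstract mentions. All function names here are illustrative assumptions.

```python
import math
from itertools import combinations

# Hypothetical toy alignment: two groups of three closely related sequences.
SEQS = {
    "a1": "AAAAAAAAAA", "a2": "AAAAAAAAAT", "a3": "AAAAAAAATT",
    "b1": "CCCCCCAAAA", "b2": "CCCCCCAAAT", "b3": "CCCCCCAATT",
}

def hamming(s, t):
    """Number of mismatched positions between two aligned sequences."""
    return sum(x != y for x, y in zip(s, t))

def top_eigenpair(m, iters=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    n = len(m)
    v = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    lam = sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))
    return lam, v

def classical_mds(dist, dims=2):
    """Classical MDS: eigendecompose the double-centred squared distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        lam, v = top_eigenpair(b)
        scale = math.sqrt(max(lam, 0.0))  # a negative eigenvalue adds no coordinate
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def single_linkage(points, threshold):
    """Cut a single-linkage hierarchy at `threshold` via union-find over close pairs."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

names = list(SEQS)
genetic = [[hamming(SEQS[a], SEQS[b]) for b in names] for a in names]
embedding = classical_mds(genetic)

pairs = list(combinations(range(len(names)), 2))
gen_d = [genetic[i][j] for i, j in pairs]
emb_d = [math.dist(embedding[i], embedding[j]) for i, j in pairs]
r = pearson(gen_d, emb_d)
labels = single_linkage(embedding, threshold=3.0)  # threshold chosen for the toy data

print(f"Pearson r between genetic and embedding distances: {r:.3f}")
print(dict(zip(names, labels)))
```

On real data, the distance matrix would come from aligned genome sequences, and the clustering threshold would need tuning against known clade labels rather than being fixed by hand as it is here.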

Source journal
Virus Evolution
Subject area: Immunology and Microbiology (Microbiology)
CiteScore: 10.50
Self-citation rate: 5.70%
Articles published: 108
Review time: 14 weeks
About the journal: Virus Evolution is a new Open Access journal focusing on the long-term evolution of viruses, viruses as a model system for studying evolutionary processes, viral molecular epidemiology, and environmental virology. The aim of the journal is to provide a forum for original research papers, reviews, and commentaries, and a venue for in-depth discussion of topics relevant to virus evolution.