Steven Weaver, Vanessa M Dávila Conn, Daniel Ji, Hannah Verdonk, Santiago Ávila-Ríos, Andrew J Leigh Brown, Joel O Wertheim, Sergei L Kosakovsky Pond
{"title":"自动调整:选择距离阈值以推断艾滋病毒传播集群。","authors":"Steven Weaver, Vanessa M Dávila Conn, Daniel Ji, Hannah Verdonk, Santiago Ávila-Ríos, Andrew J Leigh Brown, Joel O Wertheim, Sergei L Kosakovsky Pond","doi":"10.3389/fbinf.2024.1400003","DOIUrl":null,"url":null,"abstract":"<p><p>Molecular surveillance of viral pathogens and inference of transmission networks from genomic data play an increasingly important role in public health efforts, especially for HIV-1. For many methods, the genetic distance threshold used to connect sequences in the transmission network is a key parameter informing the properties of inferred networks. Using a distance threshold that is too high can result in a network with many spurious links, making it difficult to interpret. Conversely, a distance threshold that is too low can result in a network with too few links, which may not capture key insights into clusters of public health concern. Published research using the HIV-TRACE software package frequently uses the default threshold of 0.015 substitutions/site for HIV pol gene sequences, but in many cases, investigators heuristically select other threshold parameters to better capture the underlying dynamics of the epidemic they are studying. Here, we present a general heuristic scoring approach for tuning a distance threshold adaptively, which seeks to prevent the formation of giant clusters. We prioritize the ratio of the sizes of the largest and the second largest cluster, maximizing the number of clusters present in the network. We apply our scoring heuristic to outbreaks with different characteristics, such as regional or temporal variability, and demonstrate the utility of using the scoring mechanism's suggested distance threshold to identify clusters exhibiting risk factors that would have otherwise been more difficult to identify. For example, while we found that a 0.015 substitutions/site distance threshold is typical for US-like epidemics, recent outbreaks like the CRF07_BC subtype among men who have sex with men (MSM) in China have been found to have a lower optimal threshold of 0.005 to better capture the transition from injected drug use (IDU) to MSM as the primary risk factor. Alternatively, in communities surrounding Lake Victoria in Uganda, where there has been sustained heterosexual transmission for many years, we found that a larger distance threshold is necessary to capture a more risk factor-diverse population with sparse sampling over a longer period of time. Such identification may allow for more informed intervention action by respective public health officials.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"4 ","pages":"1400003"},"PeriodicalIF":2.8000,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11289888/pdf/","citationCount":"0","resultStr":"{\"title\":\"AUTO-TUNE: selecting the distance threshold for inferring HIV transmission clusters.\",\"authors\":\"Steven Weaver, Vanessa M Dávila Conn, Daniel Ji, Hannah Verdonk, Santiago Ávila-Ríos, Andrew J Leigh Brown, Joel O Wertheim, Sergei L Kosakovsky Pond\",\"doi\":\"10.3389/fbinf.2024.1400003\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Molecular surveillance of viral pathogens and inference of transmission networks from genomic data play an increasingly important role in public health efforts, especially for HIV-1. For many methods, the genetic distance threshold used to connect sequences in the transmission network is a key parameter informing the properties of inferred networks. Using a distance threshold that is too high can result in a network with many spurious links, making it difficult to interpret. Conversely, a distance threshold that is too low can result in a network with too few links, which may not capture key insights into clusters of public health concern. Published research using the HIV-TRACE software package frequently uses the default threshold of 0.015 substitutions/site for HIV pol gene sequences, but in many cases, investigators heuristically select other threshold parameters to better capture the underlying dynamics of the epidemic they are studying. Here, we present a general heuristic scoring approach for tuning a distance threshold adaptively, which seeks to prevent the formation of giant clusters. We prioritize the ratio of the sizes of the largest and the second largest cluster, maximizing the number of clusters present in the network. We apply our scoring heuristic to outbreaks with different characteristics, such as regional or temporal variability, and demonstrate the utility of using the scoring mechanism's suggested distance threshold to identify clusters exhibiting risk factors that would have otherwise been more difficult to identify. For example, while we found that a 0.015 substitutions/site distance threshold is typical for US-like epidemics, recent outbreaks like the CRF07_BC subtype among men who have sex with men (MSM) in China have been found to have a lower optimal threshold of 0.005 to better capture the transition from injected drug use (IDU) to MSM as the primary risk factor. Alternatively, in communities surrounding Lake Victoria in Uganda, where there has been sustained heterosexual transmission for many years, we found that a larger distance threshold is necessary to capture a more risk factor-diverse population with sparse sampling over a longer period of time. Such identification may allow for more informed intervention action by respective public health officials.</p>\",\"PeriodicalId\":73066,\"journal\":{\"name\":\"Frontiers in bioinformatics\",\"volume\":\"4 \",\"pages\":\"1400003\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-07-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11289888/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers in bioinformatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3389/fbinf.2024.1400003\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fbinf.2024.1400003","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0
摘要
对病毒病原体的分子监测和从基因组数据推断传播网络在公共卫生工作中发挥着越来越重要的作用,尤其是在 HIV-1 方面。对于许多方法来说,用于连接传播网络中序列的遗传距离阈值是影响推断网络特性的一个关键参数。使用过高的距离阈值会导致网络中出现许多虚假链接,从而难以解释。相反,如果距离阈值过低,则可能导致网络中的链接过少,从而无法捕捉到有关公共卫生问题的关键信息。已发表的使用 HIV-TRACE 软件包进行的研究通常使用 0.015 个取代/位点的默认阈值来处理 HIV pol 基因序列,但在许多情况下,研究人员会启发式地选择其他阈值参数,以更好地捕捉他们正在研究的流行病的潜在动态。在此,我们提出了一种通用的启发式评分方法,用于自适应地调整距离阈值,以防止形成巨大的簇。我们优先考虑最大集群和第二大集群的大小之比,最大限度地增加网络中存在的集群数量。我们将我们的评分启发式应用于具有不同特征的疫情爆发,如区域或时间变异性,并展示了使用评分机制建议的距离阈值来识别表现出风险因素的集群的实用性,否则这些集群将更难识别。例如,我们发现 0.015 个替代/地点的距离阈值是类似美国流行病的典型阈值,而最近在中国男男性行为者(MSM)中爆发的 CRF07_BC 亚型等流行病的最佳阈值较低,为 0.005,以便更好地捕捉从注射吸毒(IDU)到 MSM 作为主要风险因素的转变。另外,在乌干达维多利亚湖周边的社区,异性传播已持续多年,我们发现需要更大的距离阈值,才能在更长的时间内通过稀疏取样捕捉到风险因素更多样化的人群。这样的识别可以让相关公共卫生官员采取更明智的干预行动。
AUTO-TUNE: selecting the distance threshold for inferring HIV transmission clusters.
Molecular surveillance of viral pathogens and inference of transmission networks from genomic data play an increasingly important role in public health efforts, especially for HIV-1. For many methods, the genetic distance threshold used to connect sequences in the transmission network is a key parameter informing the properties of inferred networks. Using a distance threshold that is too high can result in a network with many spurious links, making it difficult to interpret. Conversely, a distance threshold that is too low can result in a network with too few links, which may not capture key insights into clusters of public health concern. Published research using the HIV-TRACE software package frequently uses the default threshold of 0.015 substitutions/site for HIV pol gene sequences, but in many cases, investigators heuristically select other threshold parameters to better capture the underlying dynamics of the epidemic they are studying. Here, we present a general heuristic scoring approach for tuning a distance threshold adaptively, which seeks to prevent the formation of giant clusters. We prioritize the ratio of the sizes of the largest and the second largest cluster, maximizing the number of clusters present in the network. We apply our scoring heuristic to outbreaks with different characteristics, such as regional or temporal variability, and demonstrate the utility of using the scoring mechanism's suggested distance threshold to identify clusters exhibiting risk factors that would have otherwise been more difficult to identify. For example, while we found that a 0.015 substitutions/site distance threshold is typical for US-like epidemics, recent outbreaks like the CRF07_BC subtype among men who have sex with men (MSM) in China have been found to have a lower optimal threshold of 0.005 to better capture the transition from injected drug use (IDU) to MSM as the primary risk factor. Alternatively, in communities surrounding Lake Victoria in Uganda, where there has been sustained heterosexual transmission for many years, we found that a larger distance threshold is necessary to capture a more risk factor-diverse population with sparse sampling over a longer period of time. Such identification may allow for more informed intervention action by respective public health officials.