AUTO-TUNE: selecting the distance threshold for inferring HIV transmission clusters.

Frontiers in Bioinformatics (IF 2.8, Q2, Mathematical & Computational Biology) · Pub Date: 2024-07-10 · eCollection Date: 2024-01-01 · DOI: 10.3389/fbinf.2024.1400003
Steven Weaver, Vanessa M Dávila Conn, Daniel Ji, Hannah Verdonk, Santiago Ávila-Ríos, Andrew J Leigh Brown, Joel O Wertheim, Sergei L Kosakovsky Pond
Volume 4, article 1400003. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11289888/pdf/

Abstract

Molecular surveillance of viral pathogens and inference of transmission networks from genomic data play an increasingly important role in public health efforts, especially for HIV-1. For many methods, the genetic distance threshold used to connect sequences in the transmission network is a key parameter that shapes the properties of the inferred network. A threshold that is too high can produce a network with many spurious links, making it difficult to interpret. Conversely, a threshold that is too low can produce a network with too few links, which may fail to capture clusters of public health concern. Published research using the HIV-TRACE software package frequently uses the default threshold of 0.015 substitutions/site for HIV pol gene sequences, but in many cases investigators heuristically select other thresholds to better capture the underlying dynamics of the epidemic they are studying. Here, we present a general heuristic scoring approach for tuning the distance threshold adaptively, which seeks to prevent the formation of giant clusters. The score prioritizes the ratio of the sizes of the largest and second-largest clusters while maximizing the number of clusters present in the network. We apply the scoring heuristic to outbreaks with different characteristics, such as regional or temporal variability, and demonstrate that the suggested distance threshold helps identify clusters exhibiting risk factors that would otherwise have been more difficult to identify. For example, while we found that a threshold of 0.015 substitutions/site is typical for US-like epidemics, recent outbreaks such as that of the CRF07_BC subtype among men who have sex with men (MSM) in China have a lower optimal threshold of 0.005 substitutions/site, which better captures the transition from injection drug use (IDU) to MSM as the primary risk factor. By contrast, in communities surrounding Lake Victoria in Uganda, where heterosexual transmission has been sustained for many years, we found that a larger distance threshold is necessary to capture a population with more diverse risk factors that was sparsely sampled over a longer period of time. Such identification may allow public health officials to take more informed intervention action.
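
To make the idea in the abstract concrete, the following is a minimal Python sketch of a threshold sweep: sequences are linked whenever their pairwise distance is at or below the threshold, clusters are the connected components, and each threshold is scored from the number of clusters and the ratio of the largest to the second-largest cluster. The function names, the toy distances, and in particular the `toy_score` formula are assumptions made for illustration. This is not the published AUTO-TUNE score and does not use the HIV-TRACE API.

```python
def clusters_at_threshold(distances, threshold):
    """Return cluster sizes (largest first) when pairs at or below
    `threshold` are linked. `distances` maps (id_a, id_b) to a genetic
    distance in substitutions/site. Singletons are not reported."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        ra, rb = find(a), find(b)  # also registers both sequences
        if d <= threshold and ra != rb:
            parent[ra] = rb  # merge the two components

    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted((s for s in sizes.values() if s >= 2), reverse=True)


def toy_score(sizes, max_clusters):
    """Hypothetical score (an assumption, not the paper's formula):
    reward many clusters, penalize a dominant largest cluster."""
    if not sizes or max_clusters == 0:
        return 0.0
    # With a single cluster, treat the runner-up as a singleton.
    second = sizes[1] if len(sizes) > 1 else 1
    ratio = sizes[0] / second
    return (len(sizes) / max_clusters) / ratio


def sweep(distances, thresholds):
    """Score every candidate threshold and return (best, scores)."""
    sizes_by_t = {t: clusters_at_threshold(distances, t) for t in thresholds}
    max_clusters = max(len(s) for s in sizes_by_t.values())
    scores = {t: toy_score(s, max_clusters) for t, s in sizes_by_t.items()}
    best = max(thresholds, key=lambda t: scores[t])
    return best, scores


# Made-up pairwise distances for six sequences (illustration only).
DIST = {
    ("A", "B"): 0.004,
    ("C", "D"): 0.006,
    ("E", "F"): 0.012,
    ("B", "C"): 0.014,
    ("D", "E"): 0.020,
}

best, scores = sweep(DIST, [0.005, 0.01, 0.015, 0.02])
print(best, scores)
```

On these toy data the highest threshold merges all six sequences into a single cluster, which the toy score penalizes. That is the qualitative "giant cluster" behavior the heuristic is meant to avoid; the numeric values themselves carry no epidemiological meaning.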
