Investigating alignment-free machine learning methods for HIV-1 subtype classification.

Bioinformatics Advances (IF 2.4, Q2, Mathematical & Computational Biology). Pub Date: 2024-07-29. eCollection Date: 2024-01-01. DOI: 10.1093/bioadv/vbae108

Kaitlyn E Wade, Lianghong Chen, Chutong Deng, Gen Zhou, Pingzhao Hu



Abstract

Motivation: Many viruses are organized into taxonomies of subtypes based on their genetic similarities. For human immunodeficiency virus 1 (HIV-1), subtype classification plays a crucial role in infection management. Sequence alignment-based methods for subtype classification are impractical for large datasets because they are costly and time-consuming. Alignment-free methods involve creating numerical representations for genetic sequences and applying statistical or machine learning methods. Despite their high overall accuracy, existing models perform poorly on less common subtypes. Furthermore, there is limited work investigating the impact of sequence vectorization methods, in particular natural language-inspired embedding methods, on HIV-1 subtype classification.
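As a rough illustration of what "creating numerical representations for genetic sequences" can look like, the sketch below turns a nucleotide sequence into a fixed-length vector of k-mer frequencies without any alignment step. The function name `kmer_vector`, the choice of k, and the handling of ambiguous bases are assumptions made for this example; they are not taken from the paper's source code.

```python
from itertools import product


def kmer_vector(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    # Enumerate all 4**k possible k-mers so every sequence maps to the same axes.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = sequence.upper()
    # Slide a window of width k across the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        # Windows containing characters outside the alphabet (e.g. N) are skipped.
        if j is not None:
            counts[j] += 1

    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    # Normalize so that sequences of different lengths are comparable.
    return [c / total for c in counts]


vec = kmer_vector("ACGTAC", k=2)
print(len(vec))   # 16 possible 2-mers
print(vec[1])     # frequency of "AC": 2 of 5 windows
```

Vectors of this kind have the same length for every input sequence, so they can be passed directly to a standard classifier; the price is that the dimension grows as 4^k.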

Results: We present a comprehensive analysis of sequence vectorization methods across machine learning methods. We report a k-mer-based XGBoost model with a balanced accuracy of 0.84, indicating that it has good overall performance for both common and uncommon HIV-1 subtypes. We also report a Word2Vec-based support vector machine that achieves promising results on precision and balanced accuracy. Our study sheds light on the effect of sequence vectorization methods on HIV-1 subtype classification and suggests that natural language-inspired encoding methods show promise. Our results could help to develop improved HIV-1 subtype classification methods, leading to improved individual patient outcomes and the development of subtype-specific treatments.
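Balanced accuracy is commonly defined as the mean of per-class recall, which is why it is sensitive to performance on rare subtypes in a way plain accuracy is not. The following is a minimal sketch under that standard definition; the function name and the toy labels are invented for illustration and are not data or code from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Each class contributes equally, regardless of how many samples it has.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy example (made-up labels): a model that always predicts the common subtype.
y_true = ["B"] * 8 + ["C"] * 2
y_pred = ["B"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In the toy case the classifier never identifies the rarer class, yet plain accuracy still reads 0.8; balanced accuracy drops to 0.5 because the missed class counts as much as the common one.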

Availability and implementation: Source code is available at https://www.github.com/kwade4/HIV_Subtypes.


Source journal

Bioinformatics advances

CiteScore

1.60

Self-citation rate

0.00%

Articles published