Sequencing and phylogenetic classification have become common tasks in human and animal diagnostic laboratories. It is now routine to sequence pathogens to identify genetic variations of diagnostic significance and to use these data in real-time genomic contact tracing and surveillance. Under this paradigm, unprecedented volumes of data are generated that require rapid analysis to provide meaningful inference.
We present a machine learning pipeline based on logistic regression that assigns classifications to genetic sequence data. The pipeline implements an intuitive and customizable approach to developing a trained prediction model that runs in linear time, generating accurate output rapidly, even with incomplete data. Our approach was benchmarked on porcine reproductive and respiratory syndrome virus (PRRSv) and swine H1 influenza A virus (IAV) datasets. Trained classifiers were tested on sequences and on simulated datasets in which sequence quality was artificially degraded by 0, 10, 20, 30, and 40%.
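The idea of training a logistic regression classifier on aligned sequences and then testing it on degraded copies can be sketched in a few lines. The block below is a minimal illustration under stated assumptions, not the package's actual implementation: it uses only the Python standard library, one-hot encodes residues, fits a softmax (multinomial logistic) model by batch gradient descent, and simulates degradation by replacing a fraction of positions with `X`. The function names (`one_hot`, `degrade`, `train`, `predict`), the toy sequences, the clade labels, and the masking scheme are all hypothetical.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 amino acids; any other symbol encodes as zeros

def one_hot(seq):
    """Flatten an aligned sequence into a 0/1 vector of length len(seq) * 20."""
    vec = [0.0] * (len(seq) * len(ALPHABET))
    for pos, residue in enumerate(seq):
        idx = ALPHABET.find(residue)
        if idx >= 0:
            vec[pos * len(ALPHABET) + idx] = 1.0
    return vec

def degrade(seq, fraction, rng):
    """Assumed degradation model: mask a fraction of positions with 'X'."""
    n_mask = round(len(seq) * fraction)
    masked = set(rng.sample(range(len(seq)), n_mask))
    return "".join("X" if i in masked else c for i, c in enumerate(seq))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, b, x):
    """One linear score per class: a single pass over the feature vector."""
    return [b[k] + sum(w * v for w, v in zip(W[k], x)) for k in range(len(b))]

def train(seqs, labels, epochs=200, lr=0.5):
    """Fit softmax regression with plain batch gradient descent."""
    classes = sorted(set(labels))
    X = [one_hot(s) for s in seqs]
    n_feat = len(X[0])
    W = [[0.0] * n_feat for _ in classes]
    b = [0.0] * len(classes)
    for _ in range(epochs):
        grad_W = [[0.0] * n_feat for _ in classes]
        grad_b = [0.0] * len(classes)
        for x, y in zip(X, labels):
            probs = softmax(class_scores(W, b, x))
            for k, cls in enumerate(classes):
                err = probs[k] - (1.0 if cls == y else 0.0)
                grad_b[k] += err
                for j, v in enumerate(x):
                    if v:
                        grad_W[k][j] += err
        for k in range(len(classes)):
            b[k] -= lr * grad_b[k] / len(X)
            for j in range(n_feat):
                W[k][j] -= lr * grad_W[k][j] / len(X)
    return classes, W, b

def predict(model, seq):
    classes, W, b = model
    scores = class_scores(W, b, one_hot(seq))
    return classes[scores.index(max(scores))]

# Toy aligned sequences: positions 0 and 4 separate the two hypothetical clades.
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "MCDEYGHIKL", "MCDEYGHIKV"]
labels = ["cladeA", "cladeA", "cladeB", "cladeB"]
model = train(seqs, labels)

rng = random.Random(0)
for fraction in (0.0, 0.1, 0.2, 0.3, 0.4):
    noisy = degrade(seqs[0], fraction, rng)
    print(fraction, noisy, predict(model, noisy))
```

Prediction here is a single weighted sum per class over the encoded positions, which is one way to read the linear-time claim. Masked positions contribute nothing to the score, so the classifier can still vote with whatever informative positions remain.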
When applied to poor-quality sequence data, the classifier achieved accuracy above 85% and up to 95% for the PRRSv and swine H1 IAV HA datasets, and this increased to near-perfect accuracy when using the full dataset. The model also identifies the amino acid positions used to determine genetic clade identity through a feature selection ranking within the model. These positions can be mapped onto a maximum-likelihood phylogenetic tree, allowing for the inference of clade-defining mutations.
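The text does not specify how positions are ranked, so the following is only one plausible heuristic, labeled as an assumption: sum the absolute logistic regression coefficients over all classes and residues at each alignment position, then sort positions by that total. The function name `rank_positions`, the toy weight matrix, and the two-letter alphabet used to keep the example small are hypothetical.

```python
def rank_positions(weights, n_positions, alphabet_size):
    """Rank alignment positions by total absolute coefficient mass.

    weights: one list of coefficients per class, laid out as
             [pos0_res0, pos0_res1, ..., pos1_res0, ...] (one-hot order).
    Returns (position, score) pairs, most influential first (0-based positions).
    """
    scores = [0.0] * n_positions
    for class_weights in weights:
        for j, w in enumerate(class_weights):
            scores[j // alphabet_size] += abs(w)
    order = sorted(range(n_positions), key=lambda p: scores[p], reverse=True)
    return [(p, scores[p]) for p in order]

# Toy coefficients: 2 classes, 3 positions, 2 residues per position.
toy_weights = [
    [0.0, 0.1, 1.2, -0.9, 0.0, 0.0],
    [0.0, -0.1, -1.2, 0.9, 0.0, 0.0],
]
ranking = rank_positions(toy_weights, n_positions=3, alphabet_size=2)
for position, score in ranking:
    print(position, round(score, 3))
```

Positions at the top of such a ranking are candidates for clade-defining sites and could then be checked against the phylogenetic tree.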
Our approach is implemented as a Python package with code available at
HIV-1 generates remarkable intra- and inter-host viral diversity during infection. In response to the dynamic selective pressures of the host environment, HIV-1 evolves distinct phenotypes, that is, biological features that provide fitness advantages. The transmitted form of HIV-1 has been shown to require a high density of CD4 on the target cell surface (as found on CD4+ T cells) and typically uses C–C chemokine receptor type 5 (CCR5) as a coreceptor during entry. This phenotype is referred to as R5 T cell-tropic (or R5 T-tropic); however, HIV-1 can switch to a secondary coreceptor, C–X–C chemokine receptor type 4 (CXCR4), resulting in an X4 T cell-tropic phenotype. Macrophage-tropic (or M-tropic) HIV-1 can evolve to efficiently enter cells expressing low densities of CD4 on their surface (such as macrophages and microglia). So far, only CCR5-using M-tropic viruses have been found. M-tropic HIV-1 is most frequently found within the central nervous system (CNS), and infection of the CNS has been associated with neurologic impairment. Interferon-resistant phenotypes have been shown to have a selective advantage during transmission, but the underlying mechanism remains unclear. During untreated infection, HIV-1 evolves under selective pressure from both the humoral (antibody) response and CD8+ T-cell killing. Sufficiently potent antiviral therapy can suppress viral replication, but if the drugs are not potent enough to stop replication, the replicating virus will evolve drug resistance. HIV-1 phenotypes are highly relevant to treatment efforts, clinical outcomes, vaccine studies, and cure strategies. It is therefore critical to understand the dynamics of the host environment that drive these phenotypes and how they affect HIV-1 pathogenesis. This review provides a comprehensive discussion of HIV-1 entry, transmission, and drug-resistant phenotypes. Finally, we assess the methods used in previous and current research to characterize these phenotypes.

