Background: While genomic variations can provide valuable information for health care and ancestry, the privacy of individual genomic data must be protected. Thus, a secure environment is desirable for a human DNA database such that the total data are queryable but not directly accessible to involved parties (eg, data hosts and hospitals) and that the query results are learned only by the user or authorized party.
Objective: In this study, we provide efficient and secure computations on panels of single nucleotide polymorphisms (SNPs) from genomic sequences as computed under the following set operations: union, intersection, set difference, and symmetric difference.
Methods: Using these operations, we can compute similarity metrics, such as the Jaccard similarity, which could allow querying a DNA database to find the same person and genetic relatives securely. We analyzed various security paradigms and show metrics for the protocols under several security assumptions, such as semihonest, malicious with honest majority, and malicious with a malicious majority.
Results: We show that our methods can be used practically on realistically sized data. Specifically, we can compute the Jaccard similarity of two genomes when considering sets of SNPs, each with 400,000 SNPs, in 2.16 seconds with the assumption of a malicious adversary in an honest majority and 0.36 seconds under a semihonest model.
Conclusions: Our methods may help adopt trusted environments for hosting individual genomic data with end-to-end data security.
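The Jaccard similarity mentioned in the Methods follows directly from the set operations listed in the Objective. The sketch below is a plaintext illustration only: it offers no cryptographic protection and is not the secure protocol evaluated in the study. The function names and SNP identifiers are hypothetical.

```python
def jaccard_similarity(snps_a, snps_b):
    """Return |A intersect B| / |A union B| for two SNP panels."""
    a, b = set(snps_a), set(snps_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty panels count as identical.
        return 1.0
    return len(a & b) / len(union)


def symmetric_difference_size(snps_a, snps_b):
    """Number of SNPs present in exactly one of the two panels."""
    return len(set(snps_a) ^ set(snps_b))


# Hypothetical panels; the identifiers are placeholders, not real SNP IDs.
panel_a = {"snp1", "snp2", "snp3", "snp4"}
panel_b = {"snp3", "snp4", "snp5"}

similarity = jaccard_similarity(panel_a, panel_b)
only_in_one = symmetric_difference_size(panel_a, panel_b)
```

Because the union size equals the intersection size plus the symmetric difference size, any two of these quantities are enough to recover the similarity.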
Background: COVID-19 and Middle East Respiratory Syndrome are two pandemic respiratory diseases caused by coronavirus species. The novel disease COVID-19 caused by SARS-CoV-2 was first reported in Wuhan, Hubei Province, China, in December 2019, and became a pandemic within 2-3 months, affecting social and economic platforms worldwide. Despite the rapid development of vaccines, there have been obstacles to their distribution, including a lack of fundamental resources, poor immunization, and manual vaccine replication. Several variants of the original Wuhan strain have emerged in the last 3 years, which can pose a further challenge for control and vaccine development.
Objective: The aim of this study was to comprehensively analyze mutations in SARS-CoV-2 variants of concern (VoCs) using a bioinformatics approach toward identifying novel mutations that may be helpful in developing new vaccines by targeting these sites.
Methods: Reference sequences of the SARS-CoV-2 spike (YP_009724390) and nucleocapsid (YP_009724397) proteins were compared to retrieved sequences of isolates of four VoCs from 14 countries for mutational and evolutionary analyses. Multiple sequence alignment was performed and phylogenetic trees were constructed by the neighbor-joining method with 1000 bootstrap replicates using MEGA (version 6). Mutations in amino acid sequences were analyzed using the MultAlin online tool (version 5.4.1).
Results: Among the four VoCs, a total of 143 nonsynonymous mutations and 8 deletions were identified in the spike and nucleocapsid proteins. Multiple sequence alignment and amino acid substitution analysis revealed new mutations, including G72W, M2101I, L139F, 209-211 deletion, G212S, P199L, P67S, I292T, and substitutions with unknown amino acid replacement, reported in Egypt (MW533289), the United Kingdom (MT906649), and other regions. The variants B.1.1.7 (Alpha variant) and B.1.617.2 (Delta variant), characterized by higher transmissibility and lethality, harbored the amino acid substitutions D614G, R203K, and G204R with higher prevalence rates in most sequences. Phylogenetic analysis among the novel SARS-CoV-2 variant proteins and some previously reported β-coronavirus proteins indicated that either the evolutionary clade was weakly supported or not supported at all by the β-coronavirus species.
Conclusions: This study could contribute toward gaining a better understanding of the basic nature of SARS-CoV-2 and its four major variants. The numerous novel mutations detected could also provide a better understanding of VoCs and help in identifying suitable mutations for vaccine targets. Moreover, these data offer evidence for new types of mutations in VoCs, which will provide insight into the epidemiology of SARS-CoV-2.
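The substitution labels reported in the Results (for example, D614G) combine the reference residue, the position, and the observed residue. A minimal sketch of that labeling step follows, assuming a pairwise alignment in which the reference contains no gaps; the function name and the toy sequences are hypothetical, and this is not the MultAlin procedure used in the study.

```python
def call_substitutions(reference, query):
    """Compare an aligned query protein against a gap-free reference.

    Returns (substitutions, deleted_positions). Positions are 1-based
    relative to the reference. '-' in the query marks a deleted residue
    and 'X' marks an unknown amino acid.
    """
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have equal length")
    substitutions, deleted_positions = [], []
    for position, (ref_aa, qry_aa) in enumerate(zip(reference, query), start=1):
        if qry_aa == ref_aa:
            continue
        if qry_aa == "-":
            deleted_positions.append(position)
        else:
            # Covers ordinary substitutions and unknown replacements ('X').
            substitutions.append(f"{ref_aa}{position}{qry_aa}")
    return substitutions, deleted_positions


# Toy sequences for illustration only; they are not SARS-CoV-2 proteins.
toy_reference = "MKTAYIAKQR"
toy_query = "MKTAWIA-QR"
subs, dels = call_substitutions(toy_reference, toy_query)
```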
JMIR Bioinformatics and Biotechnology supports interdisciplinary research and welcomes contributions that push the boundaries of bioinformatics, genomics, artificial intelligence, and pathology informatics.
Background: A thorough understanding of the patterns of genetic subdivision in a pathogen can provide crucial information that is necessary to prevent disease spread. For SARS-CoV-2, the availability of millions of genomes makes this task analytically challenging, and traditional methods for understanding genetic subdivision often fail.
Objective: The aim of our study was to use population genomics methods to identify the subtle subdivisions and demographic history of the Omicron variant, in addition to those captured by the Pango lineage.
Methods: We used a combination of an evolutionary network approach and multivariate statistical protocols to understand the subdivision and spread of the Omicron variant. We identified subdivisions within the BA.1 and BA.2 lineages and further identified the mutations associated with each cluster. We further characterized the overall genomic diversity of the Omicron variant and assessed the selection pressure for each of the genetic clusters identified.
Results: We observed concordant results, using two different methods to understand genetic subdivision. The overall pattern of subdivision in the Omicron variant was in broad agreement with the Pango lineage definition. Further, 1 cluster of the BA.1 lineage and 3 clusters of the BA.2 lineage revealed statistically significant signatures of selection or demographic expansion (Tajima's D<-2), suggesting the role of microevolutionary processes in the spread of the virus.
Conclusions: We provide an easy framework for assessing the genetic structure and demographic history of SARS-CoV-2, which can be particularly useful for understanding the local history of the virus. We identified important mutations that are advantageous to some lineages of Omicron and aid in the transmission of the virus. This is crucial information for policy makers, as preventive measures can be designed to mitigate further spread based on a holistic understanding of the variability of the virus and the evolutionary processes aiding its spread.
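The Results use Tajima's D as the signature of selection or demographic expansion. As a rough sketch of the standard statistic (Tajima, 1989), assuming equal-length aligned sequences without gaps or ambiguous bases, the computation can be written as follows; the function name is hypothetical, and this is not the pipeline used in the study.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for equal-length aligned sequences; None if no site varies."""
    n = len(sequences)
    if n < 4:
        # The variance term is zero for n < 4, so D is undefined there.
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to equal length")

    # S: number of segregating (polymorphic) sites.
    seg_sites = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if seg_sites == 0:
        return None

    # pi: mean number of pairwise differences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs) / len(pairs)

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy alignments: a single rare variant versus a balanced polymorphism.
d_rare = tajimas_d(["AAAA", "AAAA", "AAAA", "AAAT"])
d_balanced = tajimas_d(["AAAA", "AAAA", "AAAT", "AAAT"])
```

An excess of rare variants pulls the statistic below zero, which is the direction the abstract reports for the clusters under selection or expansion.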
Background: There is a great need for computational approaches to analyze and exploit the information contained in gene expression data. The recent use of nonnegative matrix factorization (NMF) in computational biology has demonstrated its capability to derive essential details from large amounts of data, in particular gene expression microarrays. A common problem in NMF is choosing the proper factorization rank (r), that is, the number of factors in the reduced representation, and no agreement exists on which technique is most appropriate for this purpose. Thus, various techniques have been suggested to select the optimal value of the factorization rank (r).
Objective: In this work, a new metric for rank selection is proposed based on the elbow method, which was methodically compared against the cophenetic metric.
Methods: To determine the optimal rank (r), this study applied the unit invariant knee (UIK) method to NMF of gene expression data sets. The UIK method relies on an extremum distance estimator to locate an inflection point and thereby identify a knee point. Accordingly, the proposed method finds the first inflection point of the curve of the residual sum of squares obtained from the NMF algorithms, with the gene expression data set as the target matrix.
Results: Computation was conducted for the UIK task using gene expression data of acute lymphoblastic leukemia and acute myeloid leukemia samples. Consequently, the distinct results of NMF were subjected to comparison on different algorithms. The proposed UIK method is easy to perform, fast, free of a priori rank value input, and does not require initial parameters that significantly influence the model's functionality.
Conclusions: This study demonstrates that the elbow method provides a credible prediction both for gene expression data and for precisely estimating simulated mutational processes data with known dimensions. The proposed UIK method is faster than conventional methods, including metrics that use the consensus matrix as a criterion for rank selection, and achieves significantly better computational efficiency without requiring visual inspection of the curves. Finally, the suggested rank tuning method based on the elbow method for gene expression data is arguably theoretically superior to the cophenetic measure.
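To make the idea of reading the rank off an elbow in the residual sum of squares concrete, the sketch below uses a simpler, swapped-in heuristic: it picks the point farthest from the chord joining the first and last points of the normalized curve. This is not the UIK estimator described in the Methods, and the function name and RSS values are hypothetical.

```python
from math import hypot


def elbow_by_chord_distance(ranks, rss):
    """Return the rank whose (rank, RSS) point lies farthest from the chord
    joining the first and last points, after scaling both axes to [0, 1]."""
    if len(ranks) != len(rss) or len(ranks) < 3:
        raise ValueError("need at least 3 matching (rank, RSS) points")
    x_first, x_last = ranks[0], ranks[-1]
    y_min, y_max = min(rss), max(rss)
    if x_first == x_last or y_min == y_max:
        raise ValueError("curve is degenerate")

    # Normalize so the result does not depend on the units of either axis.
    xs = [(r - x_first) / (x_last - x_first) for r in ranks]
    ys = [(v - y_min) / (y_max - y_min) for v in rss]

    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    chord_length = hypot(dx, dy)

    best_index, best_distance = 0, -1.0
    for index, (x, y) in enumerate(zip(xs, ys)):
        # Perpendicular distance from (x, y) to the chord.
        distance = abs(dx * (ys[0] - y) - (xs[0] - x) * dy) / chord_length
        if distance > best_distance:
            best_index, best_distance = index, distance
    return ranks[best_index]


# Hypothetical RSS values for NMF fits at ranks 2 through 8.
candidate_ranks = [2, 3, 4, 5, 6, 7, 8]
residuals = [100.0, 60.0, 35.0, 30.0, 27.0, 25.0, 24.0]
selected_rank = elbow_by_chord_distance(candidate_ranks, residuals)
```

Like the approach in the abstract, this needs no a priori rank and no visual inspection, but it should be read as an illustration of the elbow concept rather than a reproduction of the published method.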
Background: Dengue fever can progress to dengue hemorrhagic fever (DHF), a more serious and occasionally fatal form of the disease. Indicators of serious disease arise about the time the fever begins to reduce (typically 3 to 7 days following symptom onset). There are currently no effective antivirals available. Drug repurposing is an emerging drug discovery process for rapidly developing effective DHF therapies. Through network pharmacology modeling, several US Food and Drug Administration (FDA)-approved medications have already been researched for various viral outbreaks.
Objective: We aimed to identify potentially repurposable drugs for DHF among existing FDA-approved drugs for viral attacks, symptoms of viral fevers, and DHF.
Methods: Using target identification databases (GeneCards and DrugBank), we identified human-DHF virus interacting genes and drug targets against these genes. We determined hub genes and potential drugs with a network-based analysis. We performed functional enrichment and network analyses to identify pathways, protein-protein interactions, tissues where the gene expression was high, and disease-gene associations.
Results: Analyzing virus-host interactions and therapeutic targets in the human genome network revealed 45 repurposable medicines. Hub network analysis of host-virus-drug associations suggested that aspirin, captopril, and rilonacept might efficiently treat DHF. Gene enrichment analysis supported these findings. According to a Mayo Clinic report, using aspirin in the treatment of dengue fever may increase the risk of bleeding complications, but several studies from around the world suggest that thrombosis is associated with DHF. The human interactome contains the genes prostaglandin-endoperoxide synthase 2 (PTGS2), angiotensin converting enzyme (ACE), and coagulation factor II, thrombin (F2), which have been documented to have a role in the pathogenesis of disease progression in DHF. Our analysis of most of the drugs targeting these genes showed that the hub gene module (human-virus-drug) was highly enriched in tissues associated with the immune system (P=7.29 × 10⁻²⁴) and human umbilical vein endothelial cells (P=1.83 × 10⁻²⁰); this group of tissues acts as an anticoagulant barrier between the vessel walls and blood. KEGG analysis showed an association with genes linked to cancer (P=1.13 × 10⁻¹⁴) and the advanced glycation end products-receptor for advanced glycation end products signaling pathway in diabetic complications (P=3.52 × 10⁻¹⁴), which indicates that DHF patients with diabetes and cancer are at risk of higher pathogenicity. Thus, gene-targeting medications may play a significant part in limiting or worsening the condition of DHF patients.
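One common way to pick hubs in a host-virus-drug network is to rank nodes by degree, that is, by their number of distinct neighbors. The abstract does not state which centrality measure was used, so degree is an assumption here. The function name and every node and edge below are placeholders, not associations from the study.

```python
from collections import defaultdict


def rank_hubs(edges, top_k=3):
    """Rank nodes of an undirected network by degree, breaking ties by name."""
    neighbours = defaultdict(set)
    for node_a, node_b in edges:
        if node_a == node_b:
            continue  # ignore self-loops
        neighbours[node_a].add(node_b)
        neighbours[node_b].add(node_a)
    ranked = sorted(neighbours, key=lambda node: (-len(neighbours[node]), node))
    return [(node, len(neighbours[node])) for node in ranked[:top_k]]


# Placeholder edges for illustration only.
toy_edges = [
    ("viral_protein_1", "gene_A"),
    ("viral_protein_2", "gene_A"),
    ("drug_1", "gene_A"),
    ("drug_2", "gene_A"),
    ("drug_2", "gene_B"),
    ("viral_protein_2", "gene_B"),
    ("drug_3", "gene_C"),
]
top_hubs = rank_hubs(toy_edges, top_k=2)
```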
Conclusions: Aspirin is not usually prescribed for dengue fever because of bleeding complications, but it
Background: T helper (Th) 9 cells are a novel subset of Th cells that develop independently from Th2 cells and are characterized by the secretion of interleukin (IL)-9. Studies have suggested the involvement of Th9 cells in variable diseases such as allergic and pulmonary diseases (eg, asthma, chronic obstructive airway disease, chronic rhinosinusitis, nasal polyps, and pulmonary hypoplasia), metabolic diseases (eg, acute leukemia, myelocytic leukemia, breast cancer, lung cancer, melanoma, pancreatic cancer), neuropsychiatric disorders (eg, Alzheimer disease), autoimmune diseases (eg, Graves disease, Crohn disease, colitis, psoriasis, systemic lupus erythematosus, systemic scleroderma, rheumatoid arthritis, multiple sclerosis, inflammatory bowel disease, atopic dermatitis, eczema), and infectious diseases (eg, tuberculosis, hepatitis). However, there is a dearth of information on its involvement in other metabolic, neuropsychiatric, and infectious diseases.
Objective: This study aims to identify significant differentially altered genes in the conversion of Th2 to Th9 cells, and their regulating microRNAs (miRs) from publicly available Gene Expression Omnibus data sets of the mouse model using in silico analysis to unravel various pathogenic pathways involved in disease processes.
Methods: Using differentially expressed genes (DEGs) identified from 2 publicly available data sets (GSE99166 and GSE123501) we performed functional enrichment and network analyses to identify pathways, protein-protein interactions, miR-messenger RNA associations, and disease-gene associations related to significant differentially altered genes implicated in the conversion of Th2 to Th9 cells.
Results: We extracted 260 common downregulated, 236 common upregulated, and 634 common DEGs from the expression profiles of data sets GSE99166 and GSE123501. Codifferentially expressed ILs, cytokines, receptors, and transcription factors (TFs) were enriched in 7 crucial Kyoto Encyclopedia of Genes and Genomes pathways and Gene Ontology. We constructed the protein-protein interaction network and predicted the top regulatory miRs involved in the Th2 to Th9 differentiation pathways. We also identified various metabolic, allergic and pulmonary, neuropsychiatric, autoimmune, and infectious diseases as well as carcinomas where the differentiation of Th2 to Th9 may play a crucial role.
Conclusions: This study identified hitherto unexplored possible associations between Th9 and disease states. Some important ILs, including CCL1 (chemokine [C-C motif] ligand 1), CCL20 (chemokine [C-C motif] ligand 20), IL-13, IL-4, IL-12A, and IL-9; receptors, including IL-12RB1, IL-4RA (interleukin 9 receptor alpha), CD53 (cluster of differentiation 53), CD6 (cluster of differentiation 6), CD5 (cluster of differentiation 5), CD83 (cluster of differentiation 83), CD197 (cluster of differentiation
Background: Emergence of the new SARS-CoV-2 variant B.1.1.529 worried health policy makers worldwide due to a large number of mutations in its genomic sequence, especially in the spike protein region. The World Health Organization (WHO) designated this variant as a global variant of concern (VOC), which was named "Omicron." Following Omicron's emergence, a surge of new COVID-19 cases was reported globally, primarily in South Africa.
Objective: The aim of this study was to understand whether Omicron had an epidemiological advantage over existing variants.
Methods: We performed an in silico analysis of the complete genomic sequences of Omicron available on the Global Initiative on Sharing All Influenza Data (GISAID) database to analyze the functional impact of the mutations present in this variant on virus-host interactions in terms of viral transmissibility, virulence/lethality, and immune escape. In addition, we performed a correlation analysis of the relative proportion of the genomic sequences of specific SARS-CoV-2 variants (in the period from October 1 to November 29, 2021) with matched epidemiological data (new COVID-19 cases and deaths) from South Africa.
Results: Compared with the current list of global VOCs/variants of interest (VOIs), as per the WHO, Omicron bears more sequence variation, specifically in the spike protein and host receptor-binding motif (RBM). Omicron showed the closest nucleotide and protein sequence homology with the Alpha variant for the complete sequence and the RBM. The mutations were found to be primarily condensed in the spike region (n=28-48) of the virus. Further mutational analysis showed enrichment for the mutations decreasing binding affinity to angiotensin-converting enzyme 2 receptor and receptor-binding domain protein expression, and for increasing the propensity of immune escape. An inverse correlation of Omicron with the Delta variant was noted (r=-0.99, P<.001; 95% CI -0.99 to -0.97) in the sequences reported from South Africa postemergence of the new variant, subsequently showing a decrease. There was a steep rise in new COVID-19 cases in parallel with the increase in the proportion of Omicron isolates since the report of the first case (74%-100%). By contrast, the incidence of new deaths did not increase (r=-0.04, P>.05; 95% CI -0.52 to 0.58).
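The correlation coefficients and 95% CIs reported above can be illustrated with a short computation. The sketch below assumes a Pearson correlation with a confidence interval from the Fisher z transformation; the abstract does not name the exact procedure, so both are assumptions. The function names and variant proportions are hypothetical, not the South African data.

```python
from math import atanh, sqrt, tanh


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, cov / sqrt(var_x * var_y)))


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z transformation."""
    if n < 4:
        raise ValueError("need at least 4 observations")
    if abs(r) == 1.0:
        return (r, r)
    z, se = atanh(r), 1 / sqrt(n - 3)
    return (tanh(z - z_crit * se), tanh(z + z_crit * se))


# Hypothetical weekly proportions of two variants among sequenced samples.
omicron_share = [0.00, 0.10, 0.30, 0.60, 0.90]
delta_share = [0.98, 0.88, 0.70, 0.35, 0.08]

r = pearson_r(omicron_share, delta_share)
ci_low, ci_high = fisher_ci(r, len(omicron_share))
```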
Conclusions: In silico analysis of viral genomic sequences suggests that the Omicron variant has more remarkable immune-escape ability than existing VOCs/VOIs, including Delta, but reduced virulence/lethality than other reported variants. The higher power for immune escape for Omicron was a likely reason for the resurgence in COVID-19 cases and its rapid rise as the globally dominant strain. Being more infectious but less lethal than the existing variants, Omicron could have plausibly led to widespread unnoticed new, repeated, and vacci
Background: The emergence of SARS-CoV-2 variants with mutations associated with increased transmissibility and virulence is a public health concern in Ontario, Canada. Characterizing how the mutational patterns of the SARS-CoV-2 genome have changed over time can shed light on the driving factors, including selection for increased fitness and host immune response, that may contribute to the emergence of novel variants. Moreover, the study of SARS-CoV-2 in the microcosm of Ontario, Canada can reveal how different province-specific public health policies over time may be associated with observed mutational patterns as a model system.
Objective: This study aimed to perform a comprehensive analysis of single base substitution (SBS) types, counts, and genomic locations observed in SARS-CoV-2 genomic sequences sampled in Ontario, Canada. Comparisons of mutational patterns were conducted between sequences sampled during 4 different epochs delimited by major public health events to track the evolution of the SARS-CoV-2 mutational landscape over 2 years.
Methods: In total, 24,244 SARS-CoV-2 genomic sequences and associated metadata sampled in Ontario, Canada from January 1, 2020, to December 31, 2021, were retrieved from the Global Initiative on Sharing All Influenza Data database. Sequences were assigned to 4 epochs delimited by major public health events based on the sampling date. SBSs from each SARS-CoV-2 sequence were identified relative to the MN996528.1 reference genome. Catalogues of SBS types and counts were generated to estimate the impact of selection in each open reading frame, and identify mutation clusters. The estimation of mutational fitness over time was performed using the Augur pipeline.
Results: The biases in SBS types and proportions observed support previous reports of host antiviral defense activity involving the SARS-CoV-2 genome. There was an increase in U>C substitutions associated with adenosine deaminase acting on RNA (ADAR) activity uniquely observed during Epoch 4. The burden of novel SBSs observed in SARS-CoV-2 genomic sequences was the greatest in Epoch 2 (median 5), followed by Epoch 3 (median 4). Clusters of SBSs were observed in the spike protein open reading frame, ORF1a, and ORF3a. The high proportion of nonsynonymous SBSs and increasing dN/dS metric (ratio of nonsynonymous to synonymous mutations in a given open reading frame) to above 1 in Epoch 4 indicate positive selection of the spike protein open reading frame.
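A catalogue of SBS types, as described in the Methods, amounts to counting each reference-to-observed base change. The following is a minimal sketch, assuming the sample is already aligned to the reference without indels and written in the RNA alphabet used above (for example, U>C). The function name and toy sequences are hypothetical, and this is not the pipeline used in the study.

```python
from collections import Counter


def sbs_catalogue(reference, sample):
    """Count single base substitution types (e.g. 'C>U') between an aligned
    reference and sample. T is read as U; ambiguous bases are skipped."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to equal length")
    valid_bases = set("ACGU")
    ref_rna = reference.upper().replace("T", "U")
    sample_rna = sample.upper().replace("T", "U")
    counts = Counter()
    for ref_base, alt_base in zip(ref_rna, sample_rna):
        if ref_base in valid_bases and alt_base in valid_bases and ref_base != alt_base:
            counts[f"{ref_base}>{alt_base}"] += 1
    return counts


# Toy sequences for illustration only.
toy_reference = "ACGUACGUAC"
toy_sample = "ACGCACGUAU"
catalogue = sbs_catalogue(toy_reference, toy_sample)
```

Comparing such catalogues between epochs is one way to see shifts like the increase in U>C substitutions reported for Epoch 4.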
Conclusions: Quantitative analysis of the mutational patterns of the SARS-CoV-2 genome in the microcosm of Ontario, Canada within early consecutive epochs of the pandemic tracked the mutational dynamics in the context of public health events that instigate significant shifts in selection and mutagenesis. Continued genomic surveillance of emergent variants will be useful for the design of public he

