M. Okomo-Adhiambo, E. Ramos, Reagan J. Kelly, Yatish Jain, R. Tatusov, A. Montmayeur, Gregory Doho, Rachel L. Marine, T. Ng, Adam C. Retchless, S. Oberste, P. Rota, X. Wang, Agha N. Khan
Next-generation sequencing (NGS) has become a vital tool in clinical microbiology, with numerous applications in infectious disease diagnostics, outbreak investigations, and public health surveillance. Although NGS technology enables comprehensive pathogen detection in a relatively short time and at low cost, the enormous amount of genomic data generated makes it a critical challenge to organize, archive, analyze, and report results within a clinically relevant timeframe. Automated pipelines provide the first step in standardizing NGS data processing and reporting, thereby removing common bottlenecks in bioinformatics analyses and providing rapid turnaround. Here, we present the Viral NGS Pipeline, optimized for identification and whole-genome assembly of viruses, and the Bacterial Meningococcus Genome Analysis Platform (BMGAP), designed for genotypic characterization of meningitis pathogens. Together, these pipelines have been used to analyze more than 11,000 clinical samples and isolates. The pipelines are deployable on both standalone and cloud-based servers, making them accessible to internal CDC users as well as external partners, including state public health laboratories and other collaborators worldwide. These automated pipelines have the potential to contribute to the development of unbiased NGS-based clinical assays for pathogen detection that demand rapid turnaround times, and are expected to play a key role in infectious disease surveillance in the future.
{"title":"Automated Next Generation Sequencing Bioinformatics Pipelines for Pathogen Discovery and Surveillance","authors":"M. Okomo-Adhiambo, E. Ramos, Reagan J. Kelly, Yatish Jain, R. Tatusov, A. Montmayeur, Gregory Doho, Rachel L. Marine, T. Ng, Adam C. Retchless, S. Oberste, P. Rota, X. Wang, Agha N. Khan","doi":"10.1145/3107411.3108192","DOIUrl":"https://doi.org/10.1145/3107411.3108192","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"128087637","RegionNum":0,"Total":0}
Naveen Mangalakumar, A. Alkhateeb, H. Pham, L. Rueda, A. Ngom
Studying gene expression across time intervals of breast cancer survival may provide new insights into recovery from the disease. In this work, we propose a hierarchical clustering method to separate dissimilar groups of gene time-series profiles, namely those with the greatest distances from the rest of the profiles across the time intervals. The isolated outliers can be used as potential biomarkers of breast cancer survivability. Gene expression values at those time points are interpolated with cubic splines to create a trend profile for each gene. After universally aligning the profiles to minimize the vertical area between each pair of profiles, we cluster the genes using hierarchical clustering based on the minimized vertical distances [1]. An appropriate number of clusters was chosen based on the profile alignment and agglomerative clustering (PAAC) index as well as visual inspection of the clusters. Our study suggests that combining proper clustering, a suitable distance function, and index-based cluster validation yields a suitable model for identifying genes as informative biomarkers of breast cancer survivability.
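The align-then-cluster idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the profiles have already been sampled on a common time grid (so the spline step is omitted), uses the sum of squared vertical differences as a stand-in for the vertical area, and uses average linkage. The gene names and values are hypothetical. Under the squared-difference stand-in, subtracting each profile's own mean gives the vertical shift that minimizes the distance for every pair at once.

```python
from itertools import combinations


def center(profile):
    """Shift a profile vertically so that its mean is zero (the alignment step)."""
    mean = sum(profile) / len(profile)
    return [value - mean for value in profile]


def distance(p, q):
    """Sum of squared vertical differences between two aligned profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def agglomerate(profiles, n_clusters):
    """Average-linkage agglomerative clustering of vertically aligned profiles."""
    aligned = {name: center(values) for name, values in profiles.items()}
    clusters = [[name] for name in sorted(aligned)]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(distance(aligned[a], aligned[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters (i < j, so deleting j is safe).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical expression profiles sampled at four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [11.0, 12.0, 13.0, 14.0],  # same shape as geneA, shifted up
    "geneC": [4.0, 3.0, 2.0, 1.0],      # opposite trend
    "geneD": [5.0, 5.0, 9.0, 5.0],
}
three = agglomerate(profiles, 3)
two = agglomerate(profiles, 2)
```

In this toy input, geneA and geneB merge first because they differ only by a vertical shift, and geneC is the last profile left on its own, which is the kind of isolated profile the abstract treats as an outlier.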
{"title":"Outlier Genes as Biomarkers of Breast Cancer Survivability in Time-Series Data","authors":"Naveen Mangalakumar, A. Alkhateeb, H. Pham, L. Rueda, A. Ngom","doi":"10.1145/3107411.3108202","DOIUrl":"https://doi.org/10.1145/3107411.3108202","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"127050249","RegionNum":0,"Total":0}
The majority of genes in eukaryotes consist of multiple protein domains that can be independently lost or gained during evolution. This gain and loss of protein domains, through domain duplications, transfers, or losses, has important evolutionary and functional consequences for genes. Yet most computational methods for studying gene evolution view genes as the basic unit of evolution and assume that evolutionary processes such as duplications and losses act on entire genes rather than on parts of genes. Specifically, even though it is well understood that domains evolve inside genes and genes inside species, no computational framework exists that simultaneously models the evolution of domains, genes, and species and accounts for their interdependence. Here, we develop a three-tree model of domain evolution that explicitly captures the interdependence of domain-, gene-, and species-level evolution. Our model extends the classical phylogenetic reconciliation framework, which infers gene family evolution by comparing gene trees with a species tree, by explicitly accounting for domain-level events. The new model decouples domain-level events from gene-level events and provides a much more fine-grained view of gene family and domain family evolution that is easy to interpret. Specifically, we (i) introduce the new three-tree computational framework, (ii) prove that the associated optimization problem is NP-hard, (iii) devise an efficient heuristic solution for the problem, (iv) apply our algorithm to a large dataset of about 4000 domain trees and 7000 gene trees from 12 fly species, and (v) demonstrate the impact of using our new computational framework by comparing the inferred evolutionary histories against those obtained using existing approaches. Our experimental results show that using the new three-tree model has a significant impact on the inference of both domain-level and gene-level events, and on the inference of domain content in ancestral genes and gene content in ancestral species, compared to existing approaches.
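For orientation, the classical gene-tree/species-tree reconciliation that the three-tree model extends can be sketched as follows. This is only the well-known two-tree least-common-ancestor (LCA) mapping that counts duplications, not the authors' three-tree algorithm. The nested-tuple tree encoding, the `species_of` naming convention, and the species labels are assumptions made for this illustration.

```python
def build_parents(tree, parent=None, parents=None):
    """Map every species-tree node (leaf name or nested tuple) to its parent."""
    if parents is None:
        parents = {}
    parents[tree] = parent
    if isinstance(tree, tuple):
        for child in tree:
            build_parents(child, tree, parents)
    return parents


def lca(a, b, parents):
    """Least common ancestor of two species-tree nodes."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parents[node]
    node = b
    while node not in ancestors:
        node = parents[node]
    return node


def reconcile(gene_tree, parents, species_of):
    """Return (species node the gene-tree root maps to, number of duplications)."""
    if not isinstance(gene_tree, tuple):
        return species_of(gene_tree), 0
    left, right = gene_tree
    map_left, dups_left = reconcile(left, parents, species_of)
    map_right, dups_right = reconcile(right, parents, species_of)
    mapped = lca(map_left, map_right, parents)
    # A node is a duplication if it maps to the same species node as a child.
    is_dup = 1 if mapped in (map_left, map_right) else 0
    return mapped, dups_left + dups_right + is_dup


# Hypothetical three-species tree and a gene tree with one extra gene copy.
species_tree = (("dmel", "dsim"), "dpse")
gene_tree = (("dmel_g1", "dsim_g1"), ("dmel_g2", "dpse_g1"))
parents = build_parents(species_tree)
root_map, n_dups = reconcile(gene_tree, parents, lambda leaf: leaf.split("_")[0])
```

Here the gene-tree root maps to the species-tree root and is flagged as the single duplication, because one of its children already maps to that same node.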
{"title":"An Integrated Reconciliation Framework for Domain, Gene, and Species Level Evolution","authors":"Lei Li, Mukul S. Bansal","doi":"10.1145/3107411.3108220","DOIUrl":"https://doi.org/10.1145/3107411.3108220","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"132023730","RegionNum":0,"Total":0}
Influenza A Virus (IAV) is remarkably adept at surviving in human populations. IAV thrives even among populations with widespread access to vaccines and antiviral drugs, and continues to be a major cause of morbidity and mortality. Correlated mutations are an important factor in IAV's evolution and are critical for host adaptation and pathogenicity. Large sets of publicly available IAV sequences, combined with the virus's rapid and complex evolutionary dynamics, present interesting opportunities and unique challenges for analyzing correlated mutations in influenza proteomes. In this work, we performed a comprehensive analysis of correlated mutations in IAV using a network theory approach in which residues in each protein act as nodes in the graph and edges are created based on inter-residue correlated mutations. Our approach used the maximal information coefficient (MIC) to compute correlations between residues; an edge connects two nodes if their MIC exceeds a threshold. We created a modular and robust pipeline and applied it to multiple datasets of H1N1, H3N2, H5 and H7N9 subtypes. We studied the structural dynamics of IAV sub-systems based on topological properties of their networks, resulting in several important conclusions. The main finding is that correlated mutation networks in IAV are subtype- and host-specific, and that the differences between subtypes and hosts are significant. We identified the nodes with the highest degree, along with the edges and triplets with the strongest weights, for each network. To contextualize our results, we performed entropy analysis to gain a global view of sequence variation and computed solvent accessibility profiles to identify statistical differences in correlation profiles between surface and buried residues. To understand the extent of co-variation between the 10 proteins in IAV sequences, we created visualizations of protein correlation graphs in which the proteins act as nodes and the strength of the connection between two nodes depends on the number of correlated mutations between residues of the connected proteins. A web application and visualization tools were developed to explore the results and search for correlated mutations.
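The network construction described above (residues as nodes, an edge when a pairwise correlation score exceeds a threshold) can be sketched as below. One substitution should be noted plainly: the sketch uses plain mutual information between alignment columns as a stand-in for MIC, which is more involved to compute. The toy alignment, the threshold, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )


def correlation_network(alignment, threshold):
    """Return {(i, j): score} for column pairs whose score exceeds the threshold."""
    columns = list(zip(*alignment))
    edges = {}
    for i, j in combinations(range(len(columns)), 2):
        score = mutual_information(columns[i], columns[j])
        if score > threshold:
            edges[(i, j)] = score
    return edges


def degrees(edges):
    """Number of edges incident to each residue position."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return degree


# Hypothetical aligned sequences: positions 0 and 1 always change together.
alignment = ["AKLV", "AKLI", "GRLV", "GRLI"]
edges = correlation_network(alignment, threshold=0.5)
degree = degrees(edges)
```

In this toy alignment only positions 0 and 1 co-vary, so they form the single edge; position 2 is invariant and position 3 varies independently of the others.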
{"title":"Network Analysis of Correlated Mutations in Influenza","authors":"Uday Yallapragada, I. Vaisman","doi":"10.1145/3107411.3108237","DOIUrl":"https://doi.org/10.1145/3107411.3108237","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"132538059","RegionNum":0,"Total":0}
{"title":"Session details: Session 4: Genomic Variation and Disease","authors":"Anna M. Ritz","doi":"10.1145/3254547","DOIUrl":"https://doi.org/10.1145/3254547","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"131859129","RegionNum":0,"Total":0}
D. Adjeroh, Maen Allaga, Jun Tan, Jie Lin, Yue Jiang, A. Abbasi, Xiaobo Zhou
In this work, we study string-based approaches to the problem of RNA-Protein Interaction (RPI) prediction. We apply string algorithms and data structures to extract effective string patterns for predicting RPI, using both sequence information (protein and RNA sequences) and structure information (protein and RNA secondary structures). This leads to different string-based models for predicting interacting RNA-protein pairs. We present results that demonstrate the effectiveness of the proposed string-based models, including comparisons against state-of-the-art methods.
{"title":"String-Based Models for Predicting RNA-Protein Interaction","authors":"D. Adjeroh, Maen Allaga, Jun Tan, Jie Lin, Yue Jiang, A. Abbasi, Xiaobo Zhou","doi":"10.1145/3107411.3107508","DOIUrl":"https://doi.org/10.1145/3107411.3107508","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"133403158","RegionNum":0,"Total":0}
Drug repositioning is a promising strategy in drug discovery. New biomedical insights into drug-target-disease relationships are important in drug repositioning, and such relationships have been intensively studied recently. Most of these studies use network-based computational approaches based on drug and disease similarities. However, one common limitation of existing approaches is that both drug similarities and disease similarities are defined based on a single feature of drugs or diseases. In reality, the relationships between drug (or disease) pairs can be characterized based on many different features, so it is increasingly important to include them in drug repositioning studies. In this study, we propose a flexible and robust multi-source learning (FRMSL) framework to integrate multiple heterogeneous data sources for drug-disease association prediction. We first construct a two-layer heterogeneous network consisting of drug nodes, disease nodes and known drug-disease relationships. The drug repositioning problem can thus be treated as a missing-link prediction problem on the heterogeneous graph and can be solved using the Kronecker regularized least squares (KronRLS) method. Multiple data sources describing drugs and diseases are incorporated into the framework using similarity-based kernels. In practice, a great challenge in such data integration projects is data incompleteness, which arises from the nature of data generation and collection. To address this issue, we develop a novel multi-view learning algorithm based on symmetric nonnegative matrix factorization (SymNMF). Extensive experimental studies show that our framework outperforms several recent network-based methods.
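For readers unfamiliar with KronRLS, its standard closed form is sketched below. The notation is assumed here rather than taken from the abstract, and the FRMSL framework may differ in detail: K_d and K_s denote the drug and disease kernel matrices, Y the known drug-by-disease association matrix, vec(.) column stacking, and lambda > 0 the regularization weight.

```latex
% Standard KronRLS prediction (assumed notation, see the lead-in):
\[
\operatorname{vec}(\hat{Y}) = K\,(K + \lambda I)^{-1}\operatorname{vec}(Y),
\qquad K = K_s \otimes K_d .
\]
% With eigendecompositions K_d = Q_d \Lambda_d Q_d^{\top} and
% K_s = Q_s \Lambda_s Q_s^{\top}, the large inverse is never formed explicitly:
\[
\operatorname{vec}(\hat{Y})
  = (Q_s \otimes Q_d)\,\Lambda\,(\Lambda + \lambda I)^{-1}\,(Q_s \otimes Q_d)^{\top}\operatorname{vec}(Y),
\qquad \Lambda = \Lambda_s \otimes \Lambda_d .
\]
```

Because the matrix Lambda is diagonal, the second form needs only the two small eigendecompositions, which is what keeps the Kronecker formulation tractable when the number of drug-disease pairs is large.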
{"title":"A Flexible and Robust Multi-Source Learning Algorithm for Drug Repositioning","authors":"Huiyuan Chen, Jing Li","doi":"10.1145/3107411.3107473","DOIUrl":"https://doi.org/10.1145/3107411.3107473","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"133174011","RegionNum":0,"Total":0}
Blake Camp, J. Mandivarapu, Jay Mehta, Nagashayana Ramamurthy, James Wingo, A. Bourgeois, Xiaojun Cao, Rajshekhar Sunderraman
The CDC's Epi-Info is widely used by epidemiologists and public health researchers to collect and analyze public health data, especially during outbreaks. As it exists today, Epi-Info runs only on the Windows platform and consists of separate codebases for several different devices and use cases. Software portability has become increasingly important over the past few years. In this poster, we present a cross-platform architecture for Epi-Info. To simplify and expedite future development, the cross-platform system architecture uses Electron, AngularJS, and Python, and can run on virtually any desktop or laptop computer. Additionally, the code can be easily deployed to the Web, and has the potential to be a viable solution for several mobile use cases.
{"title":"A Cross-Platform System Architecture for Form Design and Data Analytics for Public Health","authors":"Blake Camp, J. Mandivarapu, Jay Mehta, Nagashayana Ramamurthy, James Wingo, A. Bourgeois, Xiaojun Cao, Rajshekhar Sunderraman","doi":"10.1145/3107411.3108223","DOIUrl":"https://doi.org/10.1145/3107411.3108223","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"129250379","RegionNum":0,"Total":0}
Fransiskus Xaverius Ivan, Xinrui Zhou, A. Deshpande, Rui Yin, Jie Zheng, C. Kwoh
Various computational and statistical approaches have been proposed to uncover the mutational patterns of rapidly evolving influenza viral genes. A problem that draws much attention is identifying pairs of sites that potentially co-mutate to contribute to the overall fitness of the virus. Unlike previous methods that extract mutations from sequence alignments, here we propose a novel method that relies on identifying mutations in phylogenetic trees reconstructed from resampled sequence data. Since the method takes into account the evolutionary structure present in the sequence data, spurious mutations obtained by comparing sequences from different clades can be removed. Furthermore, this approach not only allows us to capture site-pairs that potentially co-mutate, but also provides an opportunity to extract the direction of their relationships. By applying network analyses to the set of site-pairs, we could further identify and rank the sites that are likely to be influential or to be influenced by changes at other sites. We applied the method to the hemagglutinin of influenza H3N2 and, interestingly, recovered mutational sites that are important for antigenic cluster transitions of the virus among our top-ranked findings. Moreover, we detected a directional relationship that would be interesting for experimental investigation.
{"title":"Phylogenetic Tree based Method for Uncovering Co-mutational Site-pairs in Influenza Viruses","authors":"Fransiskus Xaverius Ivan, Xinrui Zhou, A. Deshpande, Rui Yin, Jie Zheng, C. Kwoh","doi":"10.1145/3107411.3107479","DOIUrl":"https://doi.org/10.1145/3107411.3107479","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"124175238","RegionNum":0,"Total":0}
Today's massive amount of biological sequence data has the potential to rapidly advance our understanding of life's processes. However, because analyzing biological sequences is computationally very expensive, users face a formidable challenge in trying to analyze these data on their own. Cloud computing offers access to a large amount of computing resources in an on-demand, pay-per-use fashion, which makes it a practical way to analyze these huge data sets. However, many people are still reluctant to outsource biological sequences to the cloud because the sequences contain sensitive information that should be kept secret for ethical, security, and legal reasons. One of the most fundamental and frequently used computational tools for biological sequence analysis is pairwise sequence alignment (PSA). Previous work on securely solving PSAs in the cloud suffers from poor scalability: it cannot exploit the cloud's infrastructure to solve PSAs in parallel because resource-limited users need to be constantly involved in the computations. In this paper, we develop a secure outsourcing algorithm that allows users to solve an arbitrary number of PSAs in parallel in the cloud. Compared with previous work, our algorithm can reduce the computing time of a large number of PSAs by more than 50% with as few as 5 computing nodes in the cloud.
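As background on the computation being outsourced, a plain (non-secure) pairwise alignment score can be sketched with the Needleman-Wunsch recurrence. This shows only the underlying dynamic program, not the secure outsourcing algorithm of the paper. The scoring parameters and example sequences are assumptions chosen for illustration.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of strings a and b (Needleman-Wunsch, linear gaps)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Each pair is an independent problem, so a batch can be split across workers.
pairs = [("GATTACA", "GATTACA"), ("ACGT", "AGT"), ("", "ACG")]
scores = [nw_score(x, y) for x, y in pairs]
```

Each call depends only on its own pair of sequences, which is why a batch of PSAs parallelizes naturally once the user no longer has to take part in every step.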
{"title":"Secure Cloud Computing for Pairwise Sequence Alignment","authors":"Sergio Salinas, Pan Li","doi":"10.1145/3107411.3107477","DOIUrl":"https://doi.org/10.1145/3107411.3107477","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"219 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"114535739","RegionNum":0,"Total":0}