Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics: Latest Articles

Automated Next Generation Sequencing Bioinformatics Pipelines for Pathogen Discovery and Surveillance

Pub Date : 2017-08-20 DOI: 10.1145/3107411.3108192

M. Okomo-Adhiambo, E. Ramos, Reagan J. Kelly, Yatish Jain, R. Tatusov, A. Montmayeur, Gregory Doho, Rachel L. Marine, T. Ng, Adam C. Retchless, S. Oberste, P. Rota, X. Wang, Agha N. Khan

Next-generation sequencing (NGS) has become a vital tool in clinical microbiology, with numerous applications in infectious disease diagnostics, outbreak investigations, and public health surveillance. Although the NGS technology enables comprehensive pathogen detection in a relatively short time at a low cost, the enormous amount of genomics data generated creates a critical challenge of effectively organizing, archiving, analyzing, and reporting the results within a clinically relevant timeframe. Automated pipelines provide the first step in standardizing NGS data processing and reporting, thus eliminating the common bottlenecks in bioinformatics analyses, and providing rapid turnaround. Here, we present the Viral NGS Pipeline optimized for identification and whole genome assembly of viruses, and the Bacterial Meningococcus Genome Analysis Platform (BMGAP), designed for genotypic characterization of meningitis pathogens. These respective pipelines have been used to analyze more than 11,000 clinical samples and isolates. The pipelines are deployable on both standalone and cloud-based servers, enabling their accessibility to internal CDC users, as well as external partners, including state public health laboratories and other collaborators worldwide. These automated pipelines have the potential to contribute to the development of unbiased NGS-based clinical assays for pathogen detection that demand rapid turnaround times, and are expected to play a key role in infectious disease surveillance in the future.

Citations: 0

Outlier Genes as Biomarkers of Breast Cancer Survivability in Time-Series Data

Pub Date : 2017-08-20 DOI: 10.1145/3107411.3108202

Naveen Mangalakumar, A. Alkhateeb, H. Pham, L. Rueda, A. Ngom

Studying gene expression across different time intervals of breast cancer survival may provide new insights into recovery from the disease. In this work, we propose a hierarchical clustering method to separate dissimilar groups of gene time-series profiles, namely those that lie furthest from the rest of the profiles across the different time intervals. The isolated outliers can be used as potential biomarkers of breast cancer survivability. Gene expression values at those time points are interpolated with cubic splines to create a trend profile for each gene. After universally aligning the profiles to minimize the vertical area between each pair of profiles, we cluster the genes using hierarchical clustering based on the minimized vertical distances [1]. An appropriate number of clusters was chosen based on the profile alignment and agglomerative clustering (PAAC) index as well as visual inspection of the clusters. Our study suggests that the combination of proper clustering, a distance function, and index validation for clusters is a suitable model for identifying genes that are informative biomarkers of breast cancer survivability.
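
The align-then-cluster idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: mean-centering each profile and summing squared pointwise differences stands in for cubic spline interpolation and vertical-area minimization, average linkage is an assumed choice, and the PAAC index is not computed. All function names and the toy profiles are hypothetical.

```python
from itertools import combinations


def center(profile):
    """Shift a profile vertically so that its mean is zero."""
    mean = sum(profile) / len(profile)
    return [v - mean for v in profile]


def distance(p, q):
    """Sum of squared vertical differences between two centered profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def agglomerate(profiles, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of gene names."""
    centered = {gene: center(p) for gene, p in profiles.items()}
    clusters = [[gene] for gene in profiles]

    def linkage(c1, c2):
        total = sum(distance(centered[a], centered[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair; i < j, so deleting index j is safe.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy expression profiles over four time points (made-up values).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [11.0, 12.0, 13.0, 14.0],  # same shape as geneA, shifted upward
    "geneC": [1.1, 2.1, 2.9, 4.0],
    "geneD": [4.0, 1.0, 4.0, 1.0],      # oscillating profile, unlike the others
}
clusters = agglomerate(profiles, n_clusters=2)
outliers = [c[0] for c in clusters if len(c) == 1]
print(clusters, outliers)
```

Centering makes geneA and geneB identical, which is the point of aligning before clustering: only the shape of the trend matters, not its baseline level.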

Citations: 2

An Integrated Reconciliation Framework for Domain, Gene, and Species Level Evolution

Pub Date : 2017-08-20 DOI: 10.1145/3107411.3108220

Lei Li, Mukul S. Bansal

The majority of genes in eukaryotes consist of multiple protein domains that can be independently lost or gained during evolution. This gain and loss of protein domains, through domain duplications, transfers, or losses, has important evolutionary and functional consequences for genes. Yet most computational methods for studying gene evolution view genes as the basic unit of evolution and assume that evolutionary processes such as duplications and losses act on entire genes rather than on parts of genes. Specifically, even though it is well understood that domains evolve inside genes and genes inside species, no computational framework exists to simultaneously model the evolution of domains, genes, and species and account for their interdependence. Here, we develop a three-tree model of domain evolution that explicitly captures the interdependence of domain-, gene-, and species-level evolution. Our model extends the classical phylogenetic reconciliation framework, which infers gene family evolution by comparing gene trees with species trees, by explicitly accounting for domain-level events. The new model decouples domain-level events from gene-level events and provides a much more fine-grained view of gene family and domain family evolution that is easy to interpret. Specifically, we (i) introduce the new three-tree computational framework, (ii) prove that the associated optimization problem is NP-hard, (iii) devise an efficient heuristic solution for the problem, (iv) apply our algorithm to a large dataset of about 4000 domain trees and 7000 gene trees from 12 fly species, and (v) demonstrate the impact of using our new computational framework by comparing the inferred evolutionary histories against those obtained using existing approaches. Our experimental results show that using the new three-tree model has a significant impact on the inference of both domain-level and gene-level events, and on the inference of domain content in ancestral genes and gene content in ancestral species, compared to existing approaches.
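
For orientation, the classical two-tree reconciliation that the abstract says the model extends can be sketched as follows. This is only the textbook lowest-common-ancestor mapping between one gene tree and one species tree, counting duplications; it does not implement the three-tree model, domain-level events, transfers, or the paper's heuristic. Trees are written as nested tuples, and every name below is hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple binary tree."""
    if isinstance(tree, str):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)


def lca(species_tree, species_set):
    """Smallest subtree of species_tree containing every species in species_set."""
    if isinstance(species_tree, str):
        return species_tree
    for child in species_tree:
        if species_set <= leaves(child):
            return lca(child, species_set)
    return species_tree


def count_duplications(gene_tree, species_tree, species_of):
    """Count gene-tree nodes that map to the same species node as one of their children."""
    def mapping(node):
        return lca(species_tree, {species_of[g] for g in leaves(node)})

    if isinstance(gene_tree, str):
        return 0
    here = mapping(gene_tree)
    is_duplication = any(mapping(child) == here for child in gene_tree)
    return int(is_duplication) + sum(
        count_duplications(child, species_tree, species_of) for child in gene_tree
    )


# Toy example: species A and B each carry two gene copies, species C carries one.
species_tree = (("A", "B"), "C")
gene_tree = ((("a1", "b1"), ("a2", "b2")), "c1")
species_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
n_dups = count_duplications(gene_tree, species_tree, species_of)
print(n_dups)
```

In the toy example a single duplication is inferred on the ancestor of A and B, which explains why both species carry two copies.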

Citations: 10

Network Analysis of Correlated Mutations in Influenza

Pub Date : 2017-08-20 DOI: 10.1145/3107411.3108237

Uday Yallapragada, I. Vaisman

Influenza A virus (IAV) is remarkably adept at surviving in human populations. IAV thrives even among populations with widespread access to vaccines and antiviral drugs, and continues to be a major cause of morbidity and mortality. Correlated mutations are an important factor in IAV's evolution and are critical for host adaptation and pathogenicity. Large sets of publicly available IAV sequences, combined with the virus's rapid and complex evolutionary dynamics, present interesting opportunities and unique challenges for analyzing correlated mutations in influenza proteomes. In this work, we performed a comprehensive analysis of correlated mutations in IAV using a network theory approach in which residues in each protein act as nodes in the graph and edges are created based on inter-residue correlated mutations. Our approach uses the maximal information coefficient (MIC) to compute correlations between residues; an edge connects two nodes if their MIC exceeds a threshold. We created a modular and robust pipeline and applied it to multiple datasets of the H1N1, H3N2, H5, and H7N9 subtypes. We studied the structural dynamics of IAV sub-systems based on the topological properties of their networks, resulting in several important conclusions. The main finding is that correlated mutation networks in IAV are subtype- and host-specific, and that the differences between subtypes and hosts are significant. We identified the nodes with the highest degree, along with the edges and triplets with the strongest weight, for each network. To contextualize our results, we performed entropy analysis to gain a global view of sequence variation and computed solvent accessibility profiles to identify statistical differences in correlation profiles between surface and buried residues. To understand the extent of co-variation between the 10 proteins in IAV sequences, we created visualizations of protein correlation graphs in which the proteins act as nodes and the strength of the connection between two nodes depends on the number of correlated mutations between residues of the connected proteins. A web application and visualization tools for exploring the results and searching for correlated mutations were developed.
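
The network construction described above (residue positions as nodes, an edge wherever the correlation between two alignment columns exceeds a threshold) can be sketched as below. As a loudly stated substitution, plain mutual information is used in place of the maximal information coefficient, which needs a dedicated estimator. The function names, the threshold, and the toy alignment are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_x)
    px, py = Counter(col_x), Counter(col_y)
    pxy = Counter(zip(col_x, col_y))
    return sum((c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def correlation_network(sequences, threshold):
    """Return {(i, j): score} for column pairs whose score exceeds the threshold."""
    columns = list(zip(*sequences))  # one tuple of residues per alignment column
    edges = {}
    for i, j in combinations(range(len(columns)), 2):
        score = mutual_information(columns[i], columns[j])
        if score > threshold:
            edges[(i, j)] = score
    return edges


# Toy alignment: columns 0 and 1 change together, column 2 is invariant.
alignment = ["AKL", "AKL", "GRL", "GRL"]
edges = correlation_network(alignment, threshold=0.5)

# Node degree, the simplest topological property mentioned in the abstract.
degree = Counter(node for edge in edges for node in edge)
print(edges, dict(degree))
```

Positions are 0-based column indices here; a real pipeline would map them back to residue numbering of the reference protein.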

Citations: 0

Session details: Session 4: Genomic Variation and Disease

Pub Date : 2017-08-20 DOI: 10.1145/3254547

Anna M. Ritz

Citations: 0

String-Based Models for Predicting RNA-Protein Interaction

Pub Date : 2017-08-20 DOI: 10.1145/3107411.3107508

D. Adjeroh, Maen Allaga, Jun Tan, Jie Lin, Yue Jiang, A. Abbasi, Xiaobo Zhou

In this work, we study string-based approaches for the problem of RNA-Protein Interaction (RPI). We apply string algorithms and data structures to extract effective string patterns for prediction of RPI, using both sequence information (protein and RNA sequences), and structure information (protein and RNA secondary structures). This led to different string-based models for predicting interacting RNA-protein pairs. We show results that demonstrate the effectiveness of the proposed string-based models, including comparative results against state-of-the-art methods.
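
The abstract does not say which string patterns are extracted, so the following is only a guess at the simplest member of that family: overlapping k-mer counts from the RNA and protein sequences, combined into one feature dictionary per candidate pair. The function names, the k values, and the toy sequences are hypothetical, and structure-derived strings are not covered.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def pair_features(rna, protein, k_rna=3, k_protein=2):
    """Build one sparse feature dictionary for an RNA-protein pair."""
    features = {f"rna:{kmer}": n for kmer, n in kmer_counts(rna, k_rna).items()}
    features.update(
        {f"protein:{kmer}": n for kmer, n in kmer_counts(protein, k_protein).items()}
    )
    return features


# Toy sequences, far shorter than real transcripts or proteins.
features = pair_features("AUGAUG", "MKMK")
print(features)
```

A dictionary like this could be fed to any standard classifier after vectorization; which model the authors used is not stated in the abstract.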

Citations: 0

A Flexible and Robust Multi-Source Learning Algorithm for Drug Repositioning

Pub Date : 2017-08-20 DOI: 10.1145/3107411.3107473

Huiyuan Chen, Jing Li

Drug repositioning is a promising strategy in drug discovery. New biomedical insights into drug-target-disease relationships are important in drug repositioning, and such relationships have been intensively studied recently. Most studies use network-based computational approaches built on drug and disease similarities. However, one common limitation of existing approaches is that both drug similarities and disease similarities are defined based on a single feature of drugs or diseases. In reality, the relationships between drug (or disease) pairs can be characterized by many different features, so it is increasingly important to include them in drug repositioning studies. In this study, we propose a flexible and robust multi-source learning (FRMSL) framework to integrate multiple heterogeneous data sources for drug-disease association prediction. We first construct a two-layer heterogeneous network consisting of drug nodes, disease nodes, and known drug-disease relationships. The drug repositioning problem can thus be treated as a missing link prediction problem on the heterogeneous graph and can be solved using the Kronecker regularized least squares (KronRLS) method. Multiple data sources describing drugs and diseases are incorporated into the framework using similarity-based kernels. In practice, a great challenge in such data integration projects is data incompleteness, which stems from the way the data are generated and collected. To address this issue, we develop a novel multi-view learning algorithm based on symmetric nonnegative matrix factorization (SymNMF). Extensive experimental studies show that our framework outperforms several recent network-based methods.

Citations: 18

A Cross-Platform System Architecture for Form Design and Data Analytics for Public Health

Pub Date : 2017-08-20 DOI: 10.1145/3107411.3108223

Blake Camp, J. Mandivarapu, Jay Mehta, Nagashayana Ramamurthy, James Wingo, A. Bourgeois, Xiaojun Cao, Rajshekhar Sunderraman

The CDC's Epi-Info is widely used by epidemiologists and public health researchers to collect and analyze public health data, especially during outbreaks. As it exists today, Epi-Info runs only on the Windows platform and consists of separate code bases for several different devices and use cases. Software portability has become increasingly important over the past few years. In this poster, we present a cross-platform architecture for Epi-Info. To simplify and expedite future development, the cross-platform system architecture uses Electron, AngularJS, and Python, and can run on virtually any desktop or laptop computer. Additionally, the code can be easily deployed to the Web and has the potential to be a viable solution for several mobile use cases.

Citations: 1

Phylogenetic Tree based Method for Uncovering Co-mutational Site-pairs in Influenza Viruses

Pub Date : 2017-08-20 DOI: 10.1145/3107411.3107479

Fransiskus Xaverius Ivan, Xinrui Zhou, A. Deshpande, Rui Yin, Jie Zheng, C. Kwoh

Various computational and statistical approaches have been proposed to uncover the mutational patterns of rapidly evolving influenza viral genes. A problem that draws much attention is identifying pairs of sites that potentially co-mutate to contribute to the overall fitness of the virus. Unlike previous methods that extract mutations from sequence alignments, here we propose a novel method that relies on identifying mutations in phylogenetic trees reconstructed from resampled sequence data. Since the method takes into account the evolutionary structure present in the sequence data, spurious mutations obtained by comparing sequences from different clades can be removed. Furthermore, this approach not only allows us to capture site-pairs that potentially co-mutate, but also provides an opportunity to extract the direction of their relationships. By applying network analyses to the set of site-pairs, we can further identify and rank the sites that are likely to be influential or to be influenced by changes at other sites. We applied the method to the hemagglutinin of influenza H3N2 and, interestingly, recovered mutational sites that are important for cluster antigenic transition of the virus at the top of our list of findings. Moreover, we detected a directional relationship that would be interesting for experimental investigation.
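
The core step of reading mutations off tree branches rather than off a flat alignment can be sketched as follows. The sketch assumes that sequences for internal nodes have already been reconstructed, counts only site pairs that change on the same branch, and ignores resampling and the directional analysis; the abstract does not specify these details. The tree encoding (child to parent), the names, and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def branch_mutations(parent_seq, child_seq):
    """Return the 0-based sites at which a child differs from its parent."""
    return [i for i, (a, b) in enumerate(zip(parent_seq, child_seq)) if a != b]


def co_mutation_counts(parent_of, sequences):
    """Count, over all branches, the site pairs that mutate on the same branch."""
    counts = Counter()
    for child, parent in parent_of.items():
        sites = branch_mutations(sequences[parent], sequences[child])
        counts.update(combinations(sites, 2))
    return counts


# Toy tree: root -> n1 -> (t1, t2) and root -> n2 -> t3.
parent_of = {"n1": "root", "n2": "root", "t1": "n1", "t2": "n1", "t3": "n2"}
sequences = {
    "root": "AAAA",
    "n1": "CAGA",  # sites 0 and 2 change together on this branch
    "n2": "AAAA",
    "t1": "CAGA",
    "t2": "CAGT",  # a single change at site 3, so no pair is counted
    "t3": "CAGA",  # sites 0 and 2 change together again, independently
}
counts = co_mutation_counts(parent_of, sequences)
print(counts)
```

Comparing the leaves directly would report sites 0 and 2 as differing in many sequence pairs; counting per branch records only the two independent events.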

Citations: 0

Secure Cloud Computing for Pairwise Sequence Alignment

Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics

Pub Date : 2017-08-20 DOI: 10.1145/3107411.3107477

Sergio Salinas, Pan Li

Today's massive amount of biological sequence data has the potential to rapidly advance our understanding of life's processes. However, because analyzing biological sequences is computationally very expensive, users face a formidable challenge in trying to analyze these data on their own. Cloud computing offers on-demand, pay-per-use access to large amounts of computing resources, which makes it a practical way to analyze these huge data sets. However, many people are still reluctant to outsource biological sequences to the cloud because the sequences contain sensitive information that should be kept secret for ethical, security, and legal reasons. One of the most fundamental and frequently used computational tools for biological sequence analysis is pairwise sequence alignment (PSA). Previous work on securely solving PSAs in the cloud suffers from poor scalability: it cannot exploit the cloud's infrastructure to solve PSAs in parallel because resource-limited users need to be constantly involved in the computations. In this paper, we develop a secure outsourcing algorithm that allows users to solve an arbitrary number of PSAs in parallel in the cloud. Compared with previous work, our algorithm can reduce the computing time of a large number of PSAs by more than 50% with as few as 5 computing nodes in the cloud.
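For readers unfamiliar with PSA, the underlying computation can be sketched as a small dynamic program. This is a plain, unencrypted Needleman-Wunsch global alignment score, not the secure outsourcing algorithm of the paper, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Return the optimal global alignment score of sequences a and b.

    Uses the classic dynamic program with a linear gap penalty and keeps
    only two rows of the score table, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = prev[j - 1] + substitution  # align a[i-1] with b[j-1]
            up = prev[j] + gap                     # gap inserted in b
            left = curr[j - 1] + gap               # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


# Each pair is scored independently of the others.
pairs = [("GATTACA", "GATTACA"), ("ACGT", "ACT"), ("A", "G")]
scores = [needleman_wunsch_score(x, y) for x, y in pairs]
print(scores)
```

Because each pair is scored independently, a batch of PSAs is naturally parallel; the difficulty addressed in the abstract is distributing that work to cloud nodes without revealing the sequences.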


Citations: 2