
Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics: Latest Publications

CNN Based Segmentation of Infarcted Regions in Acute Cerebral Stroke Patients From Computed Tomography Perfusion Imaging
Luca Tomasetti, K. Engan, M. Khanmohammadi, K. D. Kurz
More than 13 million people suffer from ischemic cerebral stroke worldwide each year. Thrombolytic treatment can reduce brain damage but has a narrow treatment window. Computed tomography perfusion imaging is a commonly used primary assessment tool for stroke patients; typically, radiologists evaluate the resulting parametric maps to estimate the affected areas, dead tissue (core), and the surrounding tissue at risk (penumbra), in order to decide further treatment. Prior work has proposed thresholds, semi-automated methods, and, in recent years, deep neural networks for segmenting infarction areas based on the parametric maps. However, there is no consensus on which thresholds to use or how to combine the information from the parametric maps, and the presented methods all have limitations in both accuracy and reproducibility. We propose a fully automated convolutional neural network based segmentation method that uses the full four-dimensional computed tomography perfusion dataset as input, rather than the pre-filtered parametric maps. The suggested network is tested on an available dataset as a proof of concept, with very encouraging results. Cross-validated results show average Dice scores of 0.78 and 0.53, and areas under the receiver operating characteristic curve of 0.97 and 0.94, for penumbra and core respectively.
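The Dice score reported above is the standard overlap metric on binary segmentation masks. A minimal sketch of that metric (not the authors' network or pipeline; the masks below are made up):

```python
def dice(pred, truth):
    """Dice coefficient 2|A∩B| / (|A|+|B|) between two binary masks,
    given as flat lists of 0/1. Returns 1.0 when both masks are empty."""
    inter = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2.0 * inter / total if total else 1.0

# Hypothetical flattened penumbra masks: prediction vs. ground truth.
pred  = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]
print(dice(pred, truth))  # 2*2 / (3+3) ≈ 0.667
```

A Dice score of 0.78, as reported for penumbra, means the predicted and annotated regions overlap substantially more than in this toy example.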
DOI: 10.1145/3388440.3412470 · Published: 2020-09-21 · Citations: 3
Predicting Criticality in COVID-19 Patients
Roger A. Hallman, Anjali Chikkula, T. Prioleau
The COVID-19 pandemic has infected millions of people around the world, spreading rapidly and causing a flood of patients that risks overwhelming clinical facilities. Whether in urban or rural areas, hospitals have limited resources and personnel to treat critical infections in intensive care units, which must be allocated effectively. To assist clinical staff in deciding which patients are in the greatest need of critical care, we develop a predictive model based on a publicly available data set that is rich in clinical markers. We perform statistical analysis to determine which clinical markers strongly correlate with hospital admission, semi-intensive care, and intensive care for COVID-19 patients. We create a predictive model that will assist clinical personnel in determining COVID-19 patient prognosis. Additionally, we take a step towards a global framework for COVID-19 prognosis prediction by incorporating statistical data for geographically and ethnically diverse COVID-19 patient sets into our own model. To the best of our knowledge, this is the first model that does not exclusively utilize local data.
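Correlating a continuous clinical marker with a binary outcome, as described above, can be done with a point-biserial correlation, which is just Pearson's r with one binary variable. A sketch with entirely hypothetical marker values and admission labels (not the paper's data or method):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical marker values and binary ICU-admission labels (1 = admitted).
marker = [5.0, 80.0, 12.0, 150.0, 7.0, 95.0]
icu    = [0,   1,    0,    1,     0,   1]
r = pearson(marker, icu)
print(round(r, 3))  # strong positive correlation on this toy data
```

In practice one would also test significance and compare many markers, but the core correlation computation is this simple.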
DOI: 10.1145/3388440.3412463 · Published: 2020-09-21 · Citations: 4
Avocado
Jacob M. Schreiber, Timothy J. Durham, W. Noble, J. Bilmes
In the past decade, the use of high-throughput sequencing assays has allowed researchers to experimentally acquire thousands of functional measurements for each basepair in the human genome. Despite their value, these measurements are only a small fraction of the potential experiments that could be performed while also being too numerous to easily visualize or compute on. In a recent pair of publications [1,2], we address both of these challenges with a deep neural network tensor factorization method, Avocado, that compresses these measurements into dense, information-rich representations. We demonstrate that these learned representations can be used to impute, with high accuracy, the output of tens of thousands of functional experiments that have not yet been performed. Further, we show that, on a variety of genomics tasks, machine learning models that leverage these learned representations outperform those trained directly on the functional measurements. The code is publicly available at https://github.com/jmschrei/avocado.
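Avocado's deep tensor factorization is far richer than this, but the underlying idea of imputing unobserved measurements from learned latent factors can be sketched with a rank-1 matrix factorization trained by stochastic gradient descent on a tiny toy matrix (all values below are hypothetical, not genomics data):

```python
import random

random.seed(0)

# Toy "assays x positions" signal matrix; one entry is unobserved (None).
# The observed rows are proportional, so the data is exactly rank 1 and
# the missing entry is recoverable as 0.6 * 5.0 = 3.0.
M = [[5.0, 3.0, 4.0],
     [4.0, 2.4, 3.2],
     [None, 1.8, 2.4]]

rows, cols = len(M), len(M[0])
u = [random.uniform(0.1, 1.0) for _ in range(rows)]  # row (assay) factors
v = [random.uniform(0.1, 1.0) for _ in range(cols)]  # column (position) factors

def pred(i, j):
    return u[i] * v[j]

lr = 0.02
for _ in range(3000):
    for i in range(rows):
        for j in range(cols):
            if M[i][j] is None:
                continue                   # train only on observed entries
            e = M[i][j] - pred(i, j)
            u[i], v[j] = u[i] + lr * e * v[j], v[j] + lr * e * u[i]

print(round(pred(2, 0), 2))  # imputed value for the held-out cell
```

The real model factorizes a three-dimensional tensor (assays x cell types x genomic positions) with a neural network on top of the latent factors, but the imputation principle is the same: fit latent representations to observed entries, then read off predictions for unobserved ones.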
DOI: 10.1145/3388440.3414215 · Published: 2020-09-21 · Citations: 1
From Interatomic Distances to Protein Tertiary Structures with a Deep Convolutional Neural Network
Yuanqi Du, Anowarul Kabir, Liang Zhao, Amarda Shehu
Elucidating biologically-active protein structures remains a daunting task in both the wet and the dry laboratory, and many proteins lack structural characterization. This lack of knowledge continues to motivate the development of computational methods for protein structure prediction. These methods are diverse in their approaches, and recent efforts have debuted deep learning-based methods for various sub-problems within the larger problem of protein structure prediction. In this paper, we focus on one such sub-problem: the reconstruction of three-dimensional structures consistent with given inter-atomic distances. Inspired by a recent architecture put forward in the larger context of generative frameworks, we design and evaluate a deep convolutional network model on experimentally- and computationally-obtained tertiary structures. Comparison with convex and stochastic optimization-based methods shows that the deep model is faster and similarly or more accurate, opening up several avenues of further research to advance the larger problem of protein structure prediction.
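The classical baseline the deep model is compared against, recovering coordinates consistent with given pairwise distances, can be sketched as gradient descent on a stress function. The toy configuration below is illustrative (four points in 2D, not protein data), and this is the generic optimization idea, not the authors' method:

```python
import math, random

random.seed(1)

# Target inter-point distances taken from a known toy configuration.
pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
n = len(pts)
target = {(i, j): math.dist(pts[i], pts[j])
          for i in range(n) for j in range(i + 1, n)}

# Re-embed from a random start by minimizing the stress function
#   stress(X) = sum_{i<j} (||x_i - x_j|| - t_ij)^2
X = [[random.uniform(0, 2), random.uniform(0, 2)] for _ in range(n)]

def stress(X):
    return sum((math.dist(X[i], X[j]) - t) ** 2 for (i, j), t in target.items())

s0 = stress(X)
lr = 0.01
for _ in range(2000):
    grad = [[0.0, 0.0] for _ in range(n)]
    for (i, j), t in target.items():
        d = math.dist(X[i], X[j]) or 1e-9
        c = 2 * (d - t) / d
        for k in range(2):
            g = c * (X[i][k] - X[j][k])
            grad[i][k] += g          # d(stress)/d(x_i)
            grad[j][k] -= g          # d(stress)/d(x_j)
    for i in range(n):
        for k in range(2):
            X[i][k] -= lr * grad[i][k]

print(s0, stress(X))  # stress drops sharply as distances are matched
```

Note that any rotation, translation, or reflection of the recovered points has the same stress; only the distance geometry is determined, which is also true of the protein reconstruction problem.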
DOI: 10.1145/3388440.3414699 · Published: 2020-09-21 · Citations: 0
Rhabdomyosarcoma Histology Classification using Ensemble of Deep Learning Networks
Saloni Agarwal, M. Abaker, Xinyi Zhang, O. Daescu, D. Barkauskas, E. Rudzinski, P. Leavey
A significant number of machine learning methods have been developed to identify major tumor types in histology images, yet much less is known about automatic classification of tumor subtypes. Rhabdomyosarcoma (RMS), the most common type of soft tissue cancer in children, has several subtypes, the most common being Embryonal, Alveolar, and Spindle Cell. Classifying RMS into the right subtype is critical, since subtypes are known to respond to different treatment protocols. Manual classification requires high expertise and is time consuming due to subtle variance in the appearance of histopathology images. In this paper, we introduce and compare machine learning based architectures for automatic classification of rhabdomyosarcoma into the three major subtypes from whole slide images (WSIs). For training purposes, we only know the class assigned to a WSI and have no manual annotations on the image, whereas most related work on tumor classification requires manual region or nuclei annotations on WSIs. To predict the class of a new WSI, we first divide it into tiles, predict the class of each tile, then use thresholding with soft voting to convert tile-level predictions into a WSI-level prediction. We obtain 94.87% WSI tumor subtype classification accuracy on a large and diverse test dataset. We achieve such accurate classification at the 5X magnification level of WSIs, unlike related work, which uses 20X or 10X for best results. A direct advantage of our method is that both training and testing are computationally much faster due to the lower image resolution.
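The tile-to-slide aggregation step described above, thresholding plus soft voting, can be sketched as follows. The threshold value, the fallback for all-uncertain slides, and the probability vectors are all assumptions for illustration, not the paper's exact scheme:

```python
def wsi_prediction(tile_probs, threshold=0.6):
    """Aggregate per-tile class probabilities into one WSI-level label.

    tile_probs: list of per-tile probability vectors over the three subtypes.
    Tiles whose top probability is below `threshold` are treated as
    uninformative and skipped; the remaining vectors are averaged
    (soft voting) and the argmax class is returned.
    """
    classes = ['Embryonal', 'Alveolar', 'Spindle Cell']
    kept = [p for p in tile_probs if max(p) >= threshold]
    votes = kept or tile_probs          # fall back if every tile is uncertain
    avg = [sum(p[c] for p in votes) / len(votes) for c in range(len(classes))]
    return classes[avg.index(max(avg))]

# Hypothetical per-tile outputs for one slide; the third tile is filtered out.
tiles = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.34, 0.33, 0.33], [0.2, 0.7, 0.1]]
print(wsi_prediction(tiles))  # 'Embryonal'
```

Soft voting (averaging probabilities) retains each tile's confidence, unlike hard majority voting over argmax labels, which is why it pairs naturally with a confidence threshold.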
DOI: 10.1145/3388440.3412486 · Published: 2020-09-21 · Citations: 2
Characterization of S. cerevisiae Protein Complexes by Representative DDI Graph Planarity
William Gasper, Kathryn M. Cooper, Nathan Cornelius, H. Ali, S. Bhowmick
With the increasing availability of various types of biological data and the ability to measure interrelationships among molecular elements, biological networks have quickly emerged as the go-to structure for modeling biological elements and relationships. However, there is not a large body of research that closely analyzes the properties of the various biological networks in ways that allow more valuable information to be extracted from them and that establish useful connections between network structures and corresponding biological properties. In particular, exploring the underlying graph properties of biological networks augments our understanding of biological organisms as complex systems. Understanding these properties is critical to generating knowledge from biological network models, and they become particularly interesting when they can be correlated with specific structural and functional qualities of the entities represented by the graph. Planarity may be especially important to understanding and identifying protein complexes, which are frequently subject to physical constraints that may prevent the constituent protein components from interacting in a way that would make the resulting graph abstraction densely connected. In this work, we investigate the planarity of domain-domain interaction (DDI) graphs for S. cerevisiae protein complexes with validated three-dimensional structures. We found that the majority of these protein complexes were planar, even after excluding complexes whose DDI graphs were small and had very few edges. We also found significant structural and functional differences between groups of complexes with planar and nonplanar DDI graphs. These results provide additional context for the study of protein complexes within the network model, and this additional context may be important for general knowledge generation as well as for specific tasks like protein complex identification.
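Full planarity testing uses linear-time algorithms such as Hopcroft-Tarjan or the left-right test, but a quick necessary condition follows directly from Euler's formula: a simple planar graph on v >= 3 vertices has at most 3v - 6 edges. A minimal sketch of that bound (an illustration of the graph property studied above, not the authors' analysis pipeline):

```python
def may_be_planar(num_vertices, edges):
    """Necessary (not sufficient) planarity condition from Euler's formula:
    a simple planar graph with v >= 3 vertices has at most 3v - 6 edges.
    Returning True means the bound does not rule planarity out."""
    v, e = num_vertices, len(edges)
    return v < 3 or e <= 3 * v - 6

# A maximally dense DDI-like graph: K5, 5 domains with all pairs interacting.
k5_edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]
print(may_be_planar(5, k5_edges))  # False: 10 edges > 3*5 - 6 = 9
```

This matches the intuition in the abstract: physical constraints that keep DDI graphs from being densely connected also keep them under the edge bound, so sparse complexes tend to pass (and often are) planar, while densely connected ones cannot be.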
DOI: 10.1145/3388440.3412465 · Published: 2020-09-21 · Citations: 0
An Integer Linear Programming Solution for the Most Parsimonious Reconciliation Problem under the Duplication-Loss-Coalescence Model
Morgan Carothers, Joseph Gardi, Gianluca Gross, Tatsuki Kuze, Nuo Liu, Fiona Plunkett, Julia Qian, Yi-Chieh Wu
Given a gene tree, a species tree, and an association between their leaves, the maximum parsimony reconciliation (MPR) problem seeks a mapping of the gene tree onto the species tree that explains their incongruity using a biological model of evolutionary events. Unfortunately, when simultaneously accounting for gene duplication, gene loss, and coalescence, the MPR problem is NP-hard. While an exact algorithm exists, it can be problematic to use in practice due to its time and memory requirements. In this work, we present an integer linear programming (ILP) formulation for solving the MPR problem when considering duplications, losses, and coalescence. Our experimental results on a simulated data set of 12 Drosophila species show that our new algorithm is both accurate and scalable. Furthermore, in contrast to the existing exact algorithm, our formulation allows users to limit the maximum runtime and thus trade off accuracy and scalability, making it an attractive choice for phylogenetic pipelines.
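The full duplication-loss-coalescence model scores losses and coalescence as well, which is what makes the problem NP-hard, but the basic LCA mapping that underlies parsimony reconciliation, and the detection of duplication events, can be sketched on a toy gene/species tree pair. The trees and names below are illustrative only, and this simple duplication-only mapping is not the paper's ILP formulation:

```python
# Species tree (A,B) encoded as a child -> parent map; 'AB' is the root.
parent = {'A': 'AB', 'B': 'AB', 'AB': None}

def ancestors(s):
    chain = []
    while s is not None:
        chain.append(s)
        s = parent[s]
    return chain

def lca(s1, s2):
    """Lowest common ancestor of two species-tree nodes."""
    chain2 = set(ancestors(s2))
    for s in ancestors(s1):
        if s in chain2:
            return s

# Which species each gene-tree leaf was sampled from.
leaf_species = {'a1': 'A', 'b1': 'B', 'a2': 'A', 'b2': 'B'}

def reconcile(node):
    """Return (species mapping of node, duplication count in its subtree).
    Internal gene-tree nodes map to the LCA of their children's mappings;
    a node is a duplication if it maps to the same species node as a child."""
    if isinstance(node, str):                       # leaf gene
        return leaf_species[node], 0
    (ls, ld), (rs, rd) = reconcile(node[0]), reconcile(node[1])
    m = lca(ls, rs)
    dup = 1 if m in (ls, rs) else 0
    return m, ld + rd + dup

# Two gene copies per species: one duplication at the root explains the tree.
gene_tree = (('a1', 'b1'), ('a2', 'b2'))
mapping, dups = reconcile(gene_tree)
print(mapping, dups)  # AB 1
```

The ILP approach instead encodes the space of all such mappings, plus loss and coalescence costs, as linear constraints, so an off-the-shelf solver can search it with a runtime cap.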
DOI: 10.1145/3388440.3412474 · Published: 2020-09-21 · Citations: 1
Mining representative approximate frequent coexpression subnetworks
Sangmin Seo, Saeed Salem
Advances in high-throughput microarray and RNA-sequencing technologies have led to a rapid accumulation of gene expression data for various biological conditions across multiple species. Mining frequent gene modules from a set of multiple gene coexpression networks has applications in functional gene annotation and biomarker discovery. Biclustering algorithms have been proposed to allow for missing coexpression links. Existing approaches report a large number of edgesets, which are computationally intensive to analyze and have high overlap among the reported subnetworks. In this work, we propose an algorithm to mine frequent dense modules from multiple coexpression networks using an online data summarization method. Our algorithm mines a succinct set of representative subgraphs that have little overlap, which reduces the downstream analysis of the reported modules. Experiments on human gene expression data show that the reported modules are biologically significant, as evidenced by the high enrichment of GO molecular functions and KEGG pathways in the reported modules.
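The paper mines approximate frequent *dense* modules with online summarization; the simplest ingredient of that task, finding edges that recur across a set of coexpression networks, can be sketched as follows (gene names and edge sets are hypothetical):

```python
from collections import Counter

# Edge sets of several coexpression networks over the same genes.
networks = [
    {('g1', 'g2'), ('g2', 'g3'), ('g1', 'g3')},
    {('g1', 'g2'), ('g1', 'g3'), ('g4', 'g5')},
    {('g1', 'g2'), ('g2', 'g3'), ('g1', 'g3')},
]

def frequent_edges(networks, min_support):
    """Edges present in at least min_support of the networks."""
    counts = Counter(e for net in networks for e in net)
    return {e for e, c in counts.items() if c >= min_support}

print(sorted(frequent_edges(networks, 3)))  # [('g1', 'g2'), ('g1', 'g3')]
```

The actual algorithm goes further: it groups frequent edges into dense subgraphs, tolerates missing links (approximate frequency), and summarizes overlapping candidates into a small representative set; this sketch covers only the frequency-counting core.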
{"title":"Mining representative approximate frequent coexpression subnetworks","authors":"Sangmin Seo, Saeed Salem","doi":"10.1145/3388440.3415584","DOIUrl":"https://doi.org/10.1145/3388440.3415584","url":null,"abstract":"Advances in high-throughput microarray and RNA-sequencing technologies have lead to a rapid accumulation of gene expression data for various biological conditions across multiple species. Mining frequent gene modules from a set of multiple gene coexpression networks has applications in functional gene annotation and biomarker discovery. Biclustering algorithms have been proposed to allow for missing coexpression links. Existing approaches report a large number of edgesets which are computationally intensive to analyze, and have high overlap among the reported subnetworks. In this work, we propose an algorithm to mine frequent dense modules from multiple coexpression networks using an online data summarization method. Our algorithm mines a succinct set of representative subgraphs that have little overlap which reduces the downstream analysis of the reported modules. Experiments on human gene expression data show that the reported modules are biologically significant as evident by the high enrichment of GO molecular functions and KEGG pathways in the reported modules.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131335514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SparkBeagle: Scalable Genotype Imputation from Distributed Whole-Genome Reference Panels in the Cloud
Altti Ilari Maarala, K. Pärn, J. Nuñez-Fontarnau, Keijo Heljanko
Massive whole-genome genotype reference panels now provide accurate and fast genotyping by imputation for high-resolution genome-wide association (GWA) studies. Imputation-assisted genotyping can increase the genomic coverage of genotypes and thus satisfy the resolution required in comprehensive GWA studies in a cost-effective manner. However, imputing missing genotypes from large reference panels is a compute-intensive process that requires high-performance computing (HPC). Although HPC relies on massively distributed and parallel computing, current imputation tools and existing algorithms have not been developed to fully exploit the power of distributed computing. To this end, we have developed SparkBeagle, a scalable, fast, and accurate distributed genotype imputation tool based on the popular Beagle software. SparkBeagle is designed for HPC and cloud-computing environments and is implemented on top of the Apache Spark distributed computing framework. We carried out scalability experiments by imputing 64,976,316 variants of 2,504 samples from the 1000 Genomes reference panel in the cloud. SparkBeagle shows near-linear scalability as the number of computing nodes increases: a speedup of 30x was achieved with 40 nodes, and the imputation time of the whole data set decreased from 565 minutes to 18 minutes compared to a single-node parallel execution. Near-identical imputation accuracy was measured in the concordance analysis between the original Beagle and the distributed SparkBeagle tool.
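Distributed imputation tools of this kind shard a chromosome into overlapping windows so each chunk can be imputed on a separate worker while the overlap keeps haplotype phasing consistent across boundaries. The sketch below shows only that partitioning idea in plain Python; the function name and the window/overlap parameters are illustrative, not SparkBeagle's actual interface or defaults:

```python
def partition_windows(n_variants, window, overlap):
    """Split n_variants markers into half-open, overlapping
    (start, end) windows. Each window can be imputed independently;
    neighbouring windows share `overlap` markers so results can be
    stitched back together consistently."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window")
    windows = []
    start = 0
    while start < n_variants:
        end = min(start + window, n_variants)
        windows.append((start, end))
        if end == n_variants:
            break
        start = end - overlap
    return windows
```

For instance, 100 markers with a window of 40 and an overlap of 10 yield the chunks (0, 40), (30, 70), (60, 100), which could then be dispatched to separate Spark tasks.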
Citations: 3
Global Surveillance of COVID-19 by mining news media using a multi-source dynamic embedded topic model
Yue Li, Pratheeksha Nair, Zhi Wen, I. Chafi, A. Okhmatovskaia, G. Powell, Yannan Shen, D. Buckeridge
As the COVID-19 pandemic continues to unfold, understanding the global impact of non-pharmacological interventions (NPI) is important for formulating effective intervention strategies, particularly as many countries prepare for future waves. We used a machine learning approach to distill latent topics related to NPI from large-scale international news media. We hypothesize that these topics are informative about the timing and nature of implemented NPI, dependent on the source of the information (e.g., local news versus official government announcements) and the target countries. Given a set of latent topics associated with NPI (e.g., self-quarantine, social distancing, online education, etc.), we assume that countries and media sources have different prior distributions over these topics, which are sampled to generate the news articles. To model the source-specific topic priors, we developed a semi-supervised, multi-source, dynamic, embedded topic model. Our model is able to simultaneously infer latent topics and learn a linear classifier to predict NPI labels using the topic mixtures as input for each news article. To learn these models, we developed an efficient end-to-end amortized variational inference algorithm. We applied our models to news data collected and labelled by the World Health Organization (WHO) and the Global Public Health Intelligence Network (GPHIN). Through comprehensive experiments, we observed superior topic quality and intervention prediction accuracy, compared to the baseline embedded topic models, which ignore information on media source and intervention labels. The inferred latent topics reveal distinct policies and media framing in different countries and media sources, and also characterize reaction to COVID-19 and NPI in a semantically meaningful manner. Our PyTorch code is available on GitHub (https://github.com/li-lab-mcgill/covid19_media).
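The generative assumption — each media source has its own prior over NPI topics, and articles are sampled from that prior — can be sketched in a few lines. This toy uses a fixed categorical prior per source rather than the paper's learned Dirichlet/embedded topics; all names and the tiny vocabulary are hypothetical:

```python
import random

def sample_article(topic_prior, topic_words, n_words, rng):
    """Toy generative process: repeatedly pick a topic from the
    source-specific prior, then pick a word from that topic's
    vocabulary. A categorical stand-in for the paper's
    source-specific topic priors."""
    topics = list(topic_prior)
    weights = [topic_prior[t] for t in topics]
    article = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=weights)[0]
        article.append(rng.choice(topic_words[topic]))
    return article
```

A source whose prior puts most mass on "self-quarantine" will, under this process, generate articles dominated by that topic's vocabulary — the signal the model inverts to recover source-specific priors from real news.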
Citations: 9
Journal
Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics