Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics: Latest Papers

An Investigation on Public Cloud Performance Variation for an RNA Sequencing Workflow

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414859

David Perez, Ling-Hong Hung, Sonia Xu, K. Y. Yeung, W. Lloyd

Public Infrastructure-as-a-Service (IaaS) clouds abstract various details regarding the implementation of resources provided to users. For example, users are not informed about the exact physical location of their virtual machines (VMs), the specific hardware used, the number of co-resident VMs they share a host with, or the workloads that co-resident VMs are running. Detecting when VMs underperform can help identify resource contention from co-resident VMs and spur their replacement. Resource utilization metrics can be used to help classify the performance of runs for use in VM performance model datasets that sample the distribution of performance outcomes in the cloud. VM performance models are key to predicting the cost of bioinformatics analyses in the public cloud. This paper investigates the performance variations of running an RNA sequencing workflow in the public cloud. We examine causes of performance variations including VM provisioning, CPU heterogeneity, and resource contention. We leverage Amazon Elastic Compute Cloud (EC2) placement groups, a feature designed to influence VM placement, to examine how VM placement impacts performance variations. As a use case, we investigate the performance of a multi-stage bioinformatics RNA sequencing (RNA-seq) analytical workflow consisting of four distinct phases, executing in 90 minutes on average using 8-core public cloud VMs. In addition, we investigate whether Linux resource utilization metrics collected by profiling workflow runs can help identify performance implications.
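
As an illustration of how a Linux resource utilization metric might flag contention from co-resident VMs, the sketch below computes the share of CPU time reported as "steal" between two snapshots of the aggregate `cpu` line in `/proc/stat`. The abstract does not say which metrics the authors profiled, so the choice of steal time, the function names, and the sample snapshots are all assumptions made for this example.

```python
from typing import Dict

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text: str) -> Dict[str, int]:
    """Return the cumulative tick counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return {name: int(value) for name, value in zip(FIELDS, parts[1:])}
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU ticks between two snapshots that were stolen by the hypervisor."""
    start, end = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: end[name] - start[name] for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total else 0.0


# Hypothetical snapshots (illustrative values, not measurements from the paper).
SNAPSHOT_T0 = "cpu  1000 0 500 8000 100 0 50 50 0 0\n"
SNAPSHOT_T1 = "cpu  1600 0 700 8500 120 0 60 170 0 0\n"

if __name__ == "__main__":
    print(f"steal: {steal_percent(SNAPSHOT_T0, SNAPSHOT_T1):.2f}%")
```

On a real VM the two snapshots would be read from `/proc/stat` at the start and end of a workflow phase; a persistently high steal share is one plausible signal of contention, although the threshold worth acting on is workload-dependent.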

Citations: 2

Performance Evaluation of Viral Infection Diagnosis using T-Cell Receptor Sequence and Artificial Intelligence

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412420

Tim Kosfeld, Jonathan McMillan, R. DiPaolo, Jie Hou, Tae-Hyuk Ahn

The adaptive immune system expresses millions of different receptors that detect and fight pathogens encountered throughout life. Each receptor is encoded by a unique DNA sequence, which is what allows immune cells to express such a diverse repertoire. High-throughput sequencing and analyses of immune cell receptor sequences present a unique opportunity to inform our understanding of immunological responses to infections and to evaluate vaccine efficacy. Even after the infection is eliminated, pathogen-specific immune cells and their receptor sequences are present at higher frequencies than prior to infection, and their increase in frequency prevents secondary infections. As a result of their persistence in the body, they may be useful as a stable biomarker for diagnosing infections and evaluating vaccine efficacy. However, diagnosing infectious statuses from sequence analyses requires thorough analysis of massive datasets at an accuracy beyond that of traditional statistical tests. Here we evaluate various machine learning and deep learning algorithms to measure the performance of the identification and diagnosis of specific viral infections or vaccination statuses using the publicly available mouse (monkeypox infection and smallpox vaccination) and human (cytomegalovirus serostatus) T-cell receptor sequenced datasets. Our intensive experiments hold the potential for effective screening of disease status, including for recently encountered viruses such as SARS-CoV-2, the cause of the ongoing pandemic.

Citations: 0

Graphery

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414915

Heyuan Zeng, Anna M. Ritz

Citations: 0

A Supervised Machine Learning Approach for Distinguishing Between Additive and Replacing Horizontal Gene Transfers

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412428

Abhijit Mondal, Misagh Kordi, Mukul S. Bansal

Horizontal gene transfer is one of the most important drivers of microbial gene and genome evolution. Despite its central role in microbial evolution, several aspects of horizontal gene transfer remain poorly understood. In particular, transfers can be either additive or replacing depending on whether the transferred gene adds itself as a new gene in the recipient genome or replaces an existing homologous gene. However, despite recent efforts, there do not yet exist effective computational approaches for classifying inferred transfers as being additive or replacing. In this work, we address this gap by devising a novel supervised machine learning approach for classifying transfers as being either additive or replacing. Our approach is based on phylogenetic reconciliation, a standard computational technique for inferring transfers. Our classifier, named ARTra, uses as features the classifications provided by several simple reconciliation-based classification rules, along with topological information from the gene tree, and ensembles them to produce a more accurate classification. ARTra is efficient and robust and significantly improves upon the classification accuracy of the only existing computational approach for this problem. We demonstrate the accuracy of ARTra by applying it to a wide range of simulated datasets and to a large biological dataset. Our results show that ARTra performs well over a broad range of evolutionary conditions and on real data, and that it does so even when trained only on a narrow range of such conditions and only using simulated data. An open-source implementation of ARTra is freely available from https://compbio.engr.uconn.edu/software/ARTra/.
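
To make the idea of ensembling several simple classification rules concrete, here is a minimal sketch in which each rule's vote is weighted by its accuracy on labelled training transfers. This is a stand-in technique chosen for brevity, not ARTra's actual learner, and it omits the gene-tree topological features the abstract mentions; the three rules, the toy data, and all function names are hypothetical.

```python
from typing import List, Sequence

# Label convention for this sketch: 1 = replacing transfer, 0 = additive transfer.


def rule_accuracies(features: Sequence[Sequence[int]], labels: Sequence[int]) -> List[float]:
    """Fraction of training transfers on which each rule agrees with the true label."""
    n_rules = len(features[0])
    return [
        sum(1 for row, label in zip(features, labels) if row[j] == label) / len(labels)
        for j in range(n_rules)
    ]


def weighted_vote(row: Sequence[int], weights: Sequence[float]) -> int:
    """Combine the rules' votes, each weighted by its training accuracy."""
    score = sum(w if vote == 1 else -w for vote, w in zip(row, weights))
    return 1 if score > 0 else 0


# Toy training set: each row holds the outputs of three hypothetical rules.
TRAIN_X = [[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
TRAIN_Y = [1, 0, 1, 0, 1]
WEIGHTS = rule_accuracies(TRAIN_X, TRAIN_Y)

if __name__ == "__main__":
    print("rule weights:", WEIGHTS)
    print("prediction for [1, 0, 1]:", weighted_vote([1, 0, 1], WEIGHTS))
```

A trained classifier can learn interactions between rules that a fixed weighted vote cannot, which is presumably part of why a supervised model is used rather than a scheme like this one.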

Citations: 0

A Unified Cloud-Native Architecture For Heterogeneous Data Aggregation And Computation

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414911

Fatemeh Rouzbeh, A. Grama, Paul M. Griffin, Mohammad Adibuzzaman

Improving healthcare depends on collecting and analyzing different types of health-related data, such as Electronic Health Records (EHR), Patient Generated Health Data (PGHD), prescription and medication data, and medical image data. Although different storage and processing solutions have been designed and developed, each solution is usually designed for a specific type of data. Storing, processing, and analyzing all types of data using a single solution does not necessarily result in the best performance and quality of analysis. To achieve better quality, each type of data requires its own storage, data processing, and machine learning solutions, which in some cases cannot be integrated into a unified system. To provide a unified system that serves all types of data, we propose a modular cloud-native architecture whose modules are autonomous in terms of control, deployment, and management for each type of data.

Citations: 0

CanMod

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3415586

Duc Do, S. Bozdag

Transcription factors (TFs) and microRNAs (miRNAs) are two important classes of gene regulators that govern many critical biological processes. Dysregulation of TF-gene and miRNA-gene interactions can lead to the development of multiple diseases including cancer. Many studies have aimed to identify interactions between target genes and their regulators in both normal and disease settings. However, few studies have attempted to elucidate the collaborative relationship between TFs and miRNAs in regulating genes involved in cancer-associated biological processes. Identification of the co-regulatory functions of those regulators in cancer would provide a better understanding of gene regulation at different layers and may also suggest better approaches for targeted therapy. This study proposes a computational pipeline called CanMod to identify cancer-associated gene regulatory modules. CanMod was designed to infer gene regulatory modules that meet three criteria. First, within a module, target genes should be involved in similar biological processes; thus, the modules are distinguishable based on their biological functions. Second, the expression of target genes in a module should be collectively dependent on the expression of their regulators. Third, a regulator and a target should be allowed to be included in multiple modules to reflect the diverse biological roles that the genes and the regulators may be responsible for. CanMod also incorporates other regulatory factors such as copy number alteration and DNA methylation data to infer regulator-target gene interactions with higher accuracy. We applied CanMod to the breast cancer dataset (BRCA) from The Cancer Genome Atlas (TCGA). We found that the modules identified by CanMod were associated with distinguishable biological functions and that the expression of target genes within each module was significantly correlated. In addition, many hub regulators in CanMod were known cancer genes, and CanMod was able to find experimentally validated regulator-target interactions.

Citations: 0

Large-Scale Machine Learning and Optimization for Bioinformatics Data Analysis

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3415587

Jianlin Cheng

Empowered by the availability of high-performance computing (HPC) infrastructure (e.g. GPUs and HPC clusters), machine learning and optimization have become key technologies to analyze big bioinformatics data. In this keynote talk, I will present our machine learning and optimization algorithms for addressing three important data-intensive bioinformatics problems: (1) predicting protein tertiary structures from the evolutionary information in big protein sequence data generated by genomics and meta-genomics sequencing; (2) reconstructing high-resolution 3D genome conformations for integrating omics data; and (3) modeling gene regulatory networks from transcriptomics and genomics data, leveraging the high-performance computing platform available at the University of Missouri-Columbia. The three research topics are briefly described below. Protein structure modeling on big protein sequence data. Predicting protein tertiary structure from sequence is a major challenge in bioinformatics and protein science. After a long period of stagnancy, the field is experiencing a revolution driven by applying deep learning to leverage the amino acid (residue) evolutionary information hidden in the large amount of protein sequence data generated by the genome and meta-genome sequencing effort. In this talk, I will describe our deep convolutional neural network methods for predicting residue-residue contacts (e.g. interactions) and the distance-based method of reconstructing protein tertiary structures from predicted contacts that was ranked among the top methods in the 13th Critical Assessment of Techniques for Protein Structure Prediction in 2018 [1], along with Google DeepMind's AlphaFold. Reconstructing high-resolution 3D conformations of large genomes for omics data analysis. 3D conformations (or structures) of genomes provide critical gene-gene and gene-enhancer interactions not available in 1D genome sequences. Unlike genome sequencing, there is no experimental technique to directly determine the 3D structure of a genome. In this talk, I will present our high-performance, large-scale, data-driven optimization algorithm for reconstructing high-resolution 3D genome structures from deep chromosome conformation capturing (i.e. Hi-C) data [2]. The algorithm is scalable and efficient enough to reconstruct the 3D structures of large genomes such as the human genome at 5KB resolution. The high-resolution 3D genome models can be used to study gene function, gene expression, and genome methylation, and to integrate multiple sources of omics data. Gene regulatory network modeling on transcriptomics and genomics data. Inferring gene regulatory relationships from large-scale gene expression data is an important, yet unsolved problem in bioinformatics. Gene regulatory networks provide a concise and informative representation of complex gene regulatory relationships. In this talk, I will present our probabilistic graphical model method for reliably reconstructing gene regulatory networks from transcriptomics and genomics data [3]. The inferred gene regulatory relationships were validated by gene function analysis and analysis of transcription binding data. In summary, in this keynote talk, I will show that large-scale machine learning and optimization algorithms play a key role in analyzing and integrating multiple sources of omics data to solve important bioinformatics problems, and that designing machine learning and optimization methods suited to the problem and leveraging large datasets and high-performance computing infrastructure are critical to their success in this field.
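
For readers unfamiliar with residue-residue contacts, the sketch below shows what a contact map is: a binary matrix marking residue pairs whose representative atoms lie within a distance cutoff. It derives the map from known coordinates rather than predicting it, so it illustrates only the prediction target, not the speaker's deep learning method. The 8 Å cutoff follows the common CASP convention; the function name and the toy coordinates are assumptions.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def contact_map(coords: List[Coord], threshold: float = 8.0,
                min_separation: int = 1) -> List[List[int]]:
    """Mark residue pairs (i, j) with j - i >= min_separation closer than threshold."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # contact maps are symmetric
    return cmap


# Toy chain: four residues spaced 3.8 units apart along a straight line
# (hypothetical coordinates in angstroms, for illustration only).
TOY_COORDS: List[Coord] = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

if __name__ == "__main__":
    for row in contact_map(TOY_COORDS):
        print(row)
```

Evaluations usually ignore pairs that are close in sequence, which is what the `min_separation` parameter is for; reconstructing a 3D structure then amounts to finding coordinates consistent with the predicted contacts or distances.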

Citations: 2

Novel Generated Peptides for COVID-19 Targets

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414919

Allison M. Rossetto, Wenjin Zhou

With the world in the midst of a global pandemic, it is important to be able to quickly generate new drug-like compounds for drug research purposes. While some successful work has been done [3, 6], there is still much work to be done, especially as viruses like coronaviruses are notoriously hard to treat. Since peptide drugs are generally better at blocking protein-protein interactions than small molecule drugs [5], which is important in anti-viral work, we use our GANDALF methodology to generate new peptides to interact with targets of interest. Here we are working with two important COVID-19 targets: the SARS-CoV-2 main protease (MPro) and angiotensin-converting enzyme 2 (ACE2). The virus is able to enter human cells via interaction between its spike protein and ACE2 and, once in the cell, MPro breaks down polyproteins to create more of the virus [1]. We have generated peptides for each of our targets using our GANDALF (Generative Adversarial Network Drug-tArget Ligand Fructifier) methodology [4]. We compare our generated peptides with a previously discovered novel ACE2 inhibitor [2]. We also compare our results for MPro with a recently published small molecule based on α-ketoamide inhibitors that was developed as a drug lead [7]. Our best generated peptide for ACE2 is a small, six-residue peptide [SSNATV]. This peptide has a binding affinity of -29.880. The novel peptide inhibitor previously designed has a binding affinity of -19.843. Our generated peptide has a lower binding affinity value, which is generally more desirable and indicates more stable binding. However, the novel inhibitor is larger, at 26 residues, and may be more suitable for use without the need for too many additional modifications. Our peptide is nonetheless a good starting place for further improvements and optimization. The binding affinity of our best generated peptide for MPro is -41.038. This peptide has a size of eleven residues [WWTWTPFHLLV]. We compare it against the small-molecule α-ketoamide-based inhibitor, whose binding affinity is -5.501. Not only does our peptide have a better binding affinity, but as a peptide, it has the added advantage of being better able to disrupt the activity of MPro than the small molecule inhibitor. It is also encouraging that the binding affinity of our best generated MPro peptide is comparable to that of the best available compounds. Peptide-based drugs are an important part of viral treatment. Our work here provides reasonable starting peptides for further drug research and development.


Citations: 1

ELMV

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412431

L. J. Liu, Hongwei Zhang, Jianzhong Di, Jin Chen

Many real-world Electronic Health Record (EHR) datasets contain a large proportion of missing values. Leaving a substantial portion of missing information unaddressed usually causes significant bias, leading to invalid conclusions. On the other hand, training a machine learning model on a much smaller, nearly complete subset can drastically reduce the reliability and accuracy of model inference. Data imputation algorithms, which attempt to replace missing data with meaningful values, inevitably increase the variability of effect estimates as missingness increases, making them unreliable for hypothesis validation. We propose a novel Ensemble-Learning for Missing Value (ELMV) framework, an effective approach that constructs multiple subsets of the original EHR data with much lower missing rates and mobilizes dedicated support data for ensemble learning, in order to reduce the bias caused by substantial missing values. ELMV has been evaluated on real-world healthcare data for critical feature identification and on simulation data with different missing rates for outcome prediction. In both experiments, ELMV outperforms conventional missing value imputation methods and traditional ensemble learning models. The source code of ELMV is available at https://github.com/lucasliu0928/ELMV.
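The general idea of building several low-missingness subsets and combining models trained on them can be illustrated with a small sketch. This is not the authors' implementation (that is in the repository linked above); it is a toy example under stated assumptions: missing values are encoded as `None`, each subset keeps only the rows fully observed on a chosen feature set, a nearest-centroid classifier stands in for the base learner, and predictions are combined by majority vote. All function names and the data are hypothetical.

```python
from collections import Counter


def missing_rate(rows, features):
    """Fraction of missing (None) cells over the given feature columns."""
    cells = [row[f] for row in rows for f in features]
    return sum(v is None for v in cells) / len(cells) if cells else 0.0


def build_subsets(rows, labels, feature_sets):
    """For each feature set, keep only the rows fully observed on it."""
    subsets = []
    for features in feature_sets:
        keep = [i for i, row in enumerate(rows)
                if all(row[f] is not None for f in features)]
        if keep:
            subsets.append((features,
                            [rows[i] for i in keep],
                            [labels[i] for i in keep]))
    return subsets


def fit_centroids(features, rows, labels):
    """Nearest-centroid model: per-class mean of each selected feature."""
    grouped = {}
    for row, y in zip(rows, labels):
        grouped.setdefault(y, []).append([row[f] for f in features])
    return {y: [sum(col) / len(col) for col in zip(*vals)]
            for y, vals in grouped.items()}


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def predict(models, sample):
    """Majority vote over the models whose features the sample has observed."""
    votes = []
    for features, centroids in models:
        if any(sample[f] is None for f in features):
            continue  # this model cannot score the sample
        x = [sample[f] for f in features]
        votes.append(min(centroids,
                         key=lambda y: squared_distance(x, centroids[y])))
    return Counter(votes).most_common(1)[0][0] if votes else None


# Toy data (hypothetical): three features, one third of all cells missing.
rows = [
    [1.0, None, 0.9],
    [1.2, 0.8, None],
    [None, 1.1, 1.0],
    [5.0, 5.2, None],
    [None, 4.9, 5.1],
    [5.1, None, 4.8],
]
labels = [0, 0, 0, 1, 1, 1]
feature_sets = [(0, 1), (1, 2), (0, 2)]

subsets = build_subsets(rows, labels, feature_sets)
models = [(f, fit_centroids(f, r, y)) for f, r, y in subsets]
```

In this toy setup no row is complete, so a complete-case analysis would have no data at all, while each feature-pair subset still retains two fully observed rows. A sample such as `[1.1, None, 1.0]` is scored only by the model built on features `(0, 2)`, since the other two models need the missing feature.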


Citations: 2

Beyond B-Cell Epitopes: Curating Positive Data on Antipeptide Paratope Binding to Support Development of Computational Tools for Vaccine Design and Other Translational Applications

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414923

S. Caoili

B-cell epitope prediction was first developed to help design peptide-based vaccines for protective antibody-mediated immunity exemplified by neutralization of biological activity (e.g., pathogen infectivity). Requisite computational tools are benchmarked using experimentally obtained paratope-epitope binding data, which also serve as training data for machine-learning approaches to development of said tools. Such data are curated in the Immune Epitope Database (IEDB). However, IEDB curation guidelines define B-cell epitopes primarily on the basis of paratope-bound epitope structures, obscuring the crucial role of conformational disorder in the underlying immune recognition process. For the present work, pertinent IEDB B-cell assay records were retrieved and analyzed in relation to other data from both IEDB and external sources including the Protein Data Bank (PDB) and published literature, with special attention to data on conformational disorder among B-cell epitopes. This revealed examples of antipeptide antibodies that recognize conformationally disordered B-cell epitopes and thereby neutralize the biological activity of cognate targets (e.g., proteins and pathogens), with inconsistency noted in the definition of some epitopes. These results suggest an alternative approach to curating paratope-epitope binding data based on neutralization of biological activity by polyclonal antipeptide antibodies, with reference to immunogenic peptide sequences and their conformational disorder in the unbound state.


Citations: 0