Large-Scale Machine Learning and Optimization for Bioinformatics Data Analysis

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Pub Date: 2020-09-21. DOI: 10.1145/3388440.3415587

Jianlin Cheng

Citations: 2

Abstract

Empowered by the availability of high-performance computing (HPC) infrastructure (e.g. GPUs and HPC clusters), machine learning and optimization have become key technologies for analyzing big bioinformatics data. In this keynote talk, I will present our machine learning and optimization algorithms for addressing three important data-intensive bioinformatics problems: (1) predicting protein tertiary structures from the evolutionary information in big protein sequence data generated by genomics and meta-genomics sequencing; (2) reconstructing high-resolution 3D genome conformations for integrating omics data; and (3) modeling gene regulatory networks from transcriptomics and genomics data, leveraging the high-performance computing platform available at the University of Missouri-Columbia. The three research topics are briefly described below.

Protein structure modeling on big protein sequence data.
Predicting protein tertiary structure from sequence is a major challenge in bioinformatics and protein science. After a long period of stagnancy, the field is experiencing a revolution driven by applying deep learning to the amino acid (residue) evolutionary information hidden in the large amount of protein sequence data generated by genome and meta-genome sequencing efforts. In this talk, I will describe our deep convolutional neural network methods for predicting residue-residue contacts (i.e. interactions) and our distance-based method for reconstructing protein tertiary structures from predicted contacts, which was ranked among the top methods, along with Google DeepMind's AlphaFold, in the 13th Critical Assessment of Techniques for Protein Structure Prediction (CASP13) in 2018 [1].

Reconstructing high-resolution 3D conformations of large genomes for omics data analysis.
3D conformations (or structures) of genomes provide critical gene-gene and gene-enhancer interactions not available in 1D genome sequences. Unlike genome sequencing, there is no experimental technique that directly determines the 3D structure of a genome. In this talk, I will present our high-performance, large-scale, data-driven optimization algorithm for reconstructing high-resolution 3D genome structures from deep chromosome conformation capture (i.e. Hi-C) data [2]. The algorithm is highly scalable and efficient, and can reconstruct the 3D structures of large genomes such as the human genome at 5 kb resolution. The high-resolution 3D genome models can be used to study gene function, gene expression, and genome methylation, and to integrate multiple sources of omics data.

Gene regulatory network modeling on transcriptomics and genomics data.
Inferring gene regulatory relationships from large-scale gene expression data is an important yet unsolved problem in bioinformatics. Gene regulatory networks provide a concise and informative representation of complex gene regulatory relationships. In this talk, I will present our probabilistic graphical model method for reliably reconstructing gene regulatory networks from transcriptomics and genomics data [3]. The inferred gene regulatory relationships are validated by gene function analysis and transcription binding data analysis.

In conclusion, in this keynote talk I will demonstrate that large-scale machine learning and optimization algorithms play a key role in analyzing and integrating multiple sources of omics data to solve important bioinformatics problems. I will also show that designing machine learning and optimization methods that fit the problem well and leverage large datasets and high-performance computing infrastructure is critical to their success in the field.
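
As a rough illustration of the 3D genome reconstruction problem described in this abstract, the sketch below turns a toy contact matrix into target distances and recovers 3D coordinates from them. It is not the algorithm of [2], which the abstract does not specify. It swaps in classical multidimensional scaling (MDS) as a simple, deterministic stand-in for the optimization step. The inverse power-law conversion d = 1 / c^alpha, the value alpha = 1, the helix-shaped toy "chromosome", and all function names are assumptions made for this example.

```python
import numpy as np


def contacts_to_distances(contacts, alpha=1.0):
    """Map contact frequencies to target distances with d = 1 / c**alpha.

    The inverse power law and alpha are assumptions for this sketch.
    Pairs with zero contacts are left at 0 here; real Hi-C data would
    need those entries imputed or excluded.
    """
    c = np.asarray(contacts, dtype=float)
    d = np.zeros_like(c)
    positive = c > 0
    d[positive] = 1.0 / c[positive] ** alpha
    np.fill_diagonal(d, 0.0)
    return d


def classical_mds(dist, dim=3):
    """Embed points in `dim` dimensions from a full pairwise distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]              # indices of the largest ones
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # guard against tiny negatives
    return eigvecs[:, top] * scale


def pairwise_distances(coords):
    """Euclidean distance between every pair of rows in `coords`."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Toy "chromosome": 30 loci placed along a helix.
t = np.linspace(0.0, 4.0 * np.pi, 30)
true_coords = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])
true_dist = pairwise_distances(true_coords)

# Simulated, noise-free contact map: loci that are closer contact more often.
contacts = np.zeros_like(true_dist)
off_diagonal = ~np.eye(len(t), dtype=bool)
contacts[off_diagonal] = 1.0 / true_dist[off_diagonal]

target_dist = contacts_to_distances(contacts)
model = classical_mds(target_dist, dim=3)

# The model is only defined up to rotation, reflection, and translation,
# so it is compared with the truth through pairwise distances.
max_error = np.abs(pairwise_distances(model) - true_dist).max()
print(f"largest pairwise distance error: {max_error:.2e}")
```

Classical MDS is exact only for complete, noise-free Euclidean distances like those in this toy case. Real Hi-C contact maps are sparse and noisy, which is one reason iterative optimization over the coordinates, as the abstract describes, is generally needed at high resolution.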
