Author: Jianlin Cheng
Venue: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Published: 2020-09-21
DOI: https://doi.org/10.1145/3388440.3415587
Large-Scale Machine Learning and Optimization for Bioinformatics Data Analysis
Empowered by the availability of high-performance computing (HPC) infrastructure (e.g., GPUs and HPC clusters), machine learning and optimization have become key technologies for analyzing big bioinformatics data. In this keynote talk, I will present our machine learning and optimization algorithms for addressing three important data-intensive bioinformatics problems: (1) predicting protein tertiary structures from the evolutionary information in big protein sequence data generated by genomics and metagenomics sequencing; (2) reconstructing high-resolution 3D genome conformations for integrating omics data; and (3) modeling gene regulatory networks from transcriptomics and genomics data. All three efforts leverage the high-performance computing platform available at the University of Missouri, Columbia. The three research topics are briefly described below.

Protein structure modeling on big protein sequence data. Predicting protein tertiary structure from sequence is a major challenge in bioinformatics and protein science. After a long period of stagnation, the field is experiencing a revolution driven by applying deep learning to the amino acid (residue) evolutionary information hidden in the large amount of protein sequence data generated by genome and metagenome sequencing efforts. In this talk, I will describe our deep convolutional neural network methods for predicting residue-residue contacts (i.e., interactions) and our distance-based method for reconstructing protein tertiary structures from predicted contacts. This method was ranked among the top methods in the 13th Critical Assessment of Techniques for Protein Structure Prediction (CASP13) in 2018 [1], along with Google DeepMind's AlphaFold.

Reconstructing high-resolution 3D conformations of large genomes for omics data analysis. 3D conformations (or structures) of genomes reveal critical gene-gene and gene-enhancer interactions that are not available in 1D genome sequences. Unlike genome sequencing, there is no experimental technique that directly determines the 3D structure of a genome. In this talk, I will present our high-performance, large-scale, data-driven optimization algorithm for reconstructing high-resolution 3D genome structures from deep chromosome conformation capture (Hi-C) data [2]. The algorithm is scalable and efficient enough to reconstruct the 3D structures of large genomes, such as the human genome, at 5 kb resolution. The high-resolution 3D genome models can be used to study gene function, gene expression, and genome methylation, and to integrate multiple sources of omics data.

Gene regulatory network modeling on transcriptomics and genomics data. Inferring gene regulatory relationships from large-scale gene expression data is an important yet unsolved problem in bioinformatics. Gene regulatory networks provide a concise and informative representation of complex gene regulatory relationships. In this talk, I will present our probabilistic graphical model method for reliably reconstructing gene regulatory networks from transcriptomics and genomics data [3]. The inferred gene regulatory relationships are validated by gene function analysis and transcription factor binding data analysis.

In conclusion, I will demonstrate that large-scale machine learning and optimization algorithms play a key role in analyzing and integrating multiple sources of omics data to solve important bioinformatics problems. Designing machine learning and optimization methods that fit the problem well and that leverage large datasets and high-performance computing infrastructure is critical to their success in the field.
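
The abstract describes reconstructing 3D structures from pairwise information (predicted contacts for proteins, Hi-C contact frequencies for genomes) by optimization, but gives no algorithmic detail. The following is only a generic illustration of that idea, not the method of [1] or [2]. It assumes, as a common convention rather than anything stated in the source, that a contact frequency f maps to a target distance f^(-alpha), and it minimizes the squared mismatch between model distances and target distances by plain gradient descent. All function names and parameters here are hypothetical.

```python
import numpy as np


def pairwise_distances(coords):
    """Euclidean distance between every pair of 3D points."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def contacts_to_distances(freq, alpha=1.0):
    """Map contact frequencies to target distances (assumed rule: d = f ** -alpha).

    Pairs with zero frequency carry no restraint and get mask 0.
    """
    observed = freq > 0
    target = np.zeros_like(freq, dtype=float)
    target[observed] = freq[observed] ** (-alpha)
    return target, observed.astype(float)


def stress(coords, target, mask):
    """Sum over restrained pairs of (model distance - target distance) ** 2."""
    d = pairwise_distances(coords)
    return float((mask * (d - target) ** 2).sum() / 2.0)


def reconstruct(target, mask, n_steps=500, seed=0):
    """Gradient descent on the stress, starting from random 3D coordinates."""
    n = target.shape[0]
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(n, 3))
    step = 1.0 / (2.0 * n)  # conservative step size for 0/1 restraint weights
    for _ in range(n_steps):
        diff = coords[:, None, :] - coords[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(d, 1.0)  # avoid 0/0; diagonal terms vanish anyway
        weight = mask * (d - target) / d
        grad = 2.0 * (weight[:, :, None] * diff).sum(axis=1)
        coords -= step * grad
    return coords


def demo():
    # Synthetic "ground truth": 20 points along a helix.
    t = np.linspace(0.0, 4.0 * np.pi, 20)
    true_coords = np.stack([np.cos(t), np.sin(t), 0.3 * t], axis=1)
    true_d = pairwise_distances(true_coords)

    # Fake contact frequencies that decay with distance (f = 1 / d).
    freq = np.zeros_like(true_d)
    off_diag = true_d > 0
    freq[off_diag] = 1.0 / true_d[off_diag]

    target, mask = contacts_to_distances(freq, alpha=1.0)
    model = reconstruct(target, mask)
    print("final stress:", stress(model, target, mask))


if __name__ == "__main__":
    demo()
```

The reconstructed coordinates are determined only up to rotation, translation, and reflection, so a model should be compared with a reference through its distance matrix or after superposition. A genome-scale reconstruction at the resolution mentioned above would involve far more points than this toy example and would need the kind of scalable, HPC-oriented implementation the talk refers to.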