Pub Date: 2024-12-01 | Epub Date: 2024-06-27 | DOI: 10.1002/qub2.67
Jinge Wang, Zien Cheng, Qiuming Yao, Li Liu, Dong Xu, Gangqing Hu
The year 2023 marked a significant surge in the exploration of applying large language model chatbots, notably Chat Generative Pre-trained Transformer (ChatGPT), across various disciplines. We surveyed the application of ChatGPT in bioinformatics and biomedical informatics throughout the year, covering omics, genetics, biomedical text mining, drug discovery, biomedical image understanding, bioinformatics programming, and bioinformatics education. Our survey delineates the current strengths and limitations of this chatbot in bioinformatics and offers insights into potential avenues for future developments.
{"title":"Bioinformatics and biomedical informatics with ChatGPT: Year one review.","authors":"Jinge Wang, Zien Cheng, Qiuming Yao, Li Liu, Dong Xu, Gangqing Hu","doi":"10.1002/qub2.67","DOIUrl":"10.1002/qub2.67","url":null,"PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"12 4","pages":"345-359"},"PeriodicalIF":0.6,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446534/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142373184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-01 | Epub Date: 2024-06-21 | DOI: 10.1002/qub2.57
Muhammad Azam, Yibo Chen, Micheal Olaolu Arowolo, Haowang Liu, Mihail Popescu, Dong Xu
Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways cannot keep up with the exponential growth of new discoveries in the literature. Large-scale language models (LLMs) trained on extensive text corpora contain rich biological information, and they can be mined as a biological knowledge graph. This study assesses 21 LLMs, including both application programming interface (API)-based models and open-source models, in their capacity to retrieve biological knowledge. The evaluation focuses on predicting gene regulatory relations (activation, inhibition, and phosphorylation) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway components. Results indicated a significant disparity in model performance. API-based models GPT-4 and Claude-Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for the gene regulatory relation prediction, and Jaccard similarity indices of 0.2778 and 0.2657 for the KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts; among them, Falcon-180b and llama2-7b had the highest F1 scores in gene regulatory relations, 0.2787 and 0.1923, respectively. The KEGG pathway recognition had a Jaccard similarity index of 0.2237 for Falcon-180b and 0.2207 for llama2-7b. Our study suggests that LLMs are informative in gene network analysis and pathway mapping, but their effectiveness varies, necessitating careful model selection. This work also provides a case study and insight into using LLMs as knowledge graphs. Our code is publicly available on GitHub (Muh-aza).
{"title":"A comprehensive evaluation of large language models in mining gene relations and pathway knowledge.","authors":"Muhammad Azam, Yibo Chen, Micheal Olaolu Arowolo, Haowang Liu, Mihail Popescu, Dong Xu","doi":"10.1002/qub2.57","DOIUrl":"10.1002/qub2.57","url":null,"PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"12 4","pages":"360-374"},"PeriodicalIF":0.6,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446478/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142373183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
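The F1 score and Jaccard similarity index reported in the LLM evaluation abstract above can be illustrated with a minimal sketch. It assumes predictions and references are compared as plain sets; the function names and the toy gene relations are hypothetical and are not taken from the authors' code, whose exact scoring protocol is not described here.

```python
# Minimal sketch (assumption: predictions and references are compared as sets).

def f1_score(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall over two sets."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def jaccard_index(predicted: set, reference: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = predicted | reference
    if not union:
        return 0.0
    return len(predicted & reference) / len(union)


# Toy (regulator, relation, target) triples, for illustration only.
reference = {
    ("GENE_A", "activation", "GENE_B"),
    ("GENE_C", "inhibition", "GENE_A"),
    ("GENE_D", "phosphorylation", "GENE_E"),
}
predicted = {
    ("GENE_A", "activation", "GENE_B"),
    ("GENE_C", "inhibition", "GENE_A"),
    ("GENE_D", "activation", "GENE_E"),  # wrong relation type
}

f1 = f1_score(predicted, reference)
jaccard = jaccard_index(predicted, reference)
```

With two of three triples matching, precision and recall are both 2/3, so F1 is 2/3, while the Jaccard index is 2/4 because the union holds four distinct triples.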
Gene regulatory network (GRN) inference from gene expression data is an important approach to understanding biological systems. Compared with generalized correlation‐based methods, causality‐inspired ones seem better suited to inferring regulatory relationships. We propose GRINCD, a novel GRN inference framework empowered by graph representation learning and causal asymmetric learning, considering both linear and non‐linear regulatory relationships. First, a high‐quality representation of each gene is generated using a graph neural network. Then, we apply the additive noise model to predict the causal regulation of each regulator‐target pair. Additionally, we design two channels and finally assemble them for robust prediction. Through comprehensive comparisons of our framework with state‐of‐the‐art methods based on different principles on numerous datasets of diverse types and scales, the experimental results show that our framework achieves superior or comparable performance under various evaluation metrics. Our work provides a new avenue for constructing GRNs, and our proposed framework GRINCD also shows potential in identifying key factors affecting cancer development.
{"title":"Gene regulatory network inference based on causal discovery integrating with graph neural network","authors":"Ke Feng, Hongyang Jiang, Chaoyi Yin, Huiyan Sun","doi":"10.1002/qub2.26","DOIUrl":"https://doi.org/10.1002/qub2.26","url":null,"PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"458 ","pages":""},"PeriodicalIF":3.1,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139022894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The information on host–microbe interactions contained in the operational taxonomic unit (OTU) abundance table can serve as a clue to understanding the biological traits of OTUs and samples. Some studies have inferred the taxonomies or functions of OTUs by constructing co‐occurrence networks, but co‐occurrence networks can only encompass a small fraction of all OTUs due to the high sparsity of the OTU table. Few studies have intensively explored and used the information on sample‐OTU interactions. This study constructed a sample‐OTU heterogeneous information network and represented the nodes in the network through the heterogeneous graph embedding method to form the OTU space and sample space. Taking advantage of the represented OTU and sample vectors combined with the original OTU abundance information, an Integrated Model of Embedded Taxonomies and Abundance (IMETA) was proposed for predicting sample attributes, such as phenotypes and individual diet habits. Both the OTU space and sample space contain reasonable biological or medical semantic information, and the IMETA using embedded OTU and sample vectors achieves stable, good performance in the sample classification tasks. This suggests that the embedding representation based on the sample‐OTU heterogeneous information network can provide more useful information for understanding microbiome samples. This study produced quantified representations of the biological characteristics within the OTUs and samples, which is a useful attempt to increase the utilization of information in the OTU abundance table, and it promotes a deeper understanding of the human microbiome.
{"title":"Reorganizing heterogeneous information from host–microbe interaction reveals innate associations among samples","authors":"Hongfei Cui","doi":"10.1002/qub2.25","DOIUrl":"https://doi.org/10.1002/qub2.25","url":null,"PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"16 10","pages":""},"PeriodicalIF":3.1,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139214325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dali Wang, Jiaxuan Li, Lei Wang, Yipeng Cao, Bo Kang, Xiangfei Meng, Sai Li, Chen Song
The causative pathogen of coronavirus disease 2019 (COVID‐19), severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2), is an enveloped virus assembled from a lipid envelope and multiple structural proteins. In this study, by integrating experimental data, structural modeling, as well as coarse‐grained and all‐atom molecular dynamics simulations, we constructed multiscale models of SARS‐CoV‐2. Our 500‐ns coarse‐grained simulation of the intact virion allowed us to investigate the dynamic behavior of the membrane‐embedded proteins and the surrounding lipid molecules in situ. Our results indicated that the membrane‐embedded proteins are highly dynamic, and certain types of lipids exhibit distinct binding preferences for specific sites of the membrane‐embedded proteins. The equilibrated virion model was transformed into atomic resolution, which provided a 3D structure for scientific demonstration and can serve as a framework for future exascale all‐atom molecular dynamics (MD) simulations. A short all‐atom molecular dynamics simulation of 255 ps was conducted as a preliminary test for large‐scale simulations of this complex system.
{"title":"Toward atomistic models of intact severe acute respiratory syndrome coronavirus 2 via Martini coarse‐grained molecular dynamics simulations","authors":"Dali Wang, Jiaxuan Li, Lei Wang, Yipeng Cao, Bo Kang, Xiangfei Meng, Sai Li, Chen Song","doi":"10.1002/qub2.20","DOIUrl":"https://doi.org/10.1002/qub2.20","url":null,"PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"20 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2023-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139223234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Creating man‐made life in the laboratory is one of life science’s most intriguing yet challenging problems. Advances in synthetic biology and related theories, particularly those related to the origin of life, have laid the groundwork for further exploration and understanding in this field of artificial life or man‐made life. But there remains a wealth of quantitative mathematical models and tools that have yet to be applied to this area. In this paper, we review the two main approaches often employed in the field of man‐made life, the top‐down approach that reduces the complexity of extant living systems and the bottom‐up approach that integrates well‐defined components, by introducing their theoretical basis, recent advances, and limitations. We then argue for another possible approach, namely “bottom‐up from the origin of life”: starting with the establishment of autocatalytic chemical reaction networks that employ physical boundaries as the initial compartments, then designing directed evolutionary systems, with the expectation that independent compartments will eventually emerge so that the system becomes free‐living. This approach is analogous to the process by which life originated. With this paper, we aim to stimulate the interest of synthetic biologists and experimentalists in a more theoretical perspective, and to promote communication between the origin of life community and the synthetic man‐made life community.
{"title":"Theoretical perspective on synthetic man‐made life: Learning from the origin of life","authors":"Lu Peng, Zecheng Zhang, Xianyi Wang, Weiyi Qiu, Liqian Zhou, Hui Xiao, Chunxiuzi Liu, Shaohua Tang, Zhiwei Qin, Jiakun Jiang, Zengru Di, Yu Liu","doi":"10.1002/qub2.22","DOIUrl":"https://doi.org/10.1002/qub2.22","url":null,"PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"30 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139231000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Electroactive microorganisms (EAMs) can utilize extracellular electron transfer (EET) pathways to exchange electrons and energy with their external surroundings. Conductive cytochrome proteins and nanowires play crucial roles in controlling the electron transfer rate from the cytosol to the extracellular electrode. Many previous studies have elucidated how the c‐type cytochrome proteins and conductive nanowires are synthesized, assembled, and engineered to manipulate the EET rate, and have quantified the kinetic processes of electron generation and EET. Here, we first give an overview of the electron transfer pathways of EAMs and quantify the kinetic parameters that dictate intracellular electron production and EET. Second, we systematically review the structure, conductivity mechanisms, and engineering strategies used to manipulate conductive cytochromes and nanowires in EAMs. Lastly, we outline potential directions for future research on cytochromes and conductive nanowires for enhanced electron transfer. This article reviews the quantitative kinetics of intracellular electron production and EET, and the contribution of engineered c‐type cytochromes and conductive nanowires to enhancing the EET rate, which lays the foundation for enhancing the electron transfer capacity of EAMs.
{"title":"Conductive proteins‐based extracellular electron transfer of electroactive microorganisms","authors":"Junqi Zhang, Zixuan You, Dingyuan Liu, Rui Tang, Chao Zhao, Yingxiu Cao, Feng Li, Hao-Qing Song","doi":"10.1002/qub2.24","DOIUrl":"https://doi.org/10.1002/qub2.24","url":null,"PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"61 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139228917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The prediction of molecular properties is a crucial task in the field of drug discovery. Computational methods that can accurately predict molecular properties can significantly accelerate the drug discovery process and reduce its cost. In recent years, iterative updates in computing hardware and the rise of deep learning have created a new and effective path for molecular property prediction. Deep learning methods can leverage the vast amount of data accumulated over the years in drug discovery and do not require complex feature engineering. In this review, we summarize molecular representations and commonly used datasets in molecular property prediction models and present advanced deep learning methods for molecular property prediction, including state‐of‐the‐art deep learning networks such as graph neural networks and Transformer‐based models, as well as state‐of‐the‐art deep learning strategies such as 3D pre‐training, contrastive learning, multi‐task learning, transfer learning, and meta‐learning. We also point out some critical issues such as lack of datasets, low information utilization, and lack of specificity for diseases.
{"title":"Advanced deep learning methods for molecular property prediction","authors":"Chao Pang, Henry H. Y. Tong, Leyi Wei","doi":"10.1002/qub2.23","DOIUrl":"https://doi.org/10.1002/qub2.23","url":null,"PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"33 2","pages":""},"PeriodicalIF":3.1,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139259366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feiran Li, Yu Chen, Johan Gustafsson, Hao Wang, Yi Wang, Chong Zhang, Xinhui Xing
Over the last 15 years, genome‐scale metabolic models (GEMs) have been reconstructed for humans and model animals, such as mouse and rat, to systematically understand metabolism, simulate multicellular or multi‐tissue interplay, understand human diseases, and guide cell factory design for biopharmaceutical protein production. Here, we describe how metabolic networks can be represented using stoichiometric matrices and well‐defined constraints for flux simulation. Then, we review the history of GEM development for quantitative understanding of Homo sapiens and other relevant animals, together with their applications. We describe how models have developed from H. sapiens to other animals and from generic purpose to precise context‐specific simulation. The progress of GEMs for animals has greatly expanded our systematic understanding of metabolism in humans and related animals. We discuss the difficulties and present perspectives on GEM development and the quest to integrate more biological processes and omics data for future research and translation. We hope that this review can inspire new models for other mammalian organisms and new algorithms for integrating big data, enabling more in‐depth analysis and further progress in human health and biopharmaceutical engineering.
{"title":"Genome‐scale metabolic models applied for human health and biopharmaceutical engineering","authors":"Feiran Li, Yu Chen, Johan Gustafsson, Hao Wang, Yi Wang, Chong Zhang, Xinhui Xing","doi":"10.1002/qub2.21","DOIUrl":"https://doi.org/10.1002/qub2.21","url":null,"PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"8 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136352058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
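The genome‐scale metabolic model abstract above mentions representing a metabolic network with a stoichiometric matrix and constraints for flux simulation. A minimal sketch of that idea follows, assuming a hypothetical three‐reaction toy network (uptake of A, conversion of A to B, export of B); the reaction names, bounds, and helper functions are illustrative assumptions, not part of the reviewed models.

```python
# Minimal sketch: steady-state mass balance S·v = 0 plus flux bounds.
# Rows of S are metabolites (A, B); columns are reactions:
#   v1: -> A (uptake), v2: A -> B (conversion), v3: B -> (export)
S = [
    [1, -1, 0],   # metabolite A: produced by v1, consumed by v2
    [0, 1, -1],   # metabolite B: produced by v2, consumed by v3
]

# Hypothetical lower/upper bounds per reaction (arbitrary flux units).
bounds = [(0, 10), (0, 1000), (0, 1000)]


def mass_balance(S, v):
    """Return S·v, the net production rate of each metabolite."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_feasible(S, v, bounds, tol=1e-9):
    """True if v satisfies steady state (S·v = 0) and all flux bounds."""
    steady = all(abs(rate) <= tol for rate in mass_balance(S, v))
    within = all(lo <= v_j <= hi for v_j, (lo, hi) in zip(v, bounds))
    return steady and within


balanced = is_feasible(S, [10, 10, 10], bounds)     # all fluxes equal
accumulating = is_feasible(S, [10, 5, 5], bounds)   # A builds up
over_limit = is_feasible(S, [20, 20, 20], bounds)   # uptake above bound
```

A flux balance analysis would go one step further and search this feasible set for the flux vector that maximizes an objective (for example, the export flux) using a linear programming solver; the sketch only checks feasibility of a given vector.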
Recently, Quantitative Biology (QB) held a discussion on “AI (artificial intelligence) for Life Science” among editorial board members and interested scholars, in anticipation of the rapid development of this growing area after the AlphaGo and ChatGPT mania. Many young people tend to confuse fact with fiction, and heated debates are unavoidable even among their mentors. When deep learning, as represented by convolutional neural networks and LSTM (long short-term memory), became available to bioinformatics students, many of them rushed into this research field and tried to adopt these methods in all their projects, without knowing the history: these tools became successful in step with Moore’s Law (relating to rapid advances in computer technology) but, more importantly, owing to new structural and functional understanding of the visual and auditory circuits in the brain. Recently, some young people have claimed “LSTM is dead, long live transformer” (which is somewhat like saying “the bike is dead, long live the car”), and have amplified the threat that ChatGPT could wipe out human jobs. They believe the transformer is the “silver bullet” for all learning tasks, which clearly reflects a lack of basic knowledge (i.e., the “No Free Lunch Theorem”: the trade-off of such a global “attention network” is the price paid for complexity, namely difficulty of training and high memory costs). There is no doubt that ML (machine learning) and AI have brought a new revolution in science and technology and will have a huge, unforeseeable impact on everyday human life as well as on social relationships. In this context, the QB journal could be a great platform for encouraging intellectual discussion and for promoting AI for Life Science. Here, I would like to use the DIALOG to “抛砖引玉” (make some initial remarks to get the ball rolling), although this is my personal opinion, which is inevitably subject to bias and limitations.
AI: Do you know that my name, “Artificial Intelligence,” is defined by the Oxford English Dictionary as the capacity of computer systems (which may be referred to as “robots”) to exhibit or simulate your intelligent behavior?
NI: Wait a minute, intelligence itself is defined as the ability to learn, understand, and think in a logical way. Can you think?
AI: No. But that definition is too restrictive; intelligence actually has different scopes and degrees. Simple intelligent control devices date back to antiquity, from windmills to thermostats.
NI: Agreed, everything is relative. Macromolecules (e.g., enzymes) and cells (e.g., immune cells) might be considered intelligent; see how a white blood cell chases bacteria on YouTube (search for “Crawling neutrophil chasing a bacterium”). Our emergent/collective intelligent behavior does not require a brain or even a neuron; see how slime molds can solve an optimization problem (the Hamiltonian cycle problem) more effectively than a human on YouTube (search for “Intelligence without a brain?”). Before there was any neuron, C
{"title":"Dialog between artificial intelligence & natural intelligence","authors":"Michael Q. Zhang","doi":"10.1002/qub2.5","DOIUrl":"https://doi.org/10.1002/qub2.5","url":null,"abstract":"Recently, Quantitative Biology (QB) held a discussion on “AI (artificial intelligence) for Life Science” among editorial board members and interested scholars, in anticipation of the rapid development of this growing area after the AlphaGo and ChatGPT mania. Many young people tend to confuse fact with fiction, and heated debates are unavoidable even among their mentors. When deep learning, as represented by convolutional neural networks and LSTM (long short-term memory), became available to bioinformatics students, many of them rushed into this research field and tried to adopt these methods in all their projects, without knowing the history: these tools became successful in step with Moore’s Law (relating to rapid advances in computer technology) but, more importantly, owing to new structural and functional understanding of the visual and auditory circuits in the brain. Recently, some young people have claimed “LSTM is dead, long live transformer” (which is somewhat like saying “the bike is dead, long live the car”), and have amplified the threat that ChatGPT could wipe out human jobs. They believe the transformer is the “silver bullet” for all learning tasks, which clearly reflects a lack of basic knowledge (i.e., the “No Free Lunch Theorem”: the trade-off of such a global “attention network” is the price paid for complexity, namely difficulty of training and high memory costs). There is no doubt that ML (machine learning) and AI have brought a new revolution in science and technology and will have a huge, unforeseeable impact on everyday human life as well as on social relationships. In this context, the QB journal could be a great platform for encouraging intellectual discussion and for promoting AI for Life Science. 
Here, I would like to use the DIALOG to “抛砖引玉” (make some initial remarks to get the ball rolling), although this is my personal opinion, which is inevitably subject to bias and limitations. AI: Do you know that my name, “Artificial Intelligence,” is defined by the Oxford English Dictionary as the capacity of computer systems (which may be referred to as “robots”) to exhibit or simulate your intelligent behavior? NI: Wait a minute, intelligence itself is defined as the ability to learn, understand, and think in a logical way. Can you think? AI: No. But that definition is too restrictive; intelligence actually has different scopes and degrees. Simple intelligent control devices date back to antiquity, from windmills to thermostats. NI: Agreed, everything is relative. Macromolecules (e.g., enzymes) and cells (e.g., immune cells) might be considered intelligent; see how a white blood cell chases bacteria on YouTube (search for “Crawling neutrophil chasing a bacterium”). Our emergent/collective intelligent behavior does not require a brain or even a neuron; see how slime molds can solve an optimization problem (the Hamiltonian cycle problem) more effectively than a human on YouTube (search for “Intelligence without a brain?”). Before there was any neuron, C","PeriodicalId":45660,"journal":{"name":"Quantitative Biology","volume":"241 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135974128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}