Pub Date: 2025-01-06; eCollection Date: 2025-06-01; DOI: 10.1002/qub2.78
Jorge A Tzec-Interián, Daianna González-Padilla, Elsa B Góngora-Castillo
The transcriptome, the complete set of RNA molecules within a cell, plays a critical role in regulating physiological processes. The advent of RNA sequencing (RNA-seq), facilitated by next-generation sequencing (NGS) technologies, has revolutionized transcriptome research, providing unique insights into gene expression dynamics. This powerful strategy can be applied at both the bulk tissue and single-cell levels. Bulk RNA-seq provides a gene expression profile for a tissue sample as a whole. Conversely, single-cell RNA sequencing (scRNA-seq) offers resolution at the cellular level, making it possible to uncover cellular heterogeneity, identify rare cell types, and distinguish between cell populations. As computational tools, machine learning techniques, and NGS platforms continue to evolve, the field of transcriptome research is poised for significant advances. To fully harness this potential, a comprehensive understanding of bulk RNA-seq and scRNA-seq technologies, including their advantages, limitations, and computational considerations, is crucial. This review provides a systematic comparison of the computational processes involved in both bulk RNA-seq and scRNA-seq, highlighting their fundamental principles, applications, strengths, and limitations, while outlining future directions in transcriptome research.
Title: Bioinformatics perspectives on transcriptomics: A comprehensive review of bulk and single-cell RNA sequencing analyses. Journal: Quantitative Biology 13(2): e78. Publication type: Journal Article. Impact factor: 1.4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12806032/pdf/
Pub Date: 2025-01-03; eCollection Date: 2025-06-01; DOI: 10.1002/qub2.88
Xiaojie Li, Jianhui Shi, Lei M Li
The divergence rate between the alignable genomes of humans and chimpanzees is as little as 1.23%. Their phenotypic differences have been hypothesized to be accounted for by gene regulation. We construct the cis-regulatory element frequency (CREF) matrix to represent the proximal regulatory sequences of each species. Each CREF matrix is further decomposed into dual eigen-modules. By comparing the CREF modules of four extant hominid species, we examine their quantitative and qualitative changes over the course of evolution. We identified two saltations: one between the 4th and 5th eigen-levels, the other between the 9th and 10th. The cognition and intelligence unique to humans are thus traced to these saltations at the molecular level. They include long-term memory, cochlea/inner ear morphogenesis that enables the development of human language/music, social behavior that allows us to live together peacefully and to work collaboratively, and visual/observational/associative learning. Moreover, we found exploratory behavior crucial for humans' creativity, the GABA-B receptor activation that protects our neurons, and serotonin biosynthesis/signaling that regulates our happiness. We observed a remarkable increase in the number of motifs present on Alu elements on the 4th/9th motif-eigenvectors. The cognition and intelligence unique to humans can, by and large, be identified using only the CREF profiles, without any a priori knowledge. Although gradual evolution might be the only mode in the mutations of protein sequences, the evolution of gene regulation has both gradual and saltational modes, which could be explained by the framework of CREF eigen-modules.
Title: The human intelligence evolved from proximal cis-regulatory saltations. Journal: Quantitative Biology 13(2): e88. Publication type: Journal Article. Impact factor: 1.4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12806144/pdf/
Pub Date: 2024-12-20; eCollection Date: 2025-03-01; DOI: 10.1002/qub2.83
Zelin Li, Zhaoke Huang, Jianfeng Cao, Guoye Guan, Zhongying Zhao, Hong Yan
Embryogenesis is the most basic process in developmental biology. Quantifying cell shape effectively and simply is challenging for complex and dynamic 3D embryonic cells. Traditional descriptors such as volume, surface area, and mean curvature often fall short, providing only a global view while lacking local detail and reconstruction capability. To address this, we introduce an integrated method, 3D Cell Shape Quantification (3DCSQ), for transforming digitized 3D cell shapes into analytical feature vectors named eigengrid (a proposed grid descriptor analogous to an eigenvalue), eigenharmonic, and eigenspectrum. The method combines spherical grids, spherical harmonics, and principal component analysis for cell shape quantification. We demonstrate 3DCSQ's effectiveness in recognizing cellular morphological phenotypes and clustering cells. Applied to 29 living Caenorhabditis elegans embryos from the 4- to 350-cell stages, 3DCSQ identifies and quantifies biologically reproducible cellular patterns, including distinct skin cell deformations. We also provide a program for automated cell shape lineage analysis. This method not only systematizes cell shape description and evaluation but also monitors cell differentiation through shape changes, presenting an advance in biological imaging and analysis.
Title: An effective method for quantification, visualization, and analysis of 3D cell shape during early embryogenesis. Journal: Quantitative Biology 13(1): e83. Publication type: Journal Article. Impact factor: 1.4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12806067/pdf/
Pub Date: 2024-12-18; eCollection Date: 2025-03-01; DOI: 10.1002/qub2.80
Siyu Liu, Qihui Hou, Daniel B Kearns, Yilin Wu
Self-organized pattern formation is common in biological systems. Microbial populations can generate spatiotemporal patterns through various mechanisms, such as chemotaxis, quorum sensing, and mechanical interactions. When their motile behavior is coupled to a gravitational potential field, swimming microorganisms display a phenomenon known as bioconvection, which is characterized by the formation of active cellular plumes that enhance material mixing in the fluid. While bioconvection patterns have been characterized in various organisms, including eukaryotic and bacterial microswimmers, the dynamics of bioconvection pattern formation in bacteria remain less explored. Here, we study this phenomenon using suspensions of the chemotactic bacterium Bacillus subtilis confined in closed three-dimensional (3D) fluid chambers. We discovered an active plume lattice pattern that displays hexagonal order and emerges via a self-organization process. By measuring the flow field, we revealed a toroidal flow structure associated with individual plumes. We also uncovered a power-law scaling relation between the lattice pattern's wavelength and the dimensionless Rayleigh number, which characterizes the ratio of buoyancy-driven convection to diffusion. Taken together, this study highlights that coupling between chemotaxis and external potential fields can promote the self-assembly of regular spatial structures in bacterial populations. The findings are also relevant to material transport in surface water environments populated by swimming microorganisms.
Title: Self-organization of active plume lattice in bacterial bioconvection. Journal: Quantitative Biology 13(1): e80. Publication type: Journal Article. Impact factor: 1.4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12806025/pdf/
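The power-law scaling between pattern wavelength and Rayleigh number mentioned in this abstract can be illustrated with a log-log least-squares fit. The following is a minimal sketch, not the authors' analysis: the function name `fit_power_law`, the synthetic data, and the exponent of -0.25 are all made up for illustration and are not values reported in the paper.

```python
import math


def fit_power_law(x_values, y_values):
    """Fit y = prefactor * x**exponent by ordinary least squares in log-log space."""
    log_x = [math.log(x) for x in x_values]
    log_y = [math.log(y) for y in y_values]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # The slope of the regression line in log-log space is the power-law exponent.
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    exponent = sxy / sxx
    # The intercept in log space gives the logarithm of the prefactor.
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor


# Synthetic, noise-free data (hypothetical values, NOT measurements from the paper).
rayleigh_numbers = [1e2, 1e3, 1e4, 1e5]
wavelengths = [2.0 * ra ** -0.25 for ra in rayleigh_numbers]

exponent, prefactor = fit_power_law(rayleigh_numbers, wavelengths)
print(f"exponent = {exponent:.3f}, prefactor = {prefactor:.3f}")
```

With real measurements the points scatter around the fitted line, so the exponent would normally be reported with an uncertainty estimate rather than as a single number.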
Pub Date: 2024-12-16; eCollection Date: 2025-03-01; DOI: 10.1002/qub2.75
Yuansheng Cao, Shiling Liang
Living systems operate within physical constraints imposed by nonequilibrium thermodynamics. This review explores recent advances in applying these principles to understand the fundamental limits of biological functions. We introduce the framework of stochastic thermodynamics and its recent developments, followed by its application to various biological systems. We emphasize the interconnectedness of kinetics and energetics within this framework, focusing on how network topology, kinetics, and energetics influence functions in thermodynamically consistent models. We discuss examples in the areas of molecular machines, error correction, biological sensing, and collective behaviors. This review aims to bridge physics and biology by fostering a quantitative understanding of biological functions.
Title: Stochastic thermodynamics for biological functions. Journal: Quantitative Biology 13(1): e75. Publication type: Journal Article. Impact factor: 1.4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12806147/pdf/
Pub Date: 2024-12-14; eCollection Date: 2025-03-01; DOI: 10.1002/qub2.74
Panpan Han, Yang Zhou, Weihua Deng
Cell senescence has long attracted attention, and telomere shortening (TS) is one of the main concerns in the study of cell senescence. To reveal the microscopic mechanism of the TS process, we model it as a molecular stochastic process from the perspective of nonequilibrium statistical physics. We associate the TS process with the continuous time random walk and derive the Fokker-Planck equation to describe the telomere length distribution during TS. We further modify the model describing the TS process, in a manner similar to anomalous tempered diffusion, and derive the Feynman-Kac equation characterizing the functional distribution of the TS process. Finally, we study the statistics related to the critical telomere length l_c, including the occupation time and the first passage time. These two kinds of statistics help us understand the time scale of cell senescence.
Title: Modeling telomere shortening process. Journal: Quantitative Biology 13(1): e74. Publication type: Journal Article. Impact factor: 1.4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12806131/pdf/
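To make the notion of a first passage time to the critical length l_c concrete, here is a minimal Monte Carlo sketch. It uses a deliberately simplified discrete-step random walk (one random loss per cell division), not the continuous time random walk, Fokker-Planck, or Feynman-Kac treatment described in the abstract. The function name and every parameter value (initial length, critical length, mean loss, spread) are hypothetical and chosen only for illustration.

```python
import random


def first_passage_time(initial_length, critical_length, mean_loss, loss_spread,
                       rng, max_steps=100_000):
    """Count divisions until the telomere length first drops to critical_length or below.

    Each division removes a Gaussian-distributed amount, clipped at zero so the
    telomere never grows. Returns None if the threshold is not reached in max_steps.
    """
    length = initial_length
    for step in range(1, max_steps + 1):
        loss = max(0.0, rng.gauss(mean_loss, loss_spread))
        length -= loss
        if length <= critical_length:
            return step
    return None


# Hypothetical parameters (arbitrary length units), NOT taken from the paper.
rng = random.Random(0)  # fixed seed so the run is reproducible
times = [
    first_passage_time(10_000.0, 4_000.0, 75.0, 25.0, rng)
    for _ in range(1000)
]
mean_time = sum(times) / len(times)
print(f"mean first passage time: {mean_time:.1f} divisions")
```

Since each division removes about 75 units on average and 6000 units must be lost, the mean should land close to 6000 / 75 = 80 divisions. The spread of `times` around that value is what a first-passage-time distribution describes.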
Pub Date: 2024-12-07; eCollection Date: 2025-03-01; DOI: 10.1002/qub2.70
Zonghua Liu
In recent years, exploring the physical mechanisms of brain functions has been a hot topic in the fields of nonlinear dynamics and complex networks, and many important achievements have been made, mainly based on the characteristic features of time series from the human brain. To accelerate further study of this problem, we briefly review these achievements, which explain: (i) the mechanism of brain rhythms by network synchronization, (ii) the mechanism of unihemispheric sleep by chimera states, (iii) the fundamental difference between the structural and functional brain networks by remote synchronization, (iv) the mechanism behind the human brain's stronger ability to detect weak signals by remote firing propagation, and (v) the mechanism of dementia patterns by eigen-microstate analysis. As this is a brief review, we focus mainly on basic ideas, research histories, and key results, and omit the tedious mathematical derivations. Finally, we discuss some outlooks for future studies.
Title: Physical mechanisms of human brain functions. Journal: Quantitative Biology 13(1): e70. Publication type: Journal Article. Impact factor: 1.4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12806051/pdf/
Pub Date: 2024-12-01; Epub Date: 2024-06-27; DOI: 10.1002/qub2.67
Jinge Wang, Zien Cheng, Qiuming Yao, Li Liu, Dong Xu, Gangqing Hu
The year 2023 marked a significant surge in the exploration of applying large language model chatbots, notably Chat Generative Pre-trained Transformer (ChatGPT), across various disciplines. We surveyed the application of ChatGPT in bioinformatics and biomedical informatics throughout the year, covering omics, genetics, biomedical text mining, drug discovery, biomedical image understanding, bioinformatics programming, and bioinformatics education. Our survey delineates the current strengths and limitations of this chatbot in bioinformatics and offers insights into potential avenues for future developments.
Title: Bioinformatics and biomedical informatics with ChatGPT: Year one review. Journal: Quantitative Biology 12(4): 345-359. Publication type: Journal Article. Impact factor: 1.4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446534/pdf/
Pub Date: 2024-12-01; Epub Date: 2024-06-21; DOI: 10.1002/qub2.57
Muhammad Azam, Yibo Chen, Micheal Olaolu Arowolo, Haowang Liu, Mihail Popescu, Dong Xu
Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways cannot keep up with the exponential growth of new discoveries in the literature. Large-scale language models (LLMs) trained on extensive text corpora contain rich biological information, and they can be mined as a biological knowledge graph. This study assesses 21 LLMs, including both application programming interface (API)-based models and open-source models, in their capacity to retrieve biological knowledge. The evaluation focuses on predicting gene regulatory relations (activation, inhibition, and phosphorylation) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway components. Results indicated a significant disparity in model performance. The API-based models GPT-4 and Claude-Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for gene regulatory relation prediction, and Jaccard similarity indices of 0.2778 and 0.2657 for KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts; among them, Falcon-180b and llama2-7b had the highest F1 scores for gene regulatory relations, 0.2787 and 0.1923, respectively. For KEGG pathway recognition, the Jaccard similarity index was 0.2237 for Falcon-180b and 0.2207 for llama2-7b. Our study suggests that LLMs are informative in gene network analysis and pathway mapping, but their effectiveness varies, necessitating careful model selection. This work also provides a case study and insight into using LLMs as knowledge graphs. Our code is publicly available on GitHub (Muh-aza).
Title: A comprehensive evaluation of large language models in mining gene relations and pathway knowledge. Journal: Quantitative Biology 12(4): 360-374. Publication type: Journal Article. Impact factor: 1.4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446478/pdf/
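The two metrics quoted in this abstract, the F1 score and the Jaccard similarity index, can be computed on sets of predicted and reference items as sketched below. This is a generic set-based formulation and an assumption on our part: the paper's exact scoring protocol (for example, how partial matches or relation types are handled) is not given in the abstract. The placeholder gene names are hypothetical.

```python
def f1_score(predicted, reference):
    """Harmonic mean of precision and recall between two collections of items."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def jaccard_index(predicted, reference):
    """Size of the intersection divided by the size of the union."""
    predicted, reference = set(predicted), set(reference)
    union = predicted | reference
    if not union:
        return 1.0  # convention used here: two empty sets count as identical
    return len(predicted & reference) / len(union)


# Hypothetical toy relations (source, target, relation type); not real biology.
predicted_relations = {
    ("geneA", "geneB", "activation"),
    ("geneC", "geneD", "inhibition"),
    ("geneE", "geneF", "phosphorylation"),
    ("geneX", "geneY", "activation"),
}
reference_relations = {
    ("geneA", "geneB", "activation"),
    ("geneC", "geneD", "inhibition"),
    ("geneG", "geneH", "phosphorylation"),
}

f1 = f1_score(predicted_relations, reference_relations)
jaccard = jaccard_index(predicted_relations, reference_relations)
print(f"F1 = {f1:.4f}, Jaccard = {jaccard:.4f}")  # F1 = 0.5714, Jaccard = 0.4000
```

In this toy case two of four predictions are correct (precision 0.5) and two of three reference relations are recovered (recall 2/3), which gives F1 = 4/7. The Jaccard index is lower (2/5) because every mismatch on either side enlarges the union.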