Pub Date: 2024-12-01; Epub Date: 2024-06-27; DOI: 10.1002/qub2.67
Jinge Wang, Zien Cheng, Qiuming Yao, Li Liu, Dong Xu, Gangqing Hu
The year 2023 marked a significant surge in the exploration of applying large language model chatbots, notably Chat Generative Pre-trained Transformer (ChatGPT), across various disciplines. We surveyed the application of ChatGPT in bioinformatics and biomedical informatics throughout the year, covering omics, genetics, biomedical text mining, drug discovery, biomedical image understanding, bioinformatics programming, and bioinformatics education. Our survey delineates the current strengths and limitations of this chatbot in bioinformatics and offers insights into potential avenues for future developments.
{"title":"Bioinformatics and biomedical informatics with ChatGPT: Year one review.","authors":"Jinge Wang, Zien Cheng, Qiuming Yao, Li Liu, Dong Xu, Gangqing Hu","doi":"10.1002/qub2.67","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-12-01","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446534/pdf/"}
Pub Date: 2024-12-01; Epub Date: 2024-06-21; DOI: 10.1002/qub2.57
Muhammad Azam, Yibo Chen, Micheal Olaolu Arowolo, Haowang Liu, Mihail Popescu, Dong Xu
Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways cannot keep up with the exponential growth of new discoveries in the literature. Large-scale language models (LLMs) trained on extensive text corpora contain rich biological information, and they can be mined as a biological knowledge graph. This study assesses 21 LLMs, including both application programming interface (API)-based models and open-source models, in their capacity to retrieve biological knowledge. The evaluation focuses on predicting gene regulatory relations (activation, inhibition, and phosphorylation) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway components. Results indicated a significant disparity in model performance. The API-based models GPT-4 and Claude-Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for gene regulatory relation prediction and Jaccard similarity indices of 0.2778 and 0.2657 for KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts; among them, Falcon-180b and llama2-7b had the highest F1 scores for gene regulatory relations, 0.2787 and 0.1923, respectively. For KEGG pathway recognition, the Jaccard similarity index was 0.2237 for Falcon-180b and 0.2207 for llama2-7b. Our study suggests that LLMs are informative in gene network analysis and pathway mapping, but their effectiveness varies, necessitating careful model selection. This work also provides a case study and insight into using LLMs as knowledge graphs. Our code is publicly available on GitHub (Muh-aza).
{"title":"A comprehensive evaluation of large language models in mining gene relations and pathway knowledge.","authors":"Muhammad Azam, Yibo Chen, Micheal Olaolu Arowolo, Haowang Liu, Mihail Popescu, Dong Xu","doi":"10.1002/qub2.57","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-12-01","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446478/pdf/"}
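The two metrics reported in the abstract above, F1 for gene regulatory relations and the Jaccard similarity index for KEGG pathway components, can be sketched as plain set comparisons. This is a minimal illustration, not the authors' evaluation code: the function names, the treatment of each relation as a `(source, target, relation)` triple, and the toy gene labels are all assumptions made for the example.

```python
def f1_score(predicted, reference):
    """F1 between a set of predicted items and a reference (gold) set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def jaccard_index(predicted, reference):
    """Size of the intersection divided by the size of the union."""
    predicted, reference = set(predicted), set(reference)
    union = predicted | reference
    if not union:
        return 0.0
    return len(predicted & reference) / len(union)


# Hypothetical toy data: placeholder gene names, not results from the study.
reference_edges = {
    ("GENE_A", "GENE_B", "activation"),
    ("GENE_B", "GENE_C", "inhibition"),
    ("GENE_C", "GENE_D", "phosphorylation"),
}
predicted_edges = {
    ("GENE_A", "GENE_B", "activation"),
    ("GENE_C", "GENE_D", "phosphorylation"),
    ("GENE_A", "GENE_D", "activation"),
    ("GENE_D", "GENE_B", "inhibition"),
}

edge_f1 = f1_score(predicted_edges, reference_edges)
edge_jaccard = jaccard_index(predicted_edges, reference_edges)
print(f"F1 = {edge_f1:.4f}, Jaccard = {edge_jaccard:.4f}")
```

Both scores reward overlap with the reference set, but Jaccard penalizes every extra or missing item through the union, which is one reason the two numbers in a benchmark like this are not directly comparable.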
Transformer‐based foundation models such as ChatGPT have revolutionized our daily life and affected many fields, including bioinformatics. In this perspective, we first discuss the direct application of textual foundation models to bioinformatics tasks, focusing on how to make the most of canonical large language models and mitigate their inherent flaws. We then review the transformer‐based, bioinformatics‐tailored foundation models for both sequence and non‐sequence data. Finally, we envision further development directions as well as challenges for bioinformatics foundation models.
{"title":"Foundation models for bioinformatics","authors":"Ziyu Chen, Lin Wei, Ge Gao","doi":"10.1002/qub2.69","DOIUrl":"https://doi.org/10.1002/qub2.69","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-07-24"}
Yang Li, Xiaonan Ren, Haochen Yu, Tao Sun, Shuangge Ma
Deep learning has become increasingly popular in omics data analysis. Recent works incorporating variable selection into deep learning have greatly enhanced model interpretability. However, because deep learning requires a large sample size, the existing methods may yield uncertain findings when the dataset is small, as is common in omics data analysis. With the explosion and availability of omics data from multiple populations/studies, the existing methods naively pool them into one dataset to increase the sample size while ignoring that variable structures can differ across datasets, which might lead to inaccurate variable selection results. We propose a penalized integrative deep neural network (PIN) to simultaneously select important variables from multiple datasets. PIN directly aggregates multiple datasets as input and considers both homogeneity and heterogeneity among the datasets in an integrative analysis framework. Results from extensive simulation studies and applications of PIN to gene expression datasets from elders with different cognitive statuses or ovarian cancer patients at different stages demonstrate that PIN outperforms existing methods, with considerably improved performance across multiple datasets. The source code is freely available on GitHub (rucliyang/PINFunc). We speculate that the proposed PIN method will promote the identification of disease‐related important variables based on multiple studies/datasets from diverse origins.
{"title":"A penalized integrative deep neural network for variable selection among multiple omics datasets","authors":"Yang Li, Xiaonan Ren, Haochen Yu, Tao Sun, Shuangge Ma","doi":"10.1002/qub2.51","DOIUrl":"https://doi.org/10.1002/qub2.51","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-06-07"}
Mutational signatures are distinct patterns of DNA mutations that occur in a specific context or under certain conditions. They are a powerful tool for describing cancer etiology. We conducted a study to show cancer heterogeneity and cancer specificity from the perspective of mutational signatures through collinearity analysis and machine learning techniques. Through thorough training and independent validation, our results show that while the majority of mutational signatures are distinct, similarities between certain mutational signature pairs can be observed in both mutation patterns and mutational signature abundance. This observation can potentially help determine the etiology of mutational signatures that remain elusive. Further analysis using machine learning approaches demonstrated moderate cancer specificity of mutational signatures. Among all cancer types, skin cancer demonstrated the strongest mutational signature specificity.
{"title":"Comprehensive cross cancer analyses reveal mutational signature cancer specificity","authors":"Rui Xin, Limin Jiang, Hui Yu, Fengyao Yan, Jijun Tang, Yan Guo","doi":"10.1002/qub2.49","DOIUrl":"https://doi.org/10.1002/qub2.49","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-06-05"}
Yahan Li, Xinrui Cai, J. Shang, Yuanyuan Zhang, Jinxing Liu
Epistasis is a ubiquitous phenomenon in genetics and is considered one of the main factors in current efforts to explain the missing heritability of complex diseases. Simulation data are crucial for evaluating epistasis detection tools in genome‐wide association studies (GWAS). Existing simulators normally suffer from two limitations: absence of support for high‐order epistasis models containing multiple single nucleotide polymorphisms (SNPs), and inability to generate simulation SNP data independently. In this study, we propose a simulator, SimHOEPI, which can calculate penetrance tables of high‐order epistasis models from either prevalence or heritability and uses a resampling strategy to generate simulation data independently. Highlights of SimHOEPI are the preservation of realistic minor allele frequencies in the sampled data, the accurate calculation and embedding of high‐order epistasis models, and acceptable simulation time. A series of experiments were carried out to verify these properties from different aspects. Experimental results show that SimHOEPI can generate simulation SNP data independently with high‐order epistasis models, implying that it might be an alternative simulator for GWAS.
{"title":"SimHOEPI: A resampling simulator for generating single nucleotide polymorphism data with a high‐order epistasis model","authors":"Yahan Li, Xinrui Cai, J. Shang, Yuanyuan Zhang, Jinxing Liu","doi":"10.1002/qub2.42","DOIUrl":"https://doi.org/10.1002/qub2.42","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-04-16"}
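The abstract above ties a penetrance table to prevalence and heritability. The sketch below shows the forward direction of that relationship for a k-SNP model, using the textbook definitions: prevalence is the genotype-frequency-weighted mean penetrance, and heritability is the variance of penetrance divided by K(1 − K). It is not SimHOEPI's algorithm. Hardy-Weinberg genotype frequencies, independence between SNPs, the function names, and the toy threshold model are all assumptions made for illustration.

```python
from itertools import product


def genotype_frequencies(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    q = maf
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]


def prevalence_and_heritability(mafs, penetrance):
    """Summarize a penetrance table for len(mafs) SNPs.

    `penetrance` maps a genotype tuple (one value in {0, 1, 2} per SNP)
    to P(disease | genotype). SNPs are assumed independent (an assumption).
    """
    per_snp = [genotype_frequencies(maf) for maf in mafs]
    cells = []  # (genotype probability, penetrance) for every table cell
    for combo in product(range(3), repeat=len(mafs)):
        probability = 1.0
        for snp_index, genotype in enumerate(combo):
            probability *= per_snp[snp_index][genotype]
        cells.append((probability, penetrance[combo]))

    prevalence = sum(prob * f for prob, f in cells)
    if prevalence <= 0.0 or prevalence >= 1.0:
        raise ValueError("heritability is undefined when prevalence is 0 or 1")
    variance = sum(prob * (f - prevalence) ** 2 for prob, f in cells)
    return prevalence, variance / (prevalence * (1.0 - prevalence))


# Hypothetical third-order threshold model: risk is elevated only when all
# three SNPs carry at least one minor allele.
toy_table = {
    combo: (0.5 if all(g > 0 for g in combo) else 0.05)
    for combo in product(range(3), repeat=3)
}
toy_prevalence, toy_heritability = prevalence_and_heritability(
    [0.5, 0.5, 0.5], toy_table
)
print(toy_prevalence, toy_heritability)
```

A simulator that starts from a target prevalence or heritability has to solve the inverse problem, choosing penetrance values so that these two summaries hit the requested numbers; the sketch only covers the forward check.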
Chenrui Qin, Tong Xu, Xuejin Zhao, Yeqing Zong, Haoqian M. Zhang, Chunbo Lou, Ouyang Qi, Long Qian
Although the principles of synthetic biology were initially established in model bacteria, microbial producers, extremophiles and gut microbes have now emerged as valuable prokaryotic chassis for biological engineering. Extending the host range in which designed circuits can function reliably and predictably presents a major challenge for the concept of synthetic biology to materialize. In this work, we systematically characterized the cross‐species universality of two transcriptional regulatory modules—the T7 RNA polymerase activator module and the repressors module—in three non‐model microbes. We found striking linear relationships in circuit activities among different organisms for both modules. Parametrized model fitting revealed host non‐specific parameters defining the universality of both modules. Lastly, a genetic NOT gate and a band‐pass filter circuit were constructed from these modules and tested in non‐model organisms. Combined models employing host non‐specific parameters were successful in quantitatively predicting circuit behaviors, underscoring the potential of universal biological parts and predictive modeling in synthetic bioengineering.
{"title":"Functional predictability of universal gene circuits in diverse microbial hosts","authors":"Chenrui Qin, Tong Xu, Xuejin Zhao, Yeqing Zong, Haoqian M. Zhang, Chunbo Lou, Ouyang Qi, Long Qian","doi":"10.1002/qub2.41","DOIUrl":"https://doi.org/10.1002/qub2.41","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-04-14"}
Effective clinical trials are necessary for understanding medical advances, but early termination of trials can result in unnecessary waste of resources. Survival models can be used to predict survival probabilities in such trials. However, survival data from clinical trials are sparse, and DeepSurv cannot accurately capture their effective features, which weakens the model's generalization and decreases its prediction accuracy. In this paper, we propose a survival prediction model for clinical trial completion based on the combination of denoising autoencoder (DAE) and DeepSurv models. The DAE is used to obtain a robust representation of features by breaking the loop of raw features after autoencoder training, and then the robust features are provided to DeepSurv as input for training. The clinical trial dataset for training the model was obtained from the ClinicalTrials.gov dataset. A study of clinical trial completion in pregnant women was conducted in response to the fact that many current clinical trials exclude pregnant women. The experimental results showed that the denoising autoencoder and deep survival regression (DAE‐DSR) model was able to extract meaningful and robust features for survival analysis; the C‐indices on the training and test datasets were 0.74 and 0.75, respectively. Compared with the Cox proportional hazards model and the DeepSurv model, the survival analysis curves obtained with the DAE‐DSR model had more prominent features, and the model was more robust and performed better in actual prediction.
{"title":"A clinical trial termination prediction model based on denoising autoencoder and deep survival regression","authors":"Huamei Qi, Wenhui Yang, Wenqin Zou, Yuxuan Hu","doi":"10.1002/qub2.43","DOIUrl":"https://doi.org/10.1002/qub2.43","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-04-12"}
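The C‐index quoted in the abstract above measures how often a survival model ranks pairs of subjects correctly. A minimal sketch of Harrell's concordance index follows. It is a generic illustration rather than the authors' evaluation code: the function name, the convention that a higher score means higher risk (earlier event), and the toy values are assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index.

    A pair (i, j) is comparable when subject i has an observed event and
    times[i] < times[j]. The pair is concordant when the model gives i the
    higher risk score; ties in the score count as half. Pairs with tied
    times are skipped in this simplified version.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up time, event flag (1 = event observed,
# 0 = censored) and a model risk score per subject.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.4, 0.5, 0.1]

toy_c_index = concordance_index(toy_times, toy_events, toy_risks)
print(toy_c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.74 and 0.75 should be read.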
The identification of tumor driver genes facilitates accurate cancer diagnosis and treatment, playing a key role in precision oncology, along with gene signaling, regulation, and their interaction with protein complexes. To tackle the challenge of distinguishing driver genes within a large volume of genomic data, we construct a feature extraction framework for discovering pan‐cancer driver genes based on multi‐omics data (mutations, gene expression, copy number variants, and DNA methylation) combined with protein–protein interaction (PPI) networks. Using a network propagation algorithm, we mine functional information among nodes in the PPI network, focusing on genes with weak node information to represent specific cancer information. From these functional features, we extract three kinds of pan‐cancer features: distribution features of the pan‐cancer data; TOPSIS features, obtained with the ideal solution method; and SetExpan features, obtained with a method that ranks pan‐cancer data by average inverse rank. These features represent the information common to pan‐cancer data. Finally, we use the lightGBM classification algorithm for gene prediction. Experimental results show that our method outperforms existing methods in terms of the area under the precision‐recall curve (AUPRC) and demonstrates better performance across different PPI networks. This indicates our framework's effectiveness in predicting potential cancer genes, offering valuable insights for the diagnosis and treatment of tumors.
{"title":"A feature extraction framework for discovering pan‐cancer driver genes based on multi‐omics data","authors":"Xiaomeng Xue, Feng Li, J. Shang, Lingyun Dai, Daohui Ge, Qianqian Ren","doi":"10.1002/qub2.40","DOIUrl":"https://doi.org/10.1002/qub2.40","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-04-05"}
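The abstract above names TOPSIS (the ideal solution method) as one of its feature extractors. The sketch below shows the generic TOPSIS closeness score on a small matrix. How the authors arrange genes, cancer types and functional features into that matrix is not stated in the abstract, so the layout here (rows as genes, columns as per-cancer scores), the equal weights, and the function name are assumptions for illustration only.

```python
import math


def topsis(matrix, weights, benefit):
    """Return one closeness score in [0, 1] per row of `matrix`.

    matrix  : list of rows, one per alternative (here: one per gene)
    weights : one weight per column
    benefit : one bool per column, True when larger values are better
    """
    n_cols = len(weights)
    # Vector-normalize each column, then apply the column weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [
        [weights[j] * row[j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for row in matrix
    ]
    columns = list(zip(*weighted))
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    anti_ideal = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]

    scores = []
    for row in weighted:
        distance_to_ideal = math.dist(row, ideal)
        distance_to_anti = math.dist(row, anti_ideal)
        total = distance_to_ideal + distance_to_anti
        scores.append(distance_to_anti / total if total else 0.0)
    return scores


# Hypothetical scores for three genes across two cancer types.
toy_matrix = [
    [0.9, 0.8],  # strong in both cancer types
    [0.5, 0.4],
    [0.1, 0.2],  # weak in both cancer types
]
toy_scores = topsis(toy_matrix, weights=[0.5, 0.5], benefit=[True, True])
print(toy_scores)
```

The score is the distance to the worst case divided by the sum of the distances to the best and worst cases, so a row that is best in every column scores 1 and a row that is worst in every column scores 0.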
Predicting the interaction between a drug and a target is one of the most critical issues in drug development and repurposing. However, two challenges remain in current deep learning research: (i) the structural information of drug molecules is not fully explored in most drug–target studies, and the previously used drug SMILES does not correspond well to effective drug molecules; and (ii) the exploration of potential relationships between drugs and targets needs improvement. In this work, we use SELFIES, a new and better representation of the effective molecular graph structure. We propose a hybrid mechanism framework based on a convolutional neural network and a graph attention network to capture multi‐view feature information of drug and target molecular structures, aiming to enhance the ability to capture interaction sites between a drug and a target. Our experiments on two different datasets show that the GCARDTI model outperforms a variety of other models on different metrics. We also demonstrate the accuracy of our model through two case studies.
{"title":"GCARDTI: Drug–target interaction prediction based on a hybrid mechanism in drug SELFIES","authors":"Yinfei Feng, Yuanyuan Zhang, Zengqian Deng, Mimi Xiong","doi":"10.1002/qub2.39","DOIUrl":"https://doi.org/10.1002/qub2.39","journal":{"name":"Quantitative Biology"},"publicationDate":"2024-04-01"}