Briefings in Bioinformatics: latest articles

CAPE: a deep learning framework with Chaos-Attention net for Promoter Evolution.
IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS. Pub Date: 2024-07-25. DOI: 10.1093/bib/bbae398
Ruohan Ren, Hongyu Yu, Jiahao Teng, Sihui Mao, Zixuan Bian, Yangtianze Tao, Stephen S-T Yau

Predicting the strength of promoters and guiding their directed evolution is a crucial task in synthetic biology. This approach significantly reduces the experimental costs in conventional promoter engineering. Previous studies employing machine learning or deep learning methods have shown some success in this task, but their outcomes were not satisfactory enough, primarily due to the neglect of evolutionary information. In this paper, we introduce the Chaos-Attention net for Promoter Evolution (CAPE) to address the limitations of existing methods. We comprehensively extract evolutionary information within promoters using merged chaos game representation and process the overall information with modified DenseNet and Transformer structures. Our model achieves state-of-the-art results on two kinds of distinct tasks related to prokaryotic promoter strength prediction. The incorporation of evolutionary information enhances the model's accuracy, with transfer learning further extending its adaptability. Furthermore, experimental results confirm CAPE's efficacy in simulating in silico directed evolution of promoters, marking a significant advancement in predictive modeling for prokaryotic promoter strength. Our paper also presents a user-friendly website for the practical implementation of in silico directed evolution on promoters. The source code implemented in this study and the instructions on accessing the website can be found in our GitHub repository https://github.com/BobYHY/CAPE.
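
The chaos game representation mentioned in this abstract can be illustrated with a minimal sketch. This is the plain, textbook chaos game representation and its k-mer frequency grid, not the "merged" variant used by CAPE, whose details are not given here. The corner assignment for A, C, G, T and the function names `cgr_points` and `fcgr` are assumptions made for illustration.

```python
# Corner assignment is one common convention; CAPE may use a different one.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the chaos game trajectory of a DNA sequence in the unit square."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count trajectory points per cell of a 2^k x 2^k grid (frequency CGR).

    Each cell corresponds to one k-mer, so the grid is a 2D k-mer histogram.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i, (x, y) in enumerate(cgr_points(seq)):
        if i < k - 1:
            continue  # the first k-1 points do not yet encode a full k-mer
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

A grid such as `fcgr(promoter, 4)` is an image-like array, which is the kind of input a convolutional network like DenseNet can consume.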

Multiview representation learning for identification of novel cancer genes and their causative biological mechanisms.
IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS. Pub Date: 2024-07-25. DOI: 10.1093/bib/bbae418
Jianye Yang, Haitao Fu, Feiyang Xue, Menglu Li, Yuyang Wu, Zhanhui Yu, Haohui Luo, Jing Gong, Xiaohui Niu, Wen Zhang

Tumorigenesis arises from the dysfunction of cancer genes, leading to uncontrolled cell proliferation through various mechanisms. Establishing a complete cancer gene catalogue will make precision oncology possible. Although existing methods based on graph neural networks (GNN) are effective in identifying cancer genes, they fall short in effectively integrating data from multiple views and interpreting predictive outcomes. To address these shortcomings, an interpretable representation learning framework IMVRL-GCN is proposed to capture both shared and specific representations from multiview data, offering significant insights into the identification of cancer genes. Experimental results demonstrate that IMVRL-GCN outperforms state-of-the-art cancer gene identification methods and several baselines. Furthermore, IMVRL-GCN is employed to identify a total of 74 high-confidence novel cancer genes, and multiview data analysis highlights the pivotal roles of shared, mutation-specific, and structure-specific representations in discriminating distinctive cancer genes. Exploration of the mechanisms behind their discriminative capabilities suggests that shared representations are strongly associated with gene functions, while mutation-specific and structure-specific representations are linked to mutagenic propensity and functional synergy, respectively. Finally, our in-depth analyses of these candidates suggest potential insights for individualized treatments: afatinib could counteract many mutation-driven risks, and targeting interactions with cancer gene SRC is a reasonable strategy to mitigate interaction-induced risks for NR3C1, RXRA, HNF4A, and SP1.

ReadCurrent: a VDCNN-based tool for fast and accurate nanopore selective sequencing.
IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS. Pub Date: 2024-07-25. DOI: 10.1093/bib/bbae435
Kechen Fan, Mengfan Li, Jiarong Zhang, Zihan Xie, Daguang Jiang, Xiaochen Bo, Dongsheng Zhao, Shenghui Shi, Ming Ni

Nanopore selective sequencing allows the targeted sequencing of DNA of interest using computational approaches rather than experimental methods such as targeted multiplex polymerase chain reaction or hybridization capture. Compared to sequence-alignment strategies, deep learning (DL) models for classifying target and nontarget DNA provide large speed advantages. However, the relatively low accuracy of these DL-based tools hinders their application in nanopore selective sequencing. Here, we present a DL-based tool named ReadCurrent for nanopore selective sequencing, which takes electric currents as inputs. ReadCurrent employs a modified very deep convolutional neural network (VDCNN) architecture, enabling significantly lower computational costs for training and quicker inference compared to conventional VDCNN. We evaluated the performance of ReadCurrent across 10 nanopore sequencing datasets spanning human, yeasts, bacteria, and viruses. We observed that ReadCurrent achieved a mean accuracy of 98.57% for classification, outperforming four other DL-based selective sequencing methods. In experimental validation that selectively sequenced microbial DNA from human DNA, ReadCurrent achieved an enrichment ratio of 2.85, which was higher than the 2.7 ratio achieved by MinKNOW using the sequence-alignment strategy. In summary, ReadCurrent can rapidly classify target and nontarget DNA with high accuracy, providing an alternative in the toolbox for nanopore selective sequencing. ReadCurrent is available at https://github.com/Ming-Ni-Group/ReadCurrent.

xSiGra: explainable model for single-cell spatial data elucidation.
IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS. Pub Date: 2024-07-25. DOI: 10.1093/bib/bbae388
Aishwarya Budhkar, Ziyang Tang, Xiang Liu, Xuhong Zhang, Jing Su, Qianqian Song

Recent advancements in spatial imaging technologies have revolutionized the acquisition of high-resolution multichannel images, gene expressions, and spatial locations at the single-cell level. Our study introduces xSiGra, an interpretable graph-based AI model, designed to elucidate interpretable features of identified spatial cell types, by harnessing multimodal features from spatial imaging technologies. By constructing a spatial cellular graph with immunohistology images and gene expression as node attributes, xSiGra employs hybrid graph transformer models to delineate spatial cell types. Additionally, xSiGra integrates a novel variant of gradient-weighted class activation mapping component to uncover interpretable features, including pivotal genes and cells for various cell types, thereby facilitating deeper biological insights from spatial data. Through rigorous benchmarking against existing methods, xSiGra demonstrates superior performance across diverse spatial imaging datasets. Application of xSiGra on a lung tumor slice unveils the importance score of cells, illustrating that cellular activity is not solely determined by itself but also impacted by neighboring cells. Moreover, leveraging the identified interpretable genes, xSiGra reveals endothelial cell subset interacting with tumor cells, indicating its heterogeneous underlying mechanisms within complex cellular interactions.

HTINet2: herb-target prediction via knowledge graph embedding and residual-like graph neural network.
IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS. Pub Date: 2024-07-25. DOI: 10.1093/bib/bbae414
Pengbo Duan, Kuo Yang, Xin Su, Shuyue Fan, Xin Dong, Fenghui Zhang, Xianan Li, Xiaoyan Xing, Qiang Zhu, Jian Yu, Xuezhong Zhou

Target identification is one of the crucial tasks in drug research and development, as it aids in uncovering the action mechanism of herbs/drugs and discovering new therapeutic targets. Although multiple algorithms of herb target prediction have been proposed, due to the incompleteness of clinical knowledge and the limitation of unsupervised models, accurate identification for herb targets still faces huge challenges of data and models. To address this, we proposed a deep learning-based target prediction framework termed HTINet2, which designed three key modules, namely, traditional Chinese medicine (TCM) and clinical knowledge graph embedding, residual graph representation learning, and supervised target prediction. In the first module, we constructed a large-scale knowledge graph that covers the TCM properties and clinical treatment knowledge of herbs, and designed a component of deep knowledge embedding to learn the deep knowledge embedding of herbs and targets. In the remaining two modules, we designed a residual-like graph convolution network to capture the deep interactions among herbs and targets, and a Bayesian personalized ranking loss to conduct supervised training and target prediction. Finally, we designed comprehensive experiments, of which comparison with baselines indicated the excellent performance of HTINet2 (HR@10 increased by 122.7% and NDCG@10 by 35.7%), ablation experiments illustrated the positive effect of our designed modules of HTINet2, and case study demonstrated the reliability of the predicted targets of Artemisia annua and Coptis chinensis based on the knowledge base, literature, and molecular docking.
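
The Bayesian personalized ranking loss and the HR@10 and NDCG@10 metrics named in this abstract can be sketched as follows. This is a generic illustration, not the HTINet2 implementation: the function names are hypothetical, relevance is assumed to be binary, and the hit ratio is computed per query (per herb), which is one of several conventions in use.

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Mean Bayesian personalized ranking loss over (positive, negative) score pairs.

    Each pair contributes -log(sigmoid(pos - neg)), which is small when the
    known target is scored above the sampled non-target.
    """
    total = 0.0
    for pos, neg in zip(pos_scores, neg_scores):
        diff = pos - neg
        # -log(sigmoid(diff)) == softplus(-diff), written in a numerically stable form
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pos_scores)


def hit_ratio_at_k(ranked, relevant, k=10):
    """1.0 if any relevant item appears in the top k of the ranked list, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG@k: discounted gain of hits divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging `hit_ratio_at_k` and `ndcg_at_k` over all test herbs gives dataset-level HR@10 and NDCG@10 under these assumptions.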

PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset.
IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS. Pub Date: 2024-07-25. DOI: 10.1093/bib/bbae404
Xuechun Zhang, Xiaoxuan Hu, Tongtong Zhang, Ling Yang, Chunhong Liu, Ning Xu, Haoyi Wang, Wen Sun

Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. With the reduction in sequencing and gene synthesis costs, the adoption of high-throughput experimental screening coupled with tailored bioinformatic prediction has witnessed a rapidly growing trend for the development of novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process and accurate prediction of solubility is a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract intrinsic information from protein sequences to a greater extent. Leveraging these models along with the increasing availability of protein solubility data inferred from structural database like the Protein Data Bank holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previous reported models, achieving a notable 6.4% increase in accuracy, 9.0% increase in F1_score, and 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation utilizing our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showcased the good performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both independent test set and experimental set, thereby making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.
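
For readers unfamiliar with the three metrics reported in this abstract, the sketch below computes accuracy, F1 score, and the Matthews correlation coefficient for a binary soluble/insoluble classifier. The function name `binary_metrics` and the label encoding (1 = soluble, 0 = insoluble) are assumptions for illustration and are not taken from PLM_Sol.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return accuracy, F1, and Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy for imbalanced solubility datasets.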

Herb-CMap: a multimodal fusion framework for deciphering the mechanisms of action in traditional Chinese medicine using Suhuang antitussive capsule as a case study.
IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS. Pub Date: 2024-07-25. DOI: 10.1093/bib/bbae362
Yinyin Wang, Yihang Sui, Jiaqi Yao, Hong Jiang, Qimeng Tian, Yun Tang, Yongyu Ou, Jing Tang, Ninghua Tan

Herbal medicines, particularly traditional Chinese medicines (TCMs), are a rich source of natural products with significant therapeutic potential. However, understanding their mechanisms of action is challenging due to the complexity of their multi-ingredient compositions. We introduced Herb-CMap, a multimodal fusion framework leveraging protein-protein interactions and herb-perturbed gene expression signatures. Utilizing a network-based heat diffusion algorithm, Herb-CMap creates a connectivity map linking herb perturbations to their therapeutic targets, thereby facilitating the prioritization of active ingredients. As a case study, we applied Herb-CMap to Suhuang antitussive capsule (Suhuang), a TCM formula used for treating cough variant asthma (CVA). Using in vivo rat models, our analysis established the transcriptomic signatures of Suhuang and identified its key compounds, such as quercetin and luteolin, and their target genes, including IL17A, PIK3CB, PIK3CD, AKT1, and TNF. These drug-target interactions inhibit the IL-17 signaling pathway and deactivate PI3K, AKT, and NF-κB, effectively reducing lung inflammation and alleviating CVA. The study demonstrates the efficacy of Herb-CMap in elucidating the molecular mechanisms of herbal medicines, offering valuable insights for advancing drug discovery in TCM.
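
The idea of network-based heat diffusion can be sketched on a toy graph. This is a generic illustration and not the Herb-CMap algorithm: it assumes an undirected, unweighted network, uses the unnormalized graph Laplacian L = D - A, and approximates exp(-tL) applied to the initial heat vector with explicit Euler steps. The function name, parameters, and node names are hypothetical.

```python
def heat_diffusion(adjacency, heat, t=0.5, steps=100):
    """Diffuse initial heat over an undirected graph for total time t.

    adjacency: dict mapping each node to a list of neighbours (must be symmetric,
               and every neighbour must also appear as a key).
    heat:      dict mapping seed nodes to their initial heat (others start at 0).
    The step size t / steps should stay well below 1 / (maximum node degree)
    for the explicit scheme to remain stable.
    """
    nodes = list(adjacency)
    h = {n: float(heat.get(n, 0.0)) for n in nodes}
    dt = t / steps
    for _ in range(steps):
        new_h = {}
        for n in nodes:
            # Net flow into n is the sum over neighbours of (h[m] - h[n]),
            # which equals minus the n-th entry of L @ h.
            flow = sum(h[m] - h[n] for m in adjacency[n])
            new_h[n] = h[n] + dt * flow
        h = new_h
    return h


# Toy path graph g1 - g2 - g3 with all heat initially on g1 (hypothetical nodes).
toy_network = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2"]}
diffused = heat_diffusion(toy_network, {"g1": 1.0})
```

In a setting like the one described, seed heat would come from herb-perturbed gene expression signatures and the graph from protein-protein interactions, so nodes that end up with high heat are network neighbours of strongly perturbed genes.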

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11285169/pdf/
Citations: 0
CHAI: consensus clustering through similarity matrix integration for cell-type identification.
IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-07-25 DOI: 10.1093/bib/bbae411
Musaddiq K Lodi, Muzammil Lodi, Kezie Osei, Vaishnavi Ranganathan, Priscilla Hwang, Preetam Ghosh

Several methods have been developed to computationally predict cell-types for single cell RNA sequencing (scRNAseq) data. As methods are developed, a common problem for investigators has been identifying the best method they should apply to their specific use-case. To address this challenge, we present CHAI (consensus Clustering tHrough similArIty matrix integratIon for single cell-type identification), a wisdom of crowds approach for scRNAseq clustering. CHAI presents two competing methods which aggregate the clustering results from seven state-of-the-art clustering methods: CHAI-AvgSim and CHAI-SNF. CHAI-AvgSim and CHAI-SNF demonstrate superior performance across several benchmarking datasets. Furthermore, both CHAI methods outperform the most recent consensus clustering method, SAME-clustering. We demonstrate CHAI's practical use case by identifying a leader tumor cell cluster enriched with CDH3. CHAI provides a platform for multiomic integration, and we demonstrate CHAI-SNF to have improved performance when including spatial transcriptomics data. CHAI overcomes previous limitations by incorporating the most recent and top performing scRNAseq clustering algorithms into the aggregation framework. It is also an intuitive and easily customizable R package where users may add their own clustering methods to the pipeline, or down-select just the ones they want to use for the clustering aggregation. This ensures that as more advanced clustering algorithms are developed, CHAI will remain useful to the community as a generalized framework. CHAI is available as an open source R package on GitHub: https://github.com/lodimk2/chai.
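
The idea of aggregating several clustering results through a similarity matrix can be sketched as follows. This is a generic consensus-clustering illustration, not CHAI's implementation (CHAI is an R package): it assumes a binary co-association matrix per method, averages them, and then groups cells by connected components above a threshold, a simple stand-in for whatever final clustering step CHAI-AvgSim actually uses. All function names and the toy labels are hypothetical.

```python
import numpy as np

def co_association(labels):
    """Binary cell-by-cell similarity: 1 where two cells share a cluster label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def average_similarity(label_sets):
    """Average the co-association matrices from several clustering runs."""
    return np.mean([co_association(l) for l in label_sets], axis=0)

def consensus_labels(similarity, threshold=0.5):
    """Group cells linked by similarity > threshold (connected components)."""
    n = similarity.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and similarity[i, j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three hypothetical methods clustering six cells.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],   # same partition, different label names
    [0, 0, 1, 1, 2, 2],   # a dissenting method
]
similarity = average_similarity(runs)
consensus = consensus_labels(similarity)
print(consensus)
```

Working on co-association matrices rather than raw labels sidesteps the problem that different methods name their clusters differently, and the averaging lets two agreeing methods outvote the dissenting one.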

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11359802/pdf/
Citations: 0
Large language model produces high accurate diagnosis of cancer from end-motif profiles of cell-free DNA.
IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-07-25 DOI: 10.1093/bib/bbae430
Jilei Liu, Hongru Shen, Kexin Chen, Xiangchun Li

Instruction-tuned large language models (LLMs) demonstrate exceptional ability to align with human intentions. We present an LLM-based model, instruction-tuned LLM for assessment of cancer (iLLMAC), that can detect cancer using cell-free deoxyribonucleic acid (cfDNA) end-motif profiles. Developed on plasma cfDNA sequencing data from 1135 cancer patients and 1106 controls across three datasets, iLLMAC achieved an area under the receiver operating characteristic curve (AUROC) of 0.866 [95% confidence interval (CI), 0.773-0.959] for cancer diagnosis and 0.924 (95% CI, 0.841-1.0) for hepatocellular carcinoma (HCC) detection using 16 end-motifs. Performance increased with more motifs, reaching 0.886 (95% CI, 0.794-0.977) and 0.956 (95% CI, 0.89-1.0) for cancer diagnosis and HCC detection, respectively, with 64 end-motifs. On an external-testing set, iLLMAC achieved an AUROC of 0.912 (95% CI, 0.849-0.976) for cancer diagnosis and 0.938 (95% CI, 0.885-0.992) for HCC detection with 64 end-motifs, significantly outperforming benchmarked methods. Furthermore, iLLMAC achieved high classification performance on datasets with bisulfite and 5-hydroxymethylcytosine sequencing. Our study highlights the effectiveness of LLM-based instruction-tuning for cfDNA-based cancer detection.
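
An end-motif profile can be illustrated with a short sketch. The abstract does not say how the 16 and 64 end-motifs are defined; the counts match all possible 2-mers and 3-mers over A, C, G, T, and that reading is assumed here. The function name, the choice of the 5' end, and the toy fragments are hypothetical and do not come from iLLMAC.

```python
from collections import Counter
from itertools import product

def end_motif_profile(fragments, k=2):
    """Return the frequency of each k-mer found at the 5' end of the fragments."""
    motifs = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for seq in fragments:
        motif = seq[:k].upper()
        # Skip fragments that are too short or contain ambiguous bases.
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    return {m: (counts[m] / total if total else 0.0) for m in motifs}

# Hypothetical fragment sequences; one has an ambiguous base, one is too short.
fragments = ["CCAGTT", "CCTGAA", "AATGCC", "GNTTAC", "C"]
profile = end_motif_profile(fragments, k=2)
print(len(profile), profile["CC"], profile["AA"])
```

Under this assumption, `k=2` yields a 16-value profile and `k=3` a 64-value profile per sample, which would then be the input a classifier sees.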

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11367762/pdf/
Citations: 0
Synthetic lethal connectivity and graph transformer improve synthetic lethality prediction.
IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-07-25 DOI: 10.1093/bib/bbae425
Kunjie Fan, Birkan Gökbağ, Shan Tang, Shangjia Li, Yirui Huang, Lingling Wang, Lijun Cheng, Lang Li

Synthetic lethality (SL) has shown great promise for the discovery of novel targets in cancer. CRISPR double-knockout (CDKO) technologies can only screen several hundred genes and their combinations, but not genome-wide. Therefore, good SL prediction models are highly needed for genes and gene pairs selection in CDKO experiments. However, lack of scalable SL properties prevents generalizability of SL interactions to out-of-sample data, thereby hindering modeling efforts. In this paper, we recognize that SL connectivity is a scalable and generalizable SL property. We develop a novel two-step multilayer encoder for individual sample-specific SL prediction model (MLEC-iSL), which predicts SL connectivity first and SL interactions subsequently. MLEC-iSL has three encoders, namely, gene, graph, and transformer encoders. MLEC-iSL achieves high SL prediction performance in K562 (AUPR, 0.73; AUC, 0.72) and Jurkat (AUPR, 0.73; AUC, 0.71) cells, while no existing methods exceed 0.62 AUPR and AUC. The prediction performance of MLEC-iSL is validated in a CDKO experiment in 22Rv1 cells, yielding a 46.8% SL rate among 987 selected gene pairs. The screen also reveals SL dependency between apoptosis and mitosis cell death pathways.
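
The abstract does not define SL connectivity. One plausible reading, assumed purely for illustration, is the fraction of a gene's screened pairs that turn out to be synthetic lethal. The sketch below computes that quantity from a toy screen; the function name, the gene names, and the definition itself are hypothetical and should not be taken as the MLEC-iSL method.

```python
from collections import defaultdict

def sl_connectivity(screened_pairs):
    """Fraction of each gene's screened pairs that are synthetic lethal.

    screened_pairs: iterable of (gene_a, gene_b, is_sl) tuples.
    """
    tested = defaultdict(int)
    lethal = defaultdict(int)
    for gene_a, gene_b, is_sl in screened_pairs:
        for gene in (gene_a, gene_b):
            tested[gene] += 1
            lethal[gene] += int(is_sl)
    return {gene: lethal[gene] / tested[gene] for gene in tested}

# Hypothetical double-knockout results.
pairs = [
    ("G1", "G2", True),
    ("G1", "G3", True),
    ("G1", "G4", False),
    ("G2", "G3", False),
]
connectivity = sl_connectivity(pairs)
print(connectivity)
```

A per-gene score of this kind depends only on the gene, not on a specific partner, which is one way a property could carry over to unseen samples before pair-level interactions are predicted. For scale, the reported 46.8% SL rate among 987 selected gene pairs corresponds to roughly 462 SL pairs.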

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11361842/pdf/
Citations: 0