Current Bioinformatics: Latest Articles (Page 3)

Enhancing Drug Peptide Sequence Prediction Using Multi-view Feature Fusion Learning

IF 4 Zone 3 (Biology) Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-05-27 DOI: 10.2174/0115748936294345240510112941

Junyu Zhang, Ronglin Lu, Hongmei Zhou, Xinbo Jiang

Background: Peptides have broad implications for human health and disease, and some drug peptides play significant roles in sensory science, drug research, and cancer biology. Predicting and classifying peptide sequences therefore matters to many industries, but doing so through biological experiments is time-consuming and expensive. Protein sequence classification and prediction are further complicated by the high dimensionality, nonlinearity, and irregularity of protein sequence data, and by the large number of unknown or unlabeled protein sequences. An accurate and efficient method for predicting peptide classes is therefore needed. Methods: We used two pre-trained models, TextCNN (Convolutional Neural Networks for Text Classification) and Transformer, to extract sequence features. The Transformer Encoder extracted the overall semantic information of each sequence, TextCNN extracted local semantic information, and the two outputs were concatenated into a new feature, which was then used for classification. To validate this approach, we ran experiments on the BP, THP, and DPP-IV datasets and compared the results with those of several pre-trained models. Results: Because TextCNN and the Transformer Encoder extract features from different perspectives, the concatenated feature contains multi-view information, which improves the accuracy of the peptide predictor. Conclusion: Our model achieved superior metrics, highlighting its efficacy in peptide sequence prediction and classification.
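
The fusion step described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the `embed` function, the window-based "local view", and the mean-pooled "global view" are simple stand-ins for TextCNN and the Transformer Encoder, not the authors' models. The only point shown is that two feature vectors computed from different perspectives are concatenated into one.

```python
# Toy multi-view fusion. The embedding and pooling below are illustrative
# stand-ins, not the TextCNN or Transformer models from the paper.

def embed(residue):
    """Hypothetical 2-dimensional embedding of one amino-acid letter."""
    code = ord(residue) - ord("A")
    return [code / 25.0, (code % 5) / 4.0]

def local_view(seq, window=3):
    """CNN-like view: average each window, then max-pool over positions.

    Assumes len(seq) >= window.
    """
    vectors = [embed(r) for r in seq]
    dim = len(vectors[0])
    pooled = [float("-inf")] * dim
    for start in range(len(vectors) - window + 1):
        chunk = vectors[start:start + window]
        for d in range(dim):
            value = sum(v[d] for v in chunk) / window
            pooled[d] = max(pooled[d], value)
    return pooled

def global_view(seq):
    """Encoder-like view: mean over all positions of the sequence."""
    vectors = [embed(r) for r in seq]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def fuse(seq):
    """Concatenate the two views into a single feature vector."""
    return local_view(seq) + global_view(seq)

features = fuse("ACDKLM")
print(len(features), features)
```

In a real pipeline the fused vector would be passed to a classifier; the concatenation itself adds no parameters, which is one reason it is a common baseline for combining views.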

Citations: 0

Validating the Distinctiveness of the Omicron Lineage within the SARSCov-2 based on Protein Language Models

IF 4 Zone 3 (Biology) Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-04-30 DOI: 10.2174/0115748936291075240409080924

Ke Dong, Jingyang Gao

Introduction: Five variants of concern (VOCs) have been identified in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2): Alpha, Beta, Gamma, Delta, and Omicron. This study uses a protein language model to explore the mutations of the Omicron lineage and how they differ from those of other lineages. Objective: To analyze, using statistical methods, how the number of Omicron amino acid mutations differs from that of the other four VOCs, and to use the protein language model ESM-1v to analyze the specificity of Omicron amino acid mutations. Methods: The SARS-CoV-2 wild-type sequence was input into ESM-1v to obtain a score for each position mutating to every other amino acid, and the overall trend of mutation scores was calculated for each new variant of concern. Results: Even when the ratio of unobserved to observed mutations is 4:15, Omicron still generates a large number of newly emerging mutations. Both the overall score and the overall ranking of the Omicron family are low. Conclusion: Mutations in the Omicron lineage differ from amino acid mutations in other lineages. The findings deepen the understanding of the spatial distribution of spike protein amino acid mutations and of the overall trends of newly emerging mutations in different variants of concern. They also provide insights into simulating the evolution of the Omicron lineage.
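
The abstract does not state how the per-position scores are computed. A common way to score a substitution with a masked protein language model is the log-odds of the mutant residue against the wild-type residue at the same position, and the sketch below assumes that formulation. The two-residue sequence and the probability table are made-up placeholders for real model output, and the helper names are hypothetical.

```python
import math

# Hypothetical per-position amino-acid probabilities, standing in for the
# output of a protein language model. These numbers are made up.
wild_type = "MK"
probs = [
    {"M": 0.8, "K": 0.1, "L": 0.1},
    {"M": 0.2, "K": 0.5, "L": 0.3},
]

def mutation_score(position, mutant):
    """Log-odds of the mutant residue against the wild-type residue."""
    p = probs[position]
    return math.log(p[mutant]) - math.log(p[wild_type[position]])

def score_all():
    """Score every single substitution away from the wild type."""
    scores = {}
    for pos, wt in enumerate(wild_type):
        for aa in probs[pos]:
            if aa != wt:
                # Label such as "M1K": wild type, 1-based position, mutant.
                scores[f"{wt}{pos + 1}{aa}"] = mutation_score(pos, aa)
    return scores

def mean_score(mutations):
    """Average score over the mutation list of one variant."""
    table = score_all()
    return sum(table[m] for m in mutations) / len(mutations)

for name, value in sorted(score_all().items()):
    print(name, round(value, 3))
```

Under this assumption a negative score means the model considers the substitution less likely than the wild-type residue, so a lineage whose mutations average a low score is one the model finds unexpected.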

Citations: 0

Comparative Analysis of Deep Generative Model for Industrial Enzyme Design

IF 4 Zone 3 (Biology) Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-04-16 DOI: 10.2174/0115748936303223240404043202

Beibei Zhang, Qiaozhen Meng, Chengwei Ai, Guihua Duan, Ercheng Wang, Fei Guo

Although enzymes catalyze reactions efficiently, natural enzymes lack stability in industrial environments and may not even catalyze the required reactions. This creates an urgent need for de novo design of new enzymes. Computational design is a powerful tool for this purpose: it allows rapid and efficient exploration of sequence space and facilitates the design of novel enzymes tailored to specific conditions and requirements. Currently, the only tool designed explicitly for enzyme generation performs unsatisfactorily. We therefore selected several general protein sequence design tools and systematically evaluated their effectiveness when applied to specific industrial enzymes. After surveying the literature on protein generation, we grouped the computational methods for sequence generation into three categories: structure-conditional sequence generation, sequence generation without structural constraints, and co-generation of sequence and structure. To evaluate the ability of six computational tools to generate enzyme sequences, we first constructed a luciferase dataset named Luc_64. We then assessed the quality of the enzyme sequences these methods generated on this dataset, including amino acid distribution and EC number validation. We also assessed sequences generated by structure-based methods on existing public datasets, using sequence recovery rate and root-mean-square deviation (RMSD) to cover both the sequence and the structure perspective. On the functionality dataset Luc_64, ABACUS-R and ProteinMPNN stood out for producing sequences whose amino acid distributions and functions closely match those of naturally occurring luciferases, suggesting that they preserve essential enzymatic characteristics. Across both benchmark datasets, ABACUS-R and ProteinMPNN also exhibited the highest sequence recovery rates, indicating a superior ability to generate sequences that closely resemble the original enzymes. Our study provides a reference for researchers selecting enzyme sequence design tools, highlighting the strengths and limitations of each tool in generating accurate and functional enzyme sequences. ProteinMPNN and ABACUS-R emerged as the most effective tools in our evaluation, offering high accuracy in sequence recovery and RMSD and maintaining the functional integrity of enzymes through accurate amino acid distribution. Our industrial enzyme benchmark also provides a fair evaluation of how well general protein design tools transfer to specific industrial enzymes.
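
The two structure-based metrics named in this abstract have standard definitions, sketched below. The function names and example inputs are hypothetical, and the RMSD helper assumes the two coordinate sets are already superimposed and matched atom by atom; it does not perform the optimal alignment step that a full structural comparison would need.

```python
import math

def sequence_recovery(native, designed):
    """Fraction of positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    matches = sum(1 for a, b in zip(native, designed) if a == b)
    return matches / len(native)

def rmsd(coords_a, coords_b):
    """RMSD between two coordinate lists that are already superimposed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords_a, coords_b):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical inputs for illustration only.
print(sequence_recovery("MKLV", "MKIV"))
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```

A higher recovery rate means the designed sequence is closer to the native one, while a lower RMSD means the predicted structure of the design is closer to the target backbone.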

Citations: 0

An Effective Method to Identify Cooperation Driver Gene Sets

IF 4 Zone 3 (Biology) Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-04-15 DOI: 10.2174/0115748936293238240313081211

Wei Zhang, Yifu Zeng, Bihai Zhao, Jie Xiong, Tuanfei Zhu, Jingjing Wang, Guiji Li, Lei Wang

Background: In cancer genomics research, identifying driver genes is a challenging task. Detecting cancer driver genes can further our understanding of cancer risk factors and promote the development of personalized treatments. Gene mutations exhibit both mutual exclusivity and co-occurrence. Most existing methods focus on mutual exclusivity, identifying driver pathways or driver gene sets that are functionally redundant, whereas cooperating genes with co-occurring mutations have received less attention. Objective: We propose a method that combines two characteristics of genes, co-occurring mutations and the coordinated regulation of proliferation genes, to explore cooperation driver genes. Methods: The study has three stages: (1) constructing a binary gene mutation matrix; (2) using mutation co-occurrence characteristics to identify candidate cooperation gene sets; and (3) constructing a gene regulation network to screen for cooperation gene sets that synergistically regulate proliferation. Results: The method was evaluated on three TCGA cancer datasets, and the experiments showed that it can detect effective cooperation driver gene sets. Further investigation showed that the discovered co-driver gene sets can be used to generate prognostic classifications, which may be biologically significant and provide information complementary to the cancer genome. Conclusion: Our approach is effective in identifying sets of cancer cooperation driver genes, and the results can be used as clinical markers to stratify patients.
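
Stages (1) and (2) of the method can be sketched on toy data. The patient records, the pairwise restriction, and the threshold of two shared patients are all assumptions made for illustration; the abstract does not specify how co-occurrence is tested, and stage (3), the regulation network screen, is not covered here.

```python
from itertools import combinations

# Hypothetical patient -> mutated genes records (illustrative only).
mutations = {
    "P1": {"TP53", "KRAS"},
    "P2": {"TP53", "KRAS", "PIK3CA"},
    "P3": {"PIK3CA"},
    "P4": {"TP53"},
}

genes = sorted(set().union(*mutations.values()))
patients = sorted(mutations)

# Stage 1: binary matrix, rows = patients, columns = genes.
matrix = [[1 if g in mutations[p] else 0 for g in genes] for p in patients]

def co_occurrence(gene_a, gene_b):
    """Number of patients carrying mutations in both genes."""
    i, j = genes.index(gene_a), genes.index(gene_b)
    return sum(row[i] * row[j] for row in matrix)

# Stage 2 (simplified): keep pairs mutated together in at least 2 patients.
candidates = [
    pair for pair in combinations(genes, 2) if co_occurrence(*pair) >= 2
]
print(candidates)
```

A raw count threshold is the simplest possible filter. A statistical test of association between two columns of the matrix would be a natural replacement when cohort sizes and mutation frequencies vary.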

Citations: 0

Integrated Somatic Mutation Network Diffusion Model for Stratification of Breast Cancer into Different Metabolic Mutation Subtypes

IF 4 Zone 3 (Biology) Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-04-15 DOI: 10.2174/0115748936298012240322091111

Dongqing Su, Honghao Li, Tao Wang, Min Zou, Haodong Wei, Yuqiang Xiong, Hongmei Sun, Shiyuan Wang, Qilemuge Xi, Yongchun Zuo, Lei Yang

Background: Mutations in metabolism-related genes in somatic cells can disrupt metabolic pathways, which results in patients exhibiting different molecular and pathological features. Objective: In this study, we focused on somatic mutation data to investigate the significance of metabolic mutation typing in guiding the prognosis and treatment of breast cancer patients. Methods: The somatic mutation profile of breast cancer patients was smoothed with a network diffusion model over the protein-protein interaction network to construct a comprehensive somatic mutation network diffusion profile. A deep clustering approach was then used to explore metabolic mutation typing in breast cancer, based on integrated metabolic pathway information and the somatic mutation network diffusion profile. In addition, we used deep neural networks and machine learning prediction models to assess the feasibility of predicting drug responses from somatic mutation network diffusion profiles. Results: The metabolic mutation subtypes differed significantly in prognosis and metabolic heterogeneity and were characterized by distinct alterations in metabolic pathways and genetic mutations; these mutational features offer potential targets for subtype-specific therapies. Furthermore, the results of the drug response prediction model built on the somatic mutation network diffusion profile were strongly consistent with the observed drug responses. Conclusion: Metabolic mutation typing of cancer assists in guiding patient prognosis and treatment.
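
The abstract does not give the diffusion formula. One widely used form of network smoothing is a random walk with restart, in which each gene repeatedly shares its score with its neighbours while being pulled back toward the original binary profile. The sketch below assumes that form; the three-gene network, the `alpha` value, and the step count are hypothetical choices for illustration.

```python
# Toy PPI network as an adjacency list (hypothetical genes and edges).
network = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B"],
}

def diffuse(profile, alpha=0.5, steps=50):
    """Random walk with restart: F <- alpha * F * W + (1 - alpha) * F0.

    W is the degree-normalised adjacency matrix, so every gene splits its
    current score equally among its neighbours at each step.
    """
    current = dict(profile)
    for _ in range(steps):
        spread = {gene: 0.0 for gene in network}
        for gene, neighbours in network.items():
            share = current[gene] / len(neighbours)
            for n in neighbours:
                spread[n] += share
        current = {
            gene: alpha * spread[gene] + (1 - alpha) * profile[gene]
            for gene in network
        }
    return current

# One patient with a single mutated gene.
binary_profile = {"A": 1.0, "B": 0.0, "C": 0.0}
smoothed = diffuse(binary_profile)
print(smoothed)
```

After diffusion, genes that are not mutated but sit close to mutated genes in the network receive non-zero scores, which makes sparse mutation profiles from different patients easier to compare and cluster.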

Citations: 0

GB5mCPred: Cross-species 5mc Site Predictor Based on Bootstrap-based Stochastic Gradient Boosting Method for Poaceae

IF 4 Zone 3 (Biology) Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-04-15 DOI: 10.2174/0115748936285544231221113226

Dipro Sinha, Tanwy Dasmandal, Md Yeasin, D.C Mishra, Anil Rai, Sunil Archak

Background: One of the most prevalent epigenetic alterations in all three kingdoms of life is 5mC, which plays a part in a wide range of biological functions. Although in vitro techniques are more effective in detecting epigenetic alterations, they are time-consuming and costly. Artificial intelligence-based in silico approaches have been used to overcome these obstacles. Aim: This study aimed to develop an ML-based predictor for the detection of 5mC sites in Poaceae. Objective: The objective was to evaluate machine learning and deep learning models for the prediction of 5mC sites in rice. Method: DNA sequences were vectorized using three distinct feature sets: Oligo Nucleotide Frequencies (k = 2), Mono-nucleotide Binary Encoding, and Chemical Properties of Nucleotides. Two deep learning models, long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM), and nine machine learning models (random forest, gradient boosting, naïve Bayes, regression tree, k-nearest neighbour, support vector machine, AdaBoost, multiple logistic regression, and artificial neural network) were investigated. Bootstrap resampling was used to build more efficient models, together with a hybrid feature selection module for dimensionality reduction and removal of irrelevant features from the vector space. Result: Random Forest achieved the highest accuracy (92.6%), specificity (86.41%), and MCC (0.84), and Gradient Boosting achieved the highest sensitivity (96.85%). The Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) showed that the best three models for accurate prediction of 5mC sites in rice were Random Forest, Gradient Boosting, and Support Vector Machine. We developed an R package, ‘GB5mCPred’, which is available on CRAN (https://cran.r-project.org/web/packages/GB5mcPred/index.html). A user-friendly prediction server based on this algorithm is also available (http://cabgrid.res.in:5474/). Conclusion: With nearly equal TOPSIS scores, Random Forest, Gradient Boosting, and Support Vector Machine were the best three models. The major rationale may lie in their architectural design, since they are gradual learning models that can capture 5mC sites more correctly than other learning models.
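
Two of the three feature sets named in the Method can be sketched directly. The sketch assumes that "Oligo Nucleotide Frequencies (k = 2)" means relative frequencies of overlapping dinucleotides and that "Mono-nucleotide Binary Encoding" means one-hot encoding over A, C, G, T; neither detail is spelled out in the abstract. The function names are hypothetical, inputs are assumed to be uppercase with at least two bases, and the chemical-property features are omitted.

```python
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]

def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping 2-mer (k = 2), 16 values."""
    counts = {d: 0 for d in DINUCLEOTIDES}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DINUCLEOTIDES]

def binary_encoding(seq):
    """One-hot encode each nucleotide, 4 values per position."""
    encoded = []
    for base in seq:
        encoded.extend(1 if base == b else 0 for b in BASES)
    return encoded

def feature_vector(seq):
    """Concatenate both feature sets for one fixed-length sequence window."""
    return dinucleotide_frequencies(seq) + binary_encoding(seq)

print(feature_vector("ACGT"))
```

The frequency features have a fixed length of 16 regardless of sequence length, while the one-hot features grow with the window, so all input windows need the same length before the vectors are passed to a classifier.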

Citations: 0

DiffSeqMol: A Non-Autoregressive Diffusion-Based Approach for Molecular Sequence Generation and Optimization

IF 4 Zone 3 Biology Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-04-03 DOI: 10.2174/0115748936285493240307071916

Zixu Wang, Yangyang Chen, Xiulan Guo, Yayang Li, Pengyong Li, Chunyan Li, Xiucai Ye, Tetsuya Sakurai

Background: The application of deep generative models to molecular discovery has surged in recent years. Currently, the field of molecular generation and molecular optimization is dominated by autoregressive models, regardless of how molecular data are represented. However, an emerging paradigm in the generation domain is diffusion models, which treat data non-autoregressively and have achieved significant breakthroughs in areas such as image generation. Methods: The potential and capability of diffusion models in molecular generation and optimization tasks remain largely unexplored. To investigate the applicability of diffusion models to molecular exploration, we proposed DiffSeqMol, a molecular sequence generation model underpinned by a diffusion process. Results & Discussion: DiffSeqMol distinguishes itself from traditional autoregressive methods by its capacity to draw samples from random noise and directly generate the entire molecule. Through experimental evaluations, we demonstrated that DiffSeqMol can match, and even surpass, the performance of established state-of-the-art models on unconditional generation tasks and molecular optimization tasks. Conclusion: Taken together, our results show that DiffSeqMol can be considered a promising molecular generation method. It opens new pathways to traverse the expansive chemical space and to discover novel molecules.

Citations: 0

Research on the Mechanism of Traditional Chinese Medicine Treatment for Diseases caused by Human Coronavirus COVID-19

IF 4 Zone 3 Biology Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-04-02 DOI: 10.2174/0115748936292599240308102616

Xian-Fang Wang, Chong-Yang Ma, Zhi-Yong Du, Yi-Feng Liu, Shao-Hui Ma, Sang Yu, Rui-xia Jin, Dong-qing Wei

Background: Human coronaviruses are a large group of viruses that exist widely in nature and multiply through self-replication. Because of their sudden emergence and variability, they pose a great threat to global human health and are a major problem currently faced by the medical and health fields. Objective: The virus that causes COVID-19 is the seventh known coronavirus that can infect humans. The main purpose of this paper is to analyze the effective components and action targets of the Longyi-Zhengqi Formula and the Lianhua-Qingwen Formula, study their mechanisms of action in the treatment of novel coronavirus pneumonia, compare the similarities and differences of their pharmacological effects, and elucidate the pharmacodynamic mechanisms of the two traditional Chinese medicine compounds. Method: The effective ingredients and targets of the Longyi-Zhengqi Formula and the Lianhua-Qingwen Formula were obtained from ETCM (Encyclopedia of Traditional Chinese Medicine) and other traditional Chinese medicine databases, and the COVID-19-related targets were obtained from the GeneCards database. Cytoscape was used to build a component-COVID-19 target network for each formula. STRING was used to construct a protein-protein interaction (PPI) network and screen key targets. GO (Gene Ontology) enrichment analysis and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis were used to identify the targets and pathways related to the treatment of COVID-19. Results: In the GO enrichment analysis, the intersection PPI network targets of Longyi-Zhengqi Formula and COVID-19 covered 106 biological processes, 31 cellular localizations, and 28 molecular functions, while those of Lianhua-Qingwen Formula and COVID-19 covered 224 biological processes, 51 cellular localizations, and 55 molecular functions. In the KEGG pathway analysis, Longyi-Zhengqi Formula had 7 targets on the COVID-19 pathway, and Lianhua-Qingwen Formula had 19. The regulation analysis indicated that Longyi-Zhengqi Formula treats COVID-19 by regulating IL-6, and that Lianhua-Qingwen Formula treats pneumonia by regulating TLR4. Conclusion: This paper explores the mechanisms of action of Longyi-Zhengqi Formula and Lianhua-Qingwen Formula in treating COVID-19 using network pharmacology, and provides a theoretical basis, in terms of drug targets and disease interactions, for traditional Chinese medicine treatment of sudden diseases caused by human coronaviruses. It has certain practical significance.

Citations: 0

A Novel Machine-learning Model to Classify Schizophrenia Using Methylation Data Based on Gene Expression

IF 4 Zone 3 Biology Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-03-11 DOI: 10.2174/0115748936293407240222113019

Karthikeyan A. Vijayakumar, Gwang-Won Cho

Introduction: Recent advances in artificial intelligence have compelled medical research to adopt these technologies. The abundance of molecular data, combined with AI technology, has helped explain various diseases, including cancers. Schizophrenia is a complex neuropsychological disease whose etiology is unknown. Several genome-wide association studies have attempted to narrow down the cause of the disease but have not identified the mechanism behind it. There are studies on epigenetic changes in schizophrenia, and a classification machine-learning model has been trained using blood methylation data. Method: In this study, we demonstrated a novel approach to elucidating the molecular cause of the disease. We used a two-step machine-learning approach to determine the causal molecular markers. In doing so, we developed classification models using both gene expression microarray and methylation microarray data. Result: Owing to this approach, our models achieved good classification accuracy with the available data size. We analyzed the important features, and they add to the evidence for the glutamate hypothesis of schizophrenia. Conclusion: In this way, we have demonstrated that machine learning models can help explain a disease.

Citations: 0

An Extended Feature Representation Technique for Predicting Sequenced-based Host-pathogen Protein-protein Interaction

IF 4 Zone 3 Biology Q3 BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics

Pub Date : 2024-03-11 DOI: 10.2174/0115748936286848240108074303

Jerry Emmanuel, Itunuoluwa Isewon, Grace Olasehinde, Jelili Oyelade

Background: The use of machine learning models in sequence-based Protein-Protein Interaction prediction typically requires the conversion of amino acid sequences into feature vectors. In the literature, two approaches have been used to achieve this transformation: the Independent Protein Feature (IPF) and Merged Protein Feature (MPF) extraction methods. Most studies have adopted the IPF approach, while others have preferred the MPF method, in which host and pathogen sequences are concatenated before feature encoding. Objective: This presents the challenge of determining which approach should be adopted for improved Host-Pathogen Protein-Protein Interaction (HPPPI) prediction. Therefore, this work introduces the Extended Protein Feature (EPF) method. Methods: The proposed method combines the predictive capabilities of IPF and MPF, extracting essential features, handling multicollinearity, and removing features with zero importance. EPF, IPF, and MPF were tested using bacteria, parasite, virus, and plant HPPPI datasets and were deployed to machine learning models, including Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Naïve Bayes (NB), Logistic Regression (LR), and Deep Forest (DF). Results: The results indicated that MPF exhibited the lowest performance overall, whereas IPF performed better with decision tree-based models, such as RF and DF. In contrast, EPF demonstrated improved performance with SVM, LR, NB, and MLP and also yielded competitive results with DF and RF. Conclusion: The EPF approach developed in this study exhibits substantial improvements in four out of the six models evaluated. This suggests that EPF is competitive with IPF and is particularly well-suited for traditional machine learning models.
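
The difference between the IPF and MPF extraction methods described above can be sketched in Python as follows. Amino acid composition is used here only as a stand-in encoder, and all function names are hypothetical; the abstract does not specify the descriptors used or how EPF combines the two representations, so EPF itself is not implemented.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq):
    """Stand-in encoder (an assumption): fraction of each standard residue."""
    seq = seq.upper()
    n = len(seq)
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def ipf_features(host, pathogen):
    """Independent: encode each protein separately, then join the vectors."""
    return aa_composition(host) + aa_composition(pathogen)


def mpf_features(host, pathogen):
    """Merged: concatenate the two sequences first, then encode once."""
    return aa_composition(host + pathogen)
```

With this encoder, the independent variant produces a 40-dimensional vector that keeps host and pathogen information apart, whereas the merged variant produces a single 20-dimensional vector in which the two proteins are no longer distinguishable.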

Citations: 0