
Journal of Computational Biology: Latest Articles

Building Explainable Graph Neural Network by Sparse Learning for the Drug-Protein Binding Prediction.
IF 1.6; Zone 4, Biology; Q4, Biochemical Research Methods; Pub Date: 2025-07-01; Epub Date: 2025-06-12; DOI: 10.1089/cmb.2025.0074
Yang Wang, Zanyu Shi, Pathum Weerawarna, Kun Huang, Timothy Richardson, Yijie Wang

Explainable Graph Neural Networks have been developed and applied to drug-protein binding prediction to identify the key chemical structures in a drug that have active interactions with the target proteins. However, the key structures identified by the current explainable Graph Neural Network (GNN) models are typically chemically invalid. Furthermore, a threshold must be manually selected to pinpoint the key structures from the rest. To overcome the limitations of the current explainable GNN models, we propose SLGNN, which stands for using Sparse Learning to Graph Neural Networks. It relies on using a chemical-substructure-based graph to represent a drug molecule. Furthermore, SLGNN incorporates generalized fused lasso with message-passing algorithms to identify connected subgraphs that are critical for the drug-protein binding prediction. Due to the use of the chemical-substructure-based graph, it is guaranteed that any subgraphs in a drug identified by SLGNN are chemically valid structures. These structures can be further interpreted as the key chemical structures for the drug to bind to the target protein. Our code is available at https://github.com/yw109iu/Explainable_GNN. We test SLGNN and the state-of-the-art competing methods on three real-world drug-protein binding datasets. We have demonstrated that the key structures identified by our SLGNN are chemically valid and have more predictive power.
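The abstract does not give the SLGNN objective, so the following is only a minimal sketch of the standard generalized fused lasso penalty on a graph, assuming one learnable importance weight per chemical substructure node. The function names, the weights `lam_sparse` and `lam_fuse`, and the toy graph are illustrative assumptions, not taken from the paper or its repository.

```python
def generalized_fused_lasso_penalty(weights, edges, lam_sparse, lam_fuse):
    """Standard generalized fused lasso penalty on a graph.

    weights    : one importance weight per node (here: per substructure)
    edges      : list of (i, j) index pairs for connected nodes
    lam_sparse : strength of the L1 term that pushes weights to zero
    lam_fuse   : strength of the term that pushes neighbours to agree
    """
    sparsity = sum(abs(w) for w in weights)
    fusion = sum(abs(weights[i] - weights[j]) for i, j in edges)
    return lam_sparse * sparsity + lam_fuse * fusion


def selected_nodes(weights, tol=1e-8):
    """Nodes whose weight is non-zero; no hand-picked threshold on scores."""
    return [i for i, w in enumerate(weights) if abs(w) > tol]


# Hypothetical 4-node chain of substructures: 0 - 1 - 2 - 3
toy_weights = [0.0, 1.0, 1.0, 0.0]
toy_edges = [(0, 1), (1, 2), (2, 3)]
penalty = generalized_fused_lasso_penalty(toy_weights, toy_edges, 1.0, 0.5)
kept = selected_nodes(toy_weights)
```

The L1 term drives most weights exactly to zero, and the fusion term favours neighbouring nodes sharing the same value, which is why the surviving nodes tend to form connected subgraphs.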

Pages: 632-645; Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12259411/pdf/
Citations: 0
Combined Topological Data Analysis and Geometric Deep Learning Reveal Niches by the Quantification of Protein Binding Pockets.
IF 1.4; Zone 4, Biology; Q4, Biochemical Research Methods; Pub Date: 2025-07-01; Epub Date: 2025-05-28; DOI: 10.1089/cmb.2025.0076
Peiran Jiang, Jose Lugo-Martinez

Protein pockets are essential for many proteins to carry out their functions. Locating and measuring protein pockets, as well as studying the anatomy of pockets, helps us further understand protein function. Most research studies focus on learning either local or global information from protein structures. However, there is a lack of studies that leverage the power of integrating both local and global representations of these structures. In this work, we combine topological data analysis (TDA) and geometric deep learning (GDL) to analyze the putative protein pockets of enzymes. TDA captures blueprints of the global topological invariant of protein pockets, whereas GDL decomposes the fingerprints into building blocks of these pockets. This integration of local and global views provides a comprehensive and complementary understanding of the protein structural motifs (niches for short) within protein pockets. We also analyze the distribution of the building blocks making up the pocket and profile the predictive power of coupling local and global representations for the task of discriminating between enzymes and nonenzymes, as well as predicting the enzyme class. We demonstrate that our representation learning framework for macromolecules is particularly useful when the structure is known, and the scenarios heavily rely on local and global information.
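The abstract does not say which topological invariants are computed. As a rough illustration of the kind of global quantity TDA tracks, the sketch below counts connected components (the zeroth Betti number) of a point cloud at a fixed distance scale `eps`, using union-find. The function name and the toy coordinates are assumptions made for this example.

```python
import math


def count_components(points, eps):
    """Number of connected components when points closer than eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(points))})


# Hypothetical atom coordinates: two tight pairs separated by a gap of 4 units.
toy_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
components_by_scale = {eps: count_components(toy_points, eps) for eps in (0.5, 1.5, 4.5)}
```

Persistent homology generalizes this idea by recording how such counts (and higher-dimensional features like loops and voids) appear and disappear as the scale grows.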

Pages: 659-674
Citations: 0
CNB-MAC 2023 Special Issue.
IF 1.4; Zone 4, Biology; Q4, Biochemical Research Methods; Pub Date: 2025-07-01; Epub Date: 2025-06-02; DOI: 10.1089/cmb.2025.0141
Anna Ritz
Pages: 631
Citations: 0
The 2nd International Workshop on Pattern Recognition in Healthcare Analytics 2023 Preface.
IF 1.4; Zone 4, Biology; Q4, Biochemical Research Methods; Pub Date: 2025-06-01; Epub Date: 2025-05-13; DOI: 10.1089/cmb.2025.0117
Inci M Baytas
Pages: 557
Citations: 0
FedOpenHAR: Federated Multitask Transfer Learning for Sensor-Based Human Activity Recognition.
IF 1.4; Zone 4, Biology; Q4, Biochemical Research Methods; Pub Date: 2025-06-01; Epub Date: 2025-04-23; DOI: 10.1089/cmb.2024.0631
Egemen İşgüder, Özlem Durmaz İncel

Wearable and mobile devices equipped with motion sensors offer important insights into user behavior. Machine learning and, more recently, deep learning techniques have been applied to analyze sensor data. Typically, the focus is on a single task, such as human activity recognition (HAR), and the data is processed centrally on a server or in the cloud. However, the same sensor data can be leveraged for multiple tasks, and distributed machine learning methods can be employed without the need for transmitting data to a central location. In this study, we introduce the FedOpenHAR framework, which explores federated transfer learning in a multitask setting for both sensor-based HAR and device position identification tasks. This approach utilizes transfer learning by training task-specific and personalized layers in a federated manner. The OpenHAR framework, which includes ten smaller datasets, is used for training the models. The main challenge is developing robust models that are applicable to both tasks across different datasets, which may contain only a subset of label types. Multiple experiments are conducted in the Flower federated learning environment using the DeepConvLSTM architecture. Results are presented for both federated and centralized training under various parameters and constraints. By employing transfer learning and training task-specific and personalized federated models, we achieve a higher accuracy (72.4%) compared to a fully centralized training approach (64.5%), and similar accuracy to a scenario where each client performs individual training in isolation (72.6%). However, the advantage of FedOpenHAR over individual training is that, when a new client joins with a new label type (representing a new task), it can begin training from the already existing common layer. Furthermore, if a new client wants to classify a new class in one of the existing tasks, FedOpenHAR allows training to begin directly from the task-specific layers.
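The split between a shared common layer and client-specific layers can be sketched as follows. The abstract does not state the aggregation rule, so this example assumes plain sample-weighted federated averaging applied only to the shared parameters. The layer names `common` and `head`, the parameter values, and the client sizes are all hypothetical.

```python
def federated_average(client_states, client_sizes, shared_keys):
    """Sample-weighted average of the shared parameters only.

    client_states : list of dicts mapping layer name -> list of floats
    client_sizes  : number of training samples held by each client
    shared_keys   : layer names that are aggregated on the server
    """
    total = sum(client_sizes)
    averaged = {}
    for key in shared_keys:
        length = len(client_states[0][key])
        averaged[key] = [
            sum(state[key][k] * n for state, n in zip(client_states, client_sizes)) / total
            for k in range(length)
        ]
    return averaged


def apply_shared(state, averaged):
    """Overwrite shared layers with the server average; keep local layers."""
    updated = dict(state)
    updated.update(averaged)
    return updated


# Two hypothetical clients with a shared layer and a personalized head.
client_a = {"common": [1.0, 2.0], "head": [10.0]}
client_b = {"common": [3.0, 6.0], "head": [20.0]}
server_common = federated_average([client_a, client_b], [1, 3], ["common"])
new_a = apply_shared(client_a, server_common)
```

Under this assumed scheme, a newly joining client could initialize its `common` layer from `server_common` and train only its own head, which matches the advantage the abstract describes.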

Pages: 558-572
Citations: 0
Generating Heterogeneous Data on Gene Trees.
IF 1.4; Zone 4, Biology; Q4, Biochemical Research Methods; Pub Date: 2025-06-01; Epub Date: 2025-05-09; DOI: 10.1089/cmb.2024.0843
Martí Cortada Garcia, Adrià Diéguez Moscardó, Marta Casanellas

We introduce GenPhylo, a Python module that simulates nucleotide sequence data along a phylogeny without being restricted to continuous-time Markov processes. GenPhylo directly uses a general Markov model and therefore naturally incorporates heterogeneity across lineages. We solve the challenge of generating transition matrices with a pre-given expected number of substitutions (the branch length information) by providing an algorithm that can be incorporated in other simulation software.
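To make the link between a transition matrix and its branch length concrete, the sketch below uses the classical log-determinant approximation, in which the expected number of substitutions per site is taken as -(1/4) log det M for a 4x4 matrix M. Whether GenPhylo uses exactly this measure is an assumption here, and none of the function names come from the GenPhylo API. The Jukes-Cantor matrix serves only as a check, because for it the formula recovers the branch length exactly.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def branch_length(transition_matrix):
    """Log-det estimate of expected substitutions per site (assumed measure)."""
    det = determinant(transition_matrix)
    if det <= 0.0:
        raise ValueError("log-det branch length needs a positive determinant")
    return -0.25 * math.log(det)


def jukes_cantor_matrix(length):
    """Jukes-Cantor transition matrix for a given branch length."""
    decay = math.exp(-4.0 * length / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


recovered = branch_length(jukes_cantor_matrix(0.3))
```

A simulator built on a general Markov model has to solve the inverse problem: sample a stochastic matrix that is not constrained to a continuous-time form yet has a prescribed value of this quantity.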

Pages: 626-630
Citations: 0
Generative Adversarial Networks for Neuroimage Translation.
IF 1.4; Zone 4, Biology; Q4, Biochemical Research Methods; Pub Date: 2025-06-01; Epub Date: 2024-12-27; DOI: 10.1089/cmb.2024.0635
Cassandra Czobit, Reza Samavi

Image-to-image translation has gained popularity in the medical field to transform images from one domain to another. Medical image synthesis via domain transformation is advantageous in its ability to augment an image dataset where images for a given class are limited. From the learning perspective, this process contributes to the data-oriented robustness of the model by inherently broadening the model's exposure to more diverse visual data and enabling it to learn more generalized features. In the case of generating additional neuroimages, it is advantageous to obtain unidentifiable medical data and augment smaller annotated datasets. This study proposes the development of a cycle-consistent generative adversarial network (CycleGAN) model for translating neuroimages from one field strength to another (e.g., 3 Tesla [T] to 1.5 T). This model was compared with a model based on a deep convolutional GAN model architecture. CycleGAN was able to generate the synthetic and reconstructed images with reasonable accuracy. The mapping function from the source (3 T) to the target domain (1.5 T) performed optimally with an average peak signal-to-noise ratio value of 25.69 ± 2.49 dB and a mean absolute error value of 2106.27 ± 1218.37. The code for this study has been made publicly available in a GitHub repository (footnote a in the original article).
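The two reported metrics can be computed as in the sketch below, which uses the usual definitions: mean absolute error as the average absolute voxel difference, and peak signal-to-noise ratio as 10 log10(MAX^2 / MSE). The intensity range behind the paper's numbers is not given in the abstract, so `max_value` and the toy voxel values are assumptions for illustration.

```python
import math


def mean_absolute_error(reference, generated):
    """Average absolute difference between two equally sized voxel lists."""
    return sum(abs(r - g) for r, g in zip(reference, generated)) / len(reference)


def peak_signal_to_noise_ratio(reference, generated, max_value):
    """PSNR in dB; max_value is the largest possible intensity (assumed known)."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened 8-bit image patches.
real_patch = [0, 100, 200, 255]
fake_patch = [0, 110, 190, 255]
mae = mean_absolute_error(real_patch, fake_patch)
psnr = peak_signal_to_noise_ratio(real_patch, fake_patch, max_value=255)
```

Both metrics depend on the intensity scale, so values are comparable only between images normalized in the same way.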

Pages: 573-583
Citations: 0
Effective Integration of Single-Cell Multi-Omics Data Using Improved Network-Based Integrative Clustering with Multigraph Regularization.
IF 1.4; Zone 4, Biology; Q4, Biochemical Research Methods; Pub Date: 2025-06-01; Epub Date: 2025-05-22; DOI: 10.1089/cmb.2023.0460
Shunqin Zhang, Wei Kong, Shuaiqun Wang, Kai Wei, Kun Liu, Gen Wen, Yaling Yu

The purpose of integrating different omics data is to study cellular heterogeneity at the level of transcriptional regulation from different gene levels, which can effectively identify cell types and reveal the pathogenesis of Alzheimer's disease (AD) from two perspectives. However, implementing such algorithms faces challenges such as high data noise levels, increased dimensionality, and computational complexity. In this study, multigraph regularization constraints were introduced in the network-based integrative clustering algorithm (MGR-NIC) to remove redundant features and preserve the geometric structure underlying the data by fusing two types of data (snRNA-seq and snATAC-seq) of glial cells from AD samples. The effectiveness of the MGR-NIC algorithm was validated using both simulation datasets and real datasets derived from various tissues. The MGR-NIC algorithm can improve clustering accuracy by selecting features that better represent the dataset's structure. The clustering results obtained with the MGR-NIC algorithm show strong consistency with the clustering results inherent to the published DLPFC dataset, while the classification results generated using the NIC algorithm often lead to cluster overlap when applied to the DLPFC dataset. We comprehensively evaluate the proposed MGR-NIC algorithm against state-of-the-art algorithms, including NIC, scAI, Multi-Omics Factor Analysis v2, and JSNMF. Among these, MGR-NIC is the most stable and reliable method, which suggests that it is robust across different datasets and yields consistent and accurate results.
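The abstract does not spell out the regularization term, so the following is a sketch of the common form of a multigraph regularizer: a weighted sum, over several similarity graphs, of the smoothness penalty sum over edges of W_ij * ||h_i - h_j||^2, which for a symmetric W equals tr(H^T L H) with L the graph Laplacian. The function names, the per-graph weights `alphas`, and the toy data are assumptions.

```python
def graph_smoothness(embedding, adjacency):
    """Sum over undirected edges of weight * squared embedding distance."""
    n = len(embedding)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            weight = adjacency[i][j]
            if weight:
                sq_dist = sum((a - b) ** 2 for a, b in zip(embedding[i], embedding[j]))
                total += weight * sq_dist
    return total


def multigraph_regularizer(embedding, adjacencies, alphas):
    """Weighted sum of smoothness penalties, one per similarity graph."""
    return sum(
        alpha * graph_smoothness(embedding, adjacency)
        for adjacency, alpha in zip(adjacencies, alphas)
    )


# Three hypothetical cells with 1-D embeddings and two similarity graphs,
# e.g. one built from each omics layer.
cells = [[0.0], [1.0], [3.0]]
graph_rna = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]   # edge 0-1, weight 1
graph_atac = [[0, 0, 0], [0, 0, 2], [0, 2, 0]]  # edge 1-2, weight 2
value = multigraph_regularizer(cells, [graph_rna, graph_atac], [0.5, 0.25])
```

Minimizing such a term keeps cells that are neighbours in any of the graphs close in the learned embedding, which is the sense in which the geometric structure of the data is preserved.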

Pages: 601-614
Citations: 0
Exploring the Influence of Gene Networks on Driver Gene Classification.
IF 1.4; Zone 4, Biology; Q4, Biochemical Research Methods; Pub Date: 2025-06-01; Epub Date: 2025-05-13; DOI: 10.1089/cmb.2025.0043
Paulo Henrique Ribeiro, Jorge Francisco Cutigi, Rodrigo Henrique Ramos, Cynthia de Oliveira Lage Ferreira, Adriane Feijo Evangelista, Adenilso da Silva Simao

Cancer is a complex disease caused by mutations in the genome of cells. Genetic mutations can be divided into driver mutations, which are significant for the initiation and progression of cancer, and passenger mutations, which have a neutral effect. In recent years, computational methods have been developed to identify driver genes. Some of these methods use data from gene networks to classify the genes. However, the impact of different gene networks on the performance of these methods remains unexplored. This article aims to analyze the influence of genetic networks in driver gene classification. We analyzed driver gene classification methods that use gene networks as input data, using different cancer mutation datasets and distinct gene networks. Computational methods show significant variation in their results when different gene networks are employed. The results highlight the need to carefully interpret driver gene classification and emphasize the importance of using different gene networks. These findings underline the necessity of developing more robust computational approaches that account for network variability, ensuring greater reliability in driver gene identification and its applications in cancer research.

{"title":"Exploring the Influence of Gene Networks on Driver Gene Classification.","authors":"Paulo Henrique Ribeiro, Jorge Francisco Cutigi, Rodrigo Henrique Ramos, Cynthia de Oliveira Lage Ferreira, Adriane Feijo Evangelista, Adenilso da Silva Simao","doi":"10.1089/cmb.2025.0043","DOIUrl":"10.1089/cmb.2025.0043","url":null,"abstract":"<p><p>Cancer is a complex disease caused by mutations in the genome of cells. Genetic mutations can be divided into driver mutations, which are significant for the initiation and progression of cancer, and passenger mutations, which have a neutral effect. In recent years, computational methods have been developed to identify driver genes. Some of these methods use data from gene networks to classify the genes. However, the impact of different gene networks on the performance of these methods remains unexplored. This article aims to analyze the influence of genetic networks in driver gene classification. We analyzed driver gene classification methods that use gene networks as input data, using different cancer mutation datasets and distinct gene networks. Computational methods show significant variation in their results when different gene networks are employed. The results highlight the need to carefully interpret driver gene classification and emphasize the importance of using different gene networks. 
These findings underline the necessity of developing more robust computational approaches that account for network variability, ensuring greater reliability in driver gene identification and its applications in cancer research.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"615-625"},"PeriodicalIF":1.4,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143994363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Using Traditional and Deep Machine Learning to Predict Emergency Room Triage Levels.
IF 1.4 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-06-01 Epub Date: 2025-05-22 DOI: 10.1089/cmb.2024.0632
Mehmet Yıldırım, Savaş Sezik, Ayşe Başar

Accurate triage in emergency rooms is crucial for efficient patient care and resource allocation. We developed methods to predict triage levels using several traditional machine learning methods (logistic regression, random forest, XGBoost) and neural network-based deep learning approaches. These models were tested on a dataset from emergency department visits of patients at a local Turkish hospital; this dataset consists of both structured and unstructured data. Compared with previous work, our challenge was to build a predictive model that uses documents written in the Turkish language and that handles specific aspects of the Turkish medical system. Text embedding techniques such as Bag of Words, Word2Vec, and BERT-based embedding were used to process the unstructured patient complaints. We used a comprehensive set of features including patient history data and disease diagnosis within our predictive models, which included advanced neural network architectures such as convolutional neural networks, attention mechanisms, and long short-term memory networks. Our results revealed that BERT embeddings significantly enhanced the performance of neural network models, while Word2Vec embeddings showed slightly better results in traditional machine learning models. The most effective model was XGBoost combined with Word2Vec embeddings, achieving 86.7% AUC, 81.5% accuracy, and 68.7% weighted F1 score. We conclude that text embedding methods and machine learning methods are effective tools to predict emergency room triage levels. The integration of patient history into the models, alongside the strategic use of text embeddings, significantly improves predictive accuracy.
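
The abstract reports both accuracy and a weighted F1 score, and the two can diverge when triage classes are imbalanced. The sketch below is a minimal illustration of how a support-weighted F1 is commonly computed, assuming the standard definition (per-class F1 averaged with weights proportional to each class's true count). The function names and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1  # weight this class's F1 by its support
    return total / len(y_true)


# Hypothetical triage labels (illustrative only, not data from the paper).
y_true = [1, 2, 2, 3, 3, 3, 3, 3]
y_pred = [2, 2, 3, 3, 3, 3, 3, 3]

print(f"accuracy    = {accuracy(y_true, y_pred):.3f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```

In this toy case the rare class is never predicted correctly, so its F1 of zero pulls the weighted F1 below the accuracy, which is one way a gap between the two metrics can arise.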

Citations: 0