Pub Date: 2025-07-01; Epub Date: 2025-06-12; DOI: 10.1089/cmb.2025.0074
Yang Wang, Zanyu Shi, Pathum Weerawarna, Kun Huang, Timothy Richardson, Yijie Wang
Explainable Graph Neural Networks (GNNs) have been developed and applied to drug-protein binding prediction to identify the key chemical structures in a drug that actively interact with the target protein. However, the key structures identified by current explainable GNN models are typically chemically invalid. Furthermore, a threshold must be manually selected to separate the key structures from the rest. To overcome these limitations, we propose SLGNN, which applies Sparse Learning to Graph Neural Networks. SLGNN represents a drug molecule as a chemical-substructure-based graph and combines the generalized fused lasso with message-passing algorithms to identify connected subgraphs that are critical for drug-protein binding prediction. Because of the chemical-substructure-based graph, any subgraph identified by SLGNN is guaranteed to be a chemically valid structure, which can be further interpreted as a key chemical structure for the drug to bind to the target protein. Our code is available at https://github.com/yw109iu/Explainable_GNN. We test SLGNN and state-of-the-art competing methods on three real-world drug-protein binding datasets and demonstrate that the key structures identified by SLGNN are chemically valid and have more predictive power.
Title: Building Explainable Graph Neural Network by Sparse Learning for the Drug-Protein Binding Prediction. Journal of Computational Biology, pp. 632-645. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12259411/pdf/
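The abstract does not state the SLGNN objective, so the following is only a minimal sketch of the standard generalized fused lasso penalty on a graph, under the assumption that each substructure node carries one importance weight. The function name, the toy path graph, and the two regularization strengths are hypothetical. It illustrates why such a penalty favors connected selected subgraphs: the fusion term charges for every edge whose endpoints have different weights.

```python
def generalized_fused_lasso_penalty(weights, edges, lam_sparse, lam_fuse):
    """Return lam_sparse * sum_i |w_i| + lam_fuse * sum_{(i,j) in edges} |w_i - w_j|.

    weights: one importance value per graph node (hypothetical representation).
    edges:   list of (i, j) index pairs of the substructure graph.
    """
    sparsity = sum(abs(w) for w in weights)
    fusion = sum(abs(weights[i] - weights[j]) for i, j in edges)
    return lam_sparse * sparsity + lam_fuse * fusion


# Hypothetical path graph of four substructure nodes: 0 - 1 - 2 - 3
edges = [(0, 1), (1, 2), (2, 3)]
connected = [1.0, 1.0, 0.0, 0.0]  # two selected nodes forming one connected block
scattered = [1.0, 0.0, 1.0, 0.0]  # two selected nodes that are not adjacent

print(generalized_fused_lasso_penalty(connected, edges, 0.5, 1.0))  # 2.0
print(generalized_fused_lasso_penalty(scattered, edges, 0.5, 1.0))  # 4.0
```

Both selections have the same sparsity cost, but the scattered one crosses three "boundary" edges instead of one, so it is penalized more. How SLGNN couples this penalty with message passing is not described in the abstract and is not modeled here.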
Pub Date: 2025-07-01; Epub Date: 2025-05-28; DOI: 10.1089/cmb.2025.0076
Peiran Jiang, Jose Lugo-Martinez
Protein pockets are essential for many proteins to carry out their functions. Locating and measuring protein pockets, as well as studying the anatomy of pockets, helps us further understand protein function. Most research studies focus on learning either local or global information from protein structures. However, there is a lack of studies that leverage the power of integrating both local and global representations of these structures. In this work, we combine topological data analysis (TDA) and geometric deep learning (GDL) to analyze the putative protein pockets of enzymes. TDA captures blueprints of the global topological invariant of protein pockets, whereas GDL decomposes the fingerprints into building blocks of these pockets. This integration of local and global views provides a comprehensive and complementary understanding of the protein structural motifs (niches for short) within protein pockets. We also analyze the distribution of the building blocks making up the pocket and profile the predictive power of coupling local and global representations for the task of discriminating between enzymes and nonenzymes, as well as predicting the enzyme class. We demonstrate that our representation learning framework for macromolecules is particularly useful when the structure is known, and the scenarios heavily rely on local and global information.
Title: Combined Topological Data Analysis and Geometric Deep Learning Reveal Niches by the Quantification of Protein Binding Pockets. Journal of Computational Biology, pp. 659-674.
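The abstract does not say which TDA tool is used. As a hedged illustration of what a global topological summary of a point cloud can look like, the sketch below computes 0-dimensional persistent homology (connected components) for a Vietoris-Rips filtration, assuming persistent homology is the relevant technique and that a pocket is given as a list of coordinates. The function name and the sample points are hypothetical. It relies on the known fact that the finite H0 death times equal the edge lengths of a minimum spanning tree, found here with Kruskal's algorithm and a union-find structure.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Finite death times of connected components as the distance threshold grows.

    Every component is born at threshold 0; one component never dies and is
    therefore not listed. Death times are returned in increasing order.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (Kruskal's algorithm).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one of them dies
            deaths.append(length)
    return deaths


# Hypothetical 2D points standing in for pocket atom coordinates.
print(h0_death_times([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))  # [1.0, 4.0]
```

Higher-dimensional features (loops, voids), which are likely more informative for pockets, need a dedicated persistent homology library and are outside this sketch.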
Pub Date: 2025-07-01; Epub Date: 2025-06-02; DOI: 10.1089/cmb.2025.0141
Anna Ritz
Title: CNB-MAC 2023 Special Issue. Journal of Computational Biology, p. 631.
Pub Date: 2025-06-01; Epub Date: 2025-05-13; DOI: 10.1089/cmb.2025.0117
Inci M Baytas
Title: The 2nd International Workshop on Pattern Recognition in Healthcare Analytics 2023 Preface. Journal of Computational Biology, p. 557.
Pub Date: 2025-06-01; Epub Date: 2025-04-23; DOI: 10.1089/cmb.2024.0631
Egemen İşgüder, Özlem Durmaz İncel
Wearable and mobile devices equipped with motion sensors offer important insights into user behavior. Machine learning and, more recently, deep learning techniques have been applied to analyze sensor data. Typically, the focus is on a single task, such as human activity recognition (HAR), and the data is processed centrally on a server or in the cloud. However, the same sensor data can be leveraged for multiple tasks, and distributed machine learning methods can be employed without the need for transmitting data to a central location. In this study, we introduce the FedOpenHAR framework, which explores federated transfer learning in a multitask setting for both sensor-based HAR and device position identification tasks. This approach utilizes transfer learning by training task-specific and personalized layers in a federated manner. The OpenHAR framework, which includes ten smaller datasets, is used for training the models. The main challenge is developing robust models that are applicable to both tasks across different datasets, which may contain only a subset of label types. Multiple experiments are conducted in the Flower federated learning environment using the DeepConvLSTM architecture. Results are presented for both federated and centralized training under various parameters and constraints. By employing transfer learning and training task-specific and personalized federated models, we achieve a higher accuracy (72.4%) compared to a fully centralized training approach (64.5%), and similar accuracy to a scenario where each client performs individual training in isolation (72.6%). However, the advantage of FedOpenHAR over individual training is that, when a new client joins with a new label type (representing a new task), it can begin training from the already existing common layer. Furthermore, if a new client wants to classify a new class in one of the existing tasks, FedOpenHAR allows training to begin directly from the task-specific layers.
Title: FedOpenHAR: Federated Multitask Transfer Learning for Sensor-Based Human Activity Recognition. Journal of Computational Biology, pp. 558-572.
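The abstract describes a common layer shared by all clients alongside task-specific and personalized layers, but it does not state the aggregation rule. The sketch below assumes plain sample-weighted federated averaging applied only to the shared layer, with all other layers staying on the client. The layer names, the flat-list parameter format, and the function name are hypothetical, and the Flower API is deliberately not used.

```python
def federated_average(client_updates, shared_keys):
    """Sample-weighted average of the shared layers only.

    client_updates: list of (num_samples, {layer_name: [float, ...]}).
    shared_keys:    names of layers that every client holds and shares.
    Layers not listed in shared_keys are never aggregated.
    """
    total = sum(num_samples for num_samples, _ in client_updates)
    averaged = {}
    for key in shared_keys:
        size = len(client_updates[0][1][key])
        averaged[key] = [
            sum(n * params[key][k] for n, params in client_updates) / total
            for k in range(size)
        ]
    return averaged


# Two hypothetical clients with different tasks but one common layer.
clients = [
    (10, {"common": [1.0, 2.0], "har_head": [9.0]}),
    (30, {"common": [3.0, 6.0], "position_head": [7.0]}),
]
print(federated_average(clients, ["common"]))  # {'common': [2.5, 5.0]}
```

Under this assumption, a new client with a new label type would download only the averaged `common` parameters and attach its own freshly initialized head, which matches the benefit described in the abstract.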
Pub Date: 2025-06-01; Epub Date: 2025-05-09; DOI: 10.1089/cmb.2024.0843
Martí Cortada Garcia, Adrià Diéguez Moscardó, Marta Casanellas
We introduce GenPhylo, a Python module that simulates nucleotide sequence data along a phylogeny without the restriction of continuous-time Markov processes. GenPhylo directly uses a general Markov model and therefore naturally incorporates heterogeneity across lineages. We solve the challenge of generating transition matrices with a pre-given expected number of substitutions (the branch length information) by providing an algorithm that can be incorporated into other simulation software.
Title: Generating Heterogeneous Data on Gene Trees. Journal of Computational Biology, pp. 626-630.
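The abstract does not say how the branch length of a transition matrix is measured. A commonly used quantity for a 4×4 nucleotide transition matrix M is the log-det length, −(1/4)·log det M, which coincides with the expected number of substitutions per site for continuous-time models with uniform base frequencies. The sketch below assumes that definition and checks it on a Jukes-Cantor matrix; it is not GenPhylo's generation algorithm, and both function names are hypothetical.

```python
import numpy as np


def logdet_branch_length(M):
    """Return -(1/n) * log det(M) for an n x n transition matrix (n = 4 for DNA)."""
    M = np.asarray(M, dtype=float)
    sign, logabsdet = np.linalg.slogdet(M)
    if sign <= 0:
        raise ValueError("det(M) must be positive for the log-det length to exist")
    return -logabsdet / M.shape[0]


def jukes_cantor_matrix(t):
    """Jukes-Cantor transition matrix with t expected substitutions per site."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)  # probability of no net change
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)  # probability of each other base
    return np.full((4, 4), diff) + (same - diff) * np.eye(4)


M = jukes_cantor_matrix(0.3)
print(M.sum(axis=1))            # each row sums to 1
print(logdet_branch_length(M))  # approximately 0.3
```

A simulator that needs a matrix with a prescribed length could use a function like this as the acceptance check after sampling or rescaling a candidate matrix; whether GenPhylo proceeds that way is not stated in the abstract.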
Pub Date: 2025-06-01; Epub Date: 2024-12-27; DOI: 10.1089/cmb.2024.0635
Cassandra Czobit, Reza Samavi
Image-to-image translation has gained popularity in the medical field as a way to transform images from one domain to another. Medical image synthesis via domain transformation is advantageous because it can augment an image dataset where images for a given class are limited. From the learning perspective, this process contributes to the data-oriented robustness of the model by inherently broadening the model's exposure to more diverse visual data and enabling it to learn more generalized features. In the case of generating additional neuroimages, it is advantageous to obtain unidentifiable medical data and augment smaller annotated datasets. This study proposes the development of a cycle-consistent generative adversarial network (CycleGAN) model for translating neuroimages from one field strength to another (e.g., 3 Tesla [T] to 1.5 T). This model was compared with a model based on a deep convolutional GAN architecture. CycleGAN was able to generate the synthetic and reconstructed images with reasonable accuracy. The mapping function from the source (3 T) to the target domain (1.5 T) performed optimally with an average peak signal-to-noise ratio value of 25.69 ± 2.49 dB and a mean absolute error value of 2106.27 ± 1218.37. The code for this study has been made publicly available in a GitHub repository (footnote a in the original article).
Title: Generative Adversarial Networks for Neuroimage Translation. Journal of Computational Biology, pp. 573-583.
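The two image-quality metrics quoted in the abstract have standard definitions, sketched below with NumPy. The function names and the toy arrays are hypothetical, and the `data_range` argument (the maximum possible intensity) is an assumption the caller must supply, since the abstract does not state the intensity scale used.

```python
import numpy as np


def mean_absolute_error(reference, generated):
    """Average absolute intensity difference, in the units of the images."""
    reference = np.asarray(reference, dtype=float)
    generated = np.asarray(generated, dtype=float)
    return float(np.mean(np.abs(reference - generated)))


def peak_signal_to_noise_ratio(reference, generated, data_range):
    """PSNR in dB: 10 * log10(data_range**2 / MSE)."""
    reference = np.asarray(reference, dtype=float)
    generated = np.asarray(generated, dtype=float)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


# Toy example: a constant error of 0.1 on images scaled to [0, 1].
target = np.zeros((4, 4))
synthetic = np.full((4, 4), 0.1)
print(mean_absolute_error(target, synthetic))              # approximately 0.1
print(peak_signal_to_noise_ratio(target, synthetic, 1.0))  # approximately 20 dB
```

Because MAE is expressed in raw intensity units, its magnitude depends on how the images are scaled, whereas PSNR is normalized by `data_range`.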
Pub Date: 2025-06-01; Epub Date: 2025-05-22; DOI: 10.1089/cmb.2023.0460
Shunqin Zhang, Wei Kong, Shuaiqun Wang, Kai Wei, Kun Liu, Gen Wen, Yaling Yu
The purpose of integrating different omics data is to study cellular heterogeneity at the level of transcriptional regulation across different layers of gene information, which can effectively identify cell types and reveal the pathogenesis of Alzheimer's disease (AD) from two perspectives. However, implementing such algorithms faces challenges such as high data noise levels, increased dimensionality, and computational complexity. In this study, multigraph regularization constraints were introduced into the network-based integrative clustering (NIC) algorithm, yielding MGR-NIC, to remove redundant features and preserve the geometric structure underlying the data by fusing two types of data (snRNA-seq and snATAC-seq) of glial cells from AD samples. The effectiveness of the MGR-NIC algorithm was validated using both simulation datasets and real datasets derived from various tissues. The MGR-NIC algorithm can improve clustering accuracy by selecting features that better represent the dataset's structure. The clustering results obtained with the MGR-NIC algorithm show strong consistency with the clustering results inherent to the published DLPFC dataset, while the results generated using the NIC algorithm often lead to cluster overlap when applied to the DLPFC dataset. We comprehensively compare the proposed MGR-NIC algorithm with state-of-the-art algorithms, including NIC, scAI, Multi-Omics Factor Analysis v2, and JSNMF. Among these, MGR-NIC is the most stable and reliable method, implying its robustness across different datasets and its reliability in yielding consistent and accurate results.
Title: Effective Integration of Single-Cell Multi-Omics Data Using Improved Network-Based Integrative Clustering with Multigraph Regularization. Journal of Computational Biology, pp. 601-614.
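The abstract does not give the MGR-NIC objective. As a hedged illustration of what a multigraph regularization constraint typically looks like, the sketch below evaluates the standard graph-regularization term tr(HᵀLH), summed over one similarity graph per modality, where L = D − W is the graph Laplacian and each row of H is a cell's low-dimensional embedding. The function names, the per-graph weights, and the toy matrices are assumptions. The term equals ½·Σᵢⱼ Wᵢⱼ‖hᵢ − hⱼ‖², so it is small when cells that are neighbors in a graph receive similar embeddings.

```python
import numpy as np


def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W of a symmetric similarity matrix W."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W


def multigraph_penalty(H, graphs, weights):
    """Sum over graphs of weight * tr(H^T L H); rows of H are cell embeddings."""
    H = np.asarray(H, dtype=float)
    return float(
        sum(a * np.trace(H.T @ graph_laplacian(W) @ H) for a, W in zip(weights, graphs))
    )


# Three hypothetical cells embedded in two dimensions.
H = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W_rna = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]   # cells 0 and 1 are neighbors (snRNA-seq graph)
W_atac = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]  # cells 1 and 2 are neighbors (snATAC-seq graph)

print(multigraph_penalty(H, [W_rna, W_atac], [1.0, 0.5]))  # 1.0
```

Here the RNA graph contributes nothing because cells 0 and 1 share an embedding, while the ATAC graph is penalized for linking cells 1 and 2, whose embeddings differ.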
Pub Date: 2025-06-01; Epub Date: 2025-05-13; DOI: 10.1089/cmb.2025.0043
Paulo Henrique Ribeiro, Jorge Francisco Cutigi, Rodrigo Henrique Ramos, Cynthia de Oliveira Lage Ferreira, Adriane Feijo Evangelista, Adenilso da Silva Simao
Cancer is a complex disease caused by mutations in the genome of cells. Genetic mutations can be divided into driver mutations, which are significant for the initiation and progression of cancer, and passenger mutations, which have a neutral effect. In recent years, computational methods have been developed to identify driver genes. Some of these methods use data from gene networks to classify the genes. However, the impact of different gene networks on the performance of these methods remains unexplored. This article aims to analyze the influence of gene networks on driver gene classification. We analyzed driver gene classification methods that use gene networks as input data, using different cancer mutation datasets and distinct gene networks. The methods show significant variation in their results when different gene networks are employed. The results highlight the need to carefully interpret driver gene classification and emphasize the importance of using different gene networks. These findings underline the necessity of developing more robust computational approaches that account for network variability, ensuring greater reliability in driver gene identification and its applications in cancer research.
Title: Exploring the Influence of Gene Networks on Driver Gene Classification. Journal of Computational Biology, pp. 615-625.
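To make the idea of network-dependent results concrete, the toy sketch below scores genes with a deliberately simple rule (a gene's own mutation frequency plus that of its network neighbors), ranks them under two different networks, and measures the agreement of the top-k sets with the Jaccard index. The scoring rule, gene names, frequencies, and edges are all hypothetical; none of them comes from the methods evaluated in the article.

```python
def network_scores(mutation_freq, edges):
    """Toy score: own mutation frequency plus the frequencies of direct neighbors.

    Every gene named in edges must also be a key of mutation_freq.
    """
    neighbors = {gene: set() for gene in mutation_freq}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {
        gene: mutation_freq[gene] + sum(mutation_freq[n] for n in neighbors[gene])
        for gene in mutation_freq
    }


def top_k(scores, k):
    """The k highest-scoring genes, with ties broken alphabetically."""
    return set(sorted(scores, key=lambda gene: (-scores[gene], gene))[:k])


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


freq = {"A": 0.5, "B": 0.1, "C": 0.1, "D": 0.3}  # placeholder genes
network_1 = [("A", "B"), ("B", "D")]
network_2 = [("A", "C"), ("C", "D")]

top_1 = top_k(network_scores(freq, network_1), 2)  # genes A and B
top_2 = top_k(network_scores(freq, network_2), 2)  # genes A and C
print(jaccard(top_1, top_2))  # one shared gene out of three distinct genes
```

With identical mutation data, swapping the network changes which low-frequency gene is promoted, which is the kind of sensitivity the abstract reports.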
Pub Date: 2025-06-01; Epub Date: 2025-05-22; DOI: 10.1089/cmb.2024.0632
Mehmet Yıldırım, Savaş Sezik, Ayşe Başar
Accurate triage in emergency rooms is crucial for efficient patient care and resource allocation. We developed methods to predict triage levels using several traditional machine learning methods (logistic regression, random forest, XGBoost) and neural network-based deep learning approaches. These models were tested on a dataset of emergency department visits at a local Turkish hospital; this dataset consists of both structured and unstructured data. Compared with previous work, our challenge was to build a predictive model that uses documents written in Turkish and that handles specific aspects of the Turkish medical system. Text embedding techniques such as Bag of Words, Word2Vec, and BERT-based embeddings were used to process the unstructured patient complaints. We used a comprehensive set of features, including patient history data and disease diagnosis, within our predictive models, which included advanced neural network architectures such as convolutional neural networks, attention mechanisms, and long short-term memory networks. Our results revealed that BERT embeddings significantly enhanced the performance of neural network models, while Word2Vec embeddings showed slightly better results in traditional machine learning models. The most effective model was XGBoost combined with Word2Vec embeddings, achieving 86.7% AUC, 81.5% accuracy, and 68.7% weighted F1 score. We conclude that text embedding methods and machine learning methods are effective tools to predict emergency room triage levels. The integration of patient history into the models, alongside the strategic use of text embeddings, significantly improves predictive accuracy.
Title: Using Traditional and Deep Machine Learning to Predict Emergency Room Triage Levels. Journal of Computational Biology, pp. 584-600.
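The abstract reports a weighted F1 score next to accuracy. The sketch below shows the usual support-weighted definition in plain Python, assuming that is the variant meant; the function name and the toy labels are hypothetical. Each class's F1 is computed separately and then averaged with weights proportional to how often the class occurs in the true labels.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += (count / total) * f1
    return score


# Toy imbalanced example: the rare triage level 1 is never predicted.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                     # 0.75
print(weighted_f1(y_true, y_pred))  # about 0.643, lower than the accuracy
```

As the toy case shows, a class that is never predicted correctly contributes an F1 of zero, which is one way a weighted F1 can fall below accuracy on imbalanced data; the abstract does not report the class distribution, so this is only an illustration.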