Pub Date: 2024-08-13 DOI: 10.1109/TCBB.2024.3442669
Sachin Mathur, Hamid Mattoo, Ziv Bar-Joseph
Time series RNASeq studies can reveal the dynamics of disease progression and treatment response in patients. They also provide information on biomarkers, activated and repressed pathways, and more. However, data from multiple patients is challenging to integrate because treatment response is heterogeneous across patients and only a small number of time points is usually profiled. Because of this heterogeneity, integrating data across individuals by the sampled time points alone does not lead to a correct reconstruction of the response patterns. To address these challenges, we developed a new constraint-based pseudotime ordering method for analyzing transcriptomics data in clinical and response studies. Our method assigns samples to their correct placement on the response curve while respecting the individual patient order. We use polynomials to represent gene expression over the duration of the study and an EM algorithm to determine parameters and locations. Application to three treatment response datasets shows that our method improves on prior methods and leads to accurate orderings that provide new biological insight into the disease and response. Code for the method is available at https://github.com/Sanofi-Public/ RDCS-bulkRNASeq-pseudo ordering.
Title: Constrained Pseudo-time Ordering for Clinical Transcriptomics Data. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics (impact factor 3.6).
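The order constraint described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single gene, a known response curve, and a fixed grid of candidate pseudotimes, and the function name `assign_ordered` is hypothetical. Given one patient's samples in sampling order, a small dynamic program picks the pseudotimes that minimize squared error while never moving a later sample before an earlier one.

```python
def assign_ordered(values, curve, grid):
    """Assign each sample to a grid pseudotime, keeping the sampling order.

    values: one patient's expression values, in the order they were sampled.
    curve:  function t -> expected expression (assumed known for this sketch).
    grid:   increasing list of candidate pseudotimes.
    """
    n, m = len(values), len(grid)
    err = [[(v - curve(t)) ** 2 for t in grid] for v in values]
    cost = [err[0][:]]          # cost[i][j]: best total error with sample i at grid[j]
    back = []                   # back[i-1][j]: grid index chosen for sample i-1
    for i in range(1, n):
        row, ptr = [], []
        best, best_k = float("inf"), -1
        for j in range(m):
            # Running minimum over all positions k <= j of the previous sample.
            if cost[i - 1][j] < best:
                best, best_k = cost[i - 1][j], j
            row.append(best + err[i][j])
            ptr.append(best_k)
        cost.append(row)
        back.append(ptr)
    # Trace the best path backwards from the last sample.
    j = min(range(m), key=lambda k: cost[-1][k])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [grid[j] for j in path]


grid = [0.0, 0.5, 1.0, 1.5, 2.0]
quadratic = lambda t: t * t                       # toy polynomial response curve
in_order = assign_ordered([0.1, 0.9, 4.2], quadratic, grid)
out_of_order = assign_ordered([4.1, 0.1], quadratic, grid)
```

In a full EM-style procedure this assignment step would alternate with refitting the polynomial coefficients to the newly placed samples; that refit is omitted here.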
Pub Date: 2024-08-12 DOI: 10.1109/TCBB.2024.3440913
Yushan Qiu, Wensheng Chen, Wai-Ki Ching, Hongmin Cai, Hao Jiang, Quan Zou
Increasing evidence indicates that RNA-binding proteins (RBPs) play an essential role in mediating alternative splicing (AS) events during epithelial-mesenchymal transition (EMT). However, due to the substantial cost and complexity of biological experiments, how AS events are regulated and influenced remains largely unknown. Thus, it is important to construct effective models for inferring hidden RBP-AS event associations during the EMT process. In this paper, a novel and efficient model is developed to identify AS event-related candidate RBPs based on Adaptive Graph-based Multi-Label learning (AGML). In particular, we propose to adaptively learn a new affinity graph that captures the intrinsic structure of the data for both RBPs and AS events. Multi-view similarity matrices are employed to maintain the intrinsic structure and guide the adaptive graph learning. We then simultaneously update the RBP and AS event associations predicted from both spaces by applying multi-label learning. The experimental results show that AGML achieved AUC values of 0.9521 and 0.9873 under 5-fold and leave-one-out cross-validation, respectively, indicating the effectiveness of the proposed model. Furthermore, AGML can serve as an efficient and reliable tool for uncovering novel AS event-associated RBPs and is applicable to predicting associations between other biological entities. The source code of AGML is available at https://github.com/yushanqiu/AGML.
Title: AGML: Adaptive Graph-based Multi-label Learning for Prediction of RBP and AS Event Associations During EMT. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Pub Date: 2024-08-08 DOI: 10.1109/TCBB.2024.3371808
Xiaokang Zhou;Carson K. Leung;Kevin I-Kai Wang;Giancarlo Fortino
Deep learning and big data analysis are among the most important research topics in biomedical applications and digital healthcare. With the rapid development of artificial intelligence (AI) and Internet of Things (IoT) technologies, deep learning (DL) for big data analytics, including affective learning, reinforcement learning, and transfer learning, is widely applied to sense, learn from, and interact with human health. Examples of biomedical applications include smart biomaterials, biomedical imaging, heartbeat and blood pressure measurement, and eye tracking. These applications collect healthcare data through remote sensors and transfer the data to a centralized system for analysis. Given an enormous amount of historical data, DL and big data analysis technologies can identify potential links between features and possible risks, support important decisions in medical diagnosis, and provide valuable advice for better healthcare treatment and lifestyle. Although significant progress has been made with AI, DL, and big data analytics for medical and healthcare research, gaps remain between computer-aided treatment design and real-world healthcare demands. In addition, there are areas of healthcare and biomedical applications that cutting-edge AI and DL technologies have not yet explored. Hence, exploring the possibilities of DL and big data analytics in biomedical applications and digital healthcare is in high demand.
Title: Editorial Deep Learning-Empowered Big Data Analytics in Biomedical Applications and Digital Healthcare. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics. Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10631783
Because of their broad-spectrum and highly efficient antibacterial activity, antimicrobial peptides (AMPs) and their functions have been studied in the field of drug discovery. Detecting AMPs and their activities through biological experiments is costly, whereas computational methods do so for much less. Currently, most computational methods treat the identification of AMPs and of their activities as two independent tasks, which ignores the relationship between them. Combining the two tasks and sharing patterns between them is therefore a crucial problem that needs to be addressed. In this study, we propose a deep learning model, called DMAMP, that detects AMPs and their activities simultaneously and benefits from multi-task learning. The first stage uses convolutional neural network models and residual blocks to extract shared hidden features from the two related tasks. The next stage uses two fully connected layers to learn the information specific to each task. Meanwhile, the original evolutionary features of the peptide sequence are also fed to the predictor of the second task to recover information lost in earlier layers. Experiments on the independent test dataset demonstrate that our method outperforms the single-task model by 4.28% in Matthews Correlation Coefficient (MCC) on the first task, and achieves an average MCC of 0.2627 over five activities on the second task, which is higher than that of the single-task model and two existing methods. To understand whether features derived from the convolutional layers capture the differences between target classes, we visualize these high-dimensional features by projecting them into 3D space. In addition, we show that our predictor can identify peptides that are active against Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2). We hope that our proposed method can give new insights into the discovery of novel antiviral peptide drugs.
Title: DMAMP: A deep-learning model for detecting antimicrobial peptides and their multi-activities. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Pub Date: 2024-08-06 DOI: 10.1109/TCBB.2024.3439541
Qiaozhen Meng, Genlang Chen, Shixin Zheng, Yulai Lin, Bin Liu, Jijun Tang, Fei Guo
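The DMAMP results above are reported as Matthews Correlation Coefficient values. As a reminder of what that metric measures, here is a minimal sketch of the standard binary MCC formula computed from confusion-matrix counts. The function name `mcc` and the example counts are illustrative and do not come from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Binary Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any marginal total is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=50, tn=50, fp=0, fn=0)      # every prediction correct
chance = mcc(tp=25, tn=25, fp=25, fn=25)     # no better than random guessing
```

MCC ranges from -1 to 1 and stays informative on imbalanced classes, which is one reason it is often preferred over plain accuracy for peptide classification benchmarks.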
Pub Date: 2024-07-31 DOI: 10.1109/TCBB.2024.3425644
Xiuhao Fu, Hao Duan, Xiaofeng Zang, Chunling Liu, Xingfeng Li, Qingchen Zhang, Zilong Zhang, Quan Zou, Feifei Cui
Tuberculosis has plagued mankind since ancient times, and the struggle between humans and tuberculosis continues. Mycobacterium tuberculosis is the leading cause of tuberculosis, infecting nearly one-third of the world's population. The rise of peptide drugs has created a new direction in the treatment of tuberculosis, so the prediction of anti-tuberculosis peptides is crucial. This paper proposes an anti-tuberculosis peptide prediction method based on hybrid features and stacked ensemble learning. First, a random forest (RF) and an extremely randomized tree (ERT) are selected as the first-level learners of the stacked ensemble. Then, the five best-performing feature encoding methods are selected to obtain the hybrid feature vector, which is refined using a decision tree with recursive feature elimination (DT-RFE). After selection, the optimal feature subset is used as the input of the stacked ensemble model. Logistic regression (LR) is used as the second-level learner to build the final stacked ensemble model, Hyb_SEnc. Hyb_SEnc achieved prediction accuracies of 94.68% and 95.74% on the independent test sets of AntiTb_MD and AntiTb_RD, respectively. In addition, we provide a user-friendly Web server (http://www.bioailab.com/Hyb_SEnc). The source code is freely available at https://github.com/fxh1001/Hyb_SEnc.
Title: Hyb_SEnc: An Antituberculosis Peptide Predictor Based on a Hybrid Feature Vector and Stacked Ensemble Learning. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
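The pipeline outlined in the Hyb_SEnc abstract (DT-RFE feature selection, RF and ERT first-level learners, LR second-level learner) can be sketched with scikit-learn. This is only an illustration under stated assumptions, not the authors' code: the data is synthetic, the peptide feature encodings are replaced by random features, and every hyperparameter (number of trees, number of selected features, fold count) is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a hybrid peptide feature matrix (assumption).
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# First level: random forest and extremely randomized trees.
# Second level: logistic regression on their cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("ert", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)

# Decision-tree-driven recursive feature elimination, then the stacked model.
model = Pipeline(
    [
        ("select", RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=10)),
        ("stack", stack),
    ]
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Placing the selector inside the `Pipeline` means feature elimination is refit on the training split only, which avoids leaking test information into the selected subset.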
Pub Date: 2024-07-29 DOI: 10.1109/TCBB.2024.3434992
Xiaoli Lin, Zhuang Yin, Xiaolong Zhang, Jing Hu
Accurate prediction of drug-drug interactions (DDIs) plays an important role in improving the efficiency of drug development and ensuring the safety of combination therapy. Most existing models rely on a single source of information to predict DDIs, and few models can perform tasks on biomedical knowledge graphs. This paper proposes a new hybrid method, namely Knowledge Graph Representation Learning and Feature Fusion (KGRLFF), to fully exploit the information from the biomedical knowledge graph and molecular structure of drugs to better predict DDIs. KGRLFF first uses a Bidirectional Random Walk sampling method based on the PageRank algorithm (BRWP) to obtain higher-order neighborhood information of drugs in the knowledge graph, including neighboring nodes, semantic relations, and higher-order information associated with triple facts. Then, an embedded representation learning model named Knowledge Graph-based Cyclic Recursive Aggregation (KGCRA) is used to learn the embedded representations of drugs by recursively propagating and aggregating messages with drugs as both the source and destination. In addition, the model learns the molecular structures of the drugs to obtain the structured features. Finally, a Feature Representation Fusion Strategy (FRFS) was developed to integrate embedded representations and structured feature representations. Experimental results showed that KGRLFF is feasible for predicting potential DDIs.
Title: KGRLFF: Detecting Drug-Drug Interactions Based on Knowledge Graph Representation Learning and Feature Fusion. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
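The KGRLFF abstract mentions a random-walk sampler built on the PageRank algorithm. The sketch below shows only plain PageRank by power iteration on a toy directed graph, to make the underlying scoring idea concrete. It is not the paper's BRWP sampler, and the node names, damping factor, and iteration count are assumptions chosen for illustration.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Plain PageRank by power iteration.

    adj maps every node to the list of nodes it links to; each node must
    appear as a key, even if its list is empty.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = damping * rank[v] / len(out)
                for u in out:
                    new[u] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank


# Toy graph: three hypothetical drugs point at one shared target node.
toy_graph = {
    "drug_A": ["target"],
    "drug_B": ["target"],
    "drug_C": ["target"],
    "target": ["drug_A"],
}
scores = pagerank(toy_graph)
top_node = max(scores, key=scores.get)
```

A sampler in this spirit could use such scores to bias which neighbors a walk visits, so that well-connected entities are reached more often than peripheral ones.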
Predicting biomolecular interactions is important for understanding biological systems. Most existing methods for link prediction are based on graph convolution. Although graph convolution methods are good at extracting structural information from biomolecular interactions, two key challenges remain. One is how to consider both immediate and high-order neighbors. The other is how to reduce noise when aggregating high-order neighbors. To address these challenges, we propose a novel method, called mixed high-order graph convolution with filter network via LSTM and channel attention (HGLA), to predict biomolecular interactions. First, the basic and high-order features are extracted through a traditional graph convolutional network (GCN) and the two-layer Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing (MixHop), respectively. Second, these features are mixed and passed to a filter network composed of LayerNorm, SENet, and LSTM to generate filtered features, which are concatenated and used for link prediction. The advantages of HGLA are: 1) HGLA processes high-order features separately, rather than simply concatenating them; 2) HGLA better balances the basic features and high-order features; 3) HGLA effectively filters the noise from high-order neighbors. It outperforms state-of-the-art networks on four benchmark datasets. The code is available at https://github.com/zznb123/HGLA.
Title: HGLA: Biomolecular Interaction Prediction based on Mixed High-Order Graph Convolution with Filter Network via LSTM and Channel Attention. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Pub Date: 2024-07-26 DOI: 10.1109/TCBB.2024.3434399
Zhen Zhang, Zhaohong Deng, Ruibo Li, Wei Zhang, Qiongdan Lou, Kup-Sze Choi, Shitong Wang
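The distinction the HGLA abstract draws between immediate and high-order neighbors can be made concrete with a small sketch of MixHop-style neighborhood mixing: features propagated through successive powers of the adjacency matrix are concatenated per node. This illustration assumes an unnormalized adjacency matrix without self-loops and omits the learned weights, the filter network, and everything else specific to HGLA. The helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mix_hops(adjacency, features, max_power=2):
    """Concatenate A^0 X, A^1 X, ..., A^max_power X for every node."""
    mixed = [row[:] for row in features]      # power 0: the node's own features
    propagated = features
    for _ in range(max_power):
        propagated = matmul(adjacency, propagated)
        mixed = [m + p for m, p in zip(mixed, propagated)]
    return mixed


# Path graph 0 - 1 - 2 with one-hot node features.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
features = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
mixed = mix_hops(adjacency, features)
```

For node 0 the last three columns hold the two-hop block, where node 2 becomes visible even though the two nodes share no edge. Those same columns also show the walk that returns to node 0, a simple example of the noise that high-order aggregation can introduce.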
Pub Date: 2024-07-26 DOI: 10.1109/TCBB.2024.3434340
Fadi Shehadeh, LewisOscar Felix, Markos Kalligeros, Adnan Shehadeh, Beth Burgwyn Fuchs, Frederick M Ausubel, Paul P Sotiriadis, Eleftherios Mylonakis
Background: Antimicrobial resistance is a major public health threat, and new agents are needed. Computational approaches have been proposed to reduce the cost and time needed for compound screening.
Aims: A machine learning (ML) model was developed for the in silico screening of low molecular weight molecules.
Methods: We used the results of a high-throughput Caenorhabditis elegans methicillin-resistant Staphylococcus aureus (MRSA) liquid infection assay to develop ML models for compound prioritization and quality control.
Results: The compound prioritization model achieved an AUC of 0.795 with a sensitivity of 81% and a specificity of 70%. When applied to a validation set of 22,768 compounds, the model identified 81% of the active compounds identified by high-throughput screening (HTS) among only 30.6% of the total 22,768 compounds, resulting in a 2.67-fold increase in hit rate. When we retrained the model on all the compounds of the HTS dataset, it further identified 45 discordant molecules classified as non-hits by the HTS, with 42/45 (93%) having known antimicrobial activity.
Conclusion: Our ML approach can be used to increase HTS efficiency by reducing the number of compounds that need to be physically screened and identifying potential missed hits, making HTS more accessible and reducing barriers to entry.
Title: Machine learning-assisted high-throughput screening for Anti-MRSA compounds. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
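The hit-rate gain reported in the Results above follows from two numbers: the share of actives the model recovers and the share of the library it asks to screen. The short sketch below redoes that arithmetic with the rounded figures from the abstract. The function name `fold_enrichment` is illustrative.

```python
def fold_enrichment(fraction_of_actives_found, fraction_of_library_screened):
    """Hit-rate gain from screening only the model-selected compounds."""
    return fraction_of_actives_found / fraction_of_library_screened


# Rounded figures from the abstract: 81% of actives within 30.6% of compounds.
fold = fold_enrichment(0.81, 0.306)

# Discordant molecules with known antimicrobial activity: 42 of 45.
known_active_pct = 100 * 42 / 45
```

With these rounded inputs the ratio comes out at about 2.65, close to the reported 2.67-fold increase. The small gap may reflect rounding of the published percentages. The 42/45 figure rounds to the reported 93%.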
Pub Date : 2024-07-26 DOI: 10.1109/TCBB.2024.3434461
Siyuan Guo, Jihong Guan, Shuigeng Zhou
Over the past decade, Artificial Intelligence (AI)-driven drug design and discovery has been an active research topic, and an important branch of it is molecule generation with generative models, from GAN-based and VAE-based models to the latest diffusion-based models. However, most existing models mainly pursue basic properties of the generated molecules, such as validity and uniqueness; only a few go further and explicitly optimize a single important molecular property (e.g., QED or PlogP), which leaves most generated molecules of little practical use. In this paper, we present a novel approach to generating molecules with desirable properties that extends the diffusion model framework with several new designs. The novelty is two-fold. On the one hand, because molecular structures are complex and diverse and molecular properties are usually determined by certain substructures (e.g., pharmacophores), we propose to perform diffusion on two structural levels, molecules and molecular fragments, which yields a mixed Gaussian distribution for the reverse diffusion process. To obtain desirable molecular fragments, we develop a novel fragmentation method based on electronic effects. On the other hand, we introduce two ways to explicitly optimize multiple molecular properties under the diffusion model framework. First, as potential drug molecules must be chemically valid, we optimize molecular validity with an energy-guidance function. Second, since potential drug molecules should be desirable in several properties, we employ a multi-objective mechanism to optimize multiple molecular properties simultaneously. Extensive experiments on two benchmark datasets, QM9 and ZINC250k, show that the molecules generated by our proposed method have better validity, uniqueness, novelty, Fréchet ChemNet Distance (FCD), QED, and PlogP than those generated by current state-of-the-art (SOTA) models.
The code of D2L-OMP is available at https://github.com/bz99bz/D2L-OMP.
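The energy-guidance idea mentioned in this abstract can be illustrated with a toy sketch. It is not the D2L-OMP implementation: the quadratic `energy`, the one-dimensional state, and the names `guided_mean`, `guided_reverse_step`, `scale`, and `target` are all assumptions chosen for illustration. The sketch only shows the general pattern of shifting the mean of a Gaussian reverse-diffusion step against the gradient of an energy function before sampling.

```python
import random


def energy(x: float, target: float = 1.0) -> float:
    """Hypothetical quadratic energy: lowest when x equals `target`."""
    return 0.5 * (x - target) ** 2


def energy_grad(x: float, target: float = 1.0) -> float:
    """Analytic gradient of the quadratic energy above."""
    return x - target


def guided_mean(mean: float, sigma: float, scale: float, target: float = 1.0) -> float:
    """Shift the reverse-step mean downhill in energy: mu - s * sigma^2 * dE/dx."""
    return mean - scale * sigma ** 2 * energy_grad(mean, target)


def guided_reverse_step(mean: float, sigma: float, scale: float,
                        rng: random.Random, target: float = 1.0) -> float:
    """Sample the next state from a Gaussian centred on the guided mean."""
    return guided_mean(mean, sigma, scale, target) + sigma * rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    print(guided_reverse_step(mean=0.0, sigma=0.5, scale=2.0, rng=rng))
```

With `scale = 0` the step reduces to an unguided Gaussian sample; larger values pull samples more strongly toward low-energy states, at the cost of diversity. In a real molecular model the energy would be defined over molecular graphs or coordinates rather than a scalar.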
Pub Date : 2024-07-24 DOI: 10.1109/TCBB.2024.3432740
Gao-Fei Wang, Juan Wang, Shasha Yuan, Chun-Hou Zheng, Jin-Xing Liu
Since genomics was proposed, the exploration of genes has been a focus of research. The emergence of single-cell RNA sequencing (scRNA-seq) technology makes it possible to explore gene expression at the single-cell level. Due to the limitations of sequencing technology, the data contain a lot of noise and are also high-dimensional and sparse. Clustering is a common method for analyzing scRNA-seq data. This paper proposes a novel single-cell clustering method called Robust Manifold Nonnegative Low-Rank Representation with Adaptive Total-Variation Regularization (MLRR-ATV). Adaptive Total-Variation (ATV) regularization is introduced into the Low-Rank Representation (LRR) model to reduce the influence of noise through gradient learning. Then, the linear and nonlinear manifold structures in the data are learned through Euclidean distance and cosine similarity, so that more valuable information is retained. Because the model is non-convex, we use the Alternating Direction Method of Multipliers (ADMM) to optimize it. We tested the MLRR-ATV model on eight real scRNA-seq datasets against nine state-of-the-art methods. The experimental results show that MLRR-ATV performs better than the other nine methods.
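The two similarity measures named in this abstract can be sketched as follows. This is only an illustration of the measures themselves; how MLRR-ATV turns them into manifold regularizers is not stated in the abstract, and the function names and the toy expression vectors below are assumptions.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two expression profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of genes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two profiles; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of genes")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty (all-dropout) cells
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical cells: cell_b is cell_a sequenced at twice the depth.
    cell_a = [1.0, 2.0, 0.0]
    cell_b = [2.0, 4.0, 0.0]
    print(euclidean_distance(cell_a, cell_b))
    print(cosine_similarity(cell_a, cell_b))
```

The toy cells show why the two measures are complementary: profiles that differ only in overall magnitude have a nonzero Euclidean distance but a cosine similarity of essentially 1, so each measure captures structure the other misses.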