Pub Date: 2024-10-18 DOI: 10.1186/s12859-024-05948-7
Cezary Turek, Márton Ölbei, Tamás Stirling, Gergely Fekete, Ervin Tasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, Eszter Ari
Traditional gene set enrichment analyses are typically limited to a few ontologies and do not account for the interdependence of gene sets or terms, resulting in overcorrected p-values. To address these challenges, we introduce mulea, an R package offering comprehensive overrepresentation and functional enrichment analysis. mulea employs a progressive empirical false discovery rate (eFDR) method, specifically designed for interconnected biological data, to accurately identify significant terms within diverse ontologies. mulea expands beyond traditional tools by incorporating a wide range of ontologies, encompassing Gene Ontology, pathways, regulatory elements, genomic locations, and protein domains. This flexibility enables researchers to tailor enrichment analysis to their specific questions, such as identifying enriched transcriptional regulators in gene expression data or overrepresented protein domains in protein sets. To facilitate seamless analysis, mulea provides gene sets (in standardised GMT format) for 27 model organisms, covering 22 ontology types from 16 databases and various identifiers resulting in almost 900 files. Additionally, the muleaData ExperimentData Bioconductor package simplifies access to these pre-defined ontologies. Finally, mulea's architecture allows for easy integration of user-defined ontologies, or GMT files from external sources (e.g., MSigDB or Enrichr), expanding its applicability across diverse research areas. mulea is distributed as a CRAN R package downloadable from https://cran.r-project.org/web/packages/mulea/ and https://github.com/ELTEbioinformatics/mulea . It offers researchers a powerful and flexible toolkit for functional enrichment analysis, addressing limitations of traditional tools with its progressive eFDR and by supporting a variety of ontologies. Overall, mulea fosters the exploration of diverse biological questions across various model organisms.
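As a rough illustration of the overrepresentation step that tools of this kind build on, the following sketch parses gene sets in GMT format (set name, description, then member genes, tab-separated) and computes a classical hypergeometric p-value for each set. It is not mulea's code and does not implement the eFDR procedure; the function names and the toy data are hypothetical.

```python
from math import comb


def parse_gmt(text):
    """Parse GMT text: one gene set per line, tab-separated as
    set name, description, then the member genes."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        gene_sets[fields[0]] = set(fields[2:])
    return gene_sets


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from a background of N genes,
    K of which belong to the gene set."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def overrepresentation(gene_sets, targets, background):
    """Return an uncorrected hypergeometric p-value per gene set."""
    background = set(background)
    targets = set(targets) & background
    p_values = {}
    for name, members in gene_sets.items():
        members = members & background
        overlap = len(members & targets)
        p_values[name] = hypergeom_upper_tail(
            overlap, len(members), len(targets), len(background)
        )
    return p_values


# Toy data (hypothetical): ten background genes, two gene sets.
gmt_text = "setA\tdemo set\tg1\tg2\tg3\nsetB\tdemo set\tg8\tg9\n"
background = [f"g{i}" for i in range(1, 11)]
p_values = overrepresentation(parse_gmt(gmt_text), ["g1", "g2", "g3", "g4"], background)
```

The p-values returned here are uncorrected; a real analysis would still need a multiple-testing step such as the eFDR approach the abstract describes.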
{"title":"mulea: An R package for enrichment analysis using multiple ontologies and empirical false discovery rate.","authors":"Cezary Turek, Márton Ölbei, Tamás Stirling, Gergely Fekete, Ervin Tasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, Eszter Ari","doi":"10.1186/s12859-024-05948-7","DOIUrl":"https://doi.org/10.1186/s12859-024-05948-7","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"334"},"PeriodicalIF":2.9,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11490090/pdf/","platform":"Semanticscholar","paperid":"142457214","RegionNum":3,"RegionCategory":"Biology","PMCID":"OA","Total":0}
Pub Date: 2024-10-15 DOI: 10.1186/s12859-024-05954-9
Deyan Yordanov Yosifov, Michaela Reichenzeller, Stephan Stilgenbauer, Daniel Mertens
Background: The dilution-replicate experimental design for qPCR assays is especially efficient. It is based on multiple linear regression of multiple 3-point standard curves that are derived from the experimental samples themselves and thus obviates the need for a separate standard curve produced by serial dilution of a standard. The method minimizes the total number of reactions and guarantees that Cq values are within the linear dynamic range of the dilution-replicate standard curves. However, the lack of specialized software has so far precluded the widespread use of the dilution-replicate approach.
Results: Here we present repDilPCR, the first tool that utilizes the dilution-replicate method and extends it by adding the possibility to use multiple reference genes. repDilPCR offers extensive statistical and graphical functions that can also be used with preprocessed data (relative expression values) obtained by usual assay designs and evaluation methods. repDilPCR has been designed with the philosophy to automate and speed up data analysis (typically less than a minute from Cq values to publication-ready plots), and features automatic selection and performance of appropriate statistical tests, at least in the case of one-factor experimental designs. Nevertheless, the program also allows users to export intermediate data and perform more sophisticated analyses with external statistical software, e.g. if two-way ANOVA is necessary.
Conclusions: repDilPCR is a user-friendly tool that can contribute to more efficient planning of qPCR experiments and their robust analysis. A public web server is freely accessible at https://repdilpcr.eu without registration. The program can also be used as an R script or as a locally installed Shiny app, which can be downloaded from https://github.com/deyanyosifov/repDilPCR where the source code is also available.
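The regression idea described in the Background can be sketched as follows, under the assumption (ours, not stated in the abstract) that all samples share one slope while each sample gets its own intercept. The pooled within-sample estimator below is the least-squares solution for that model. This is not repDilPCR's code; the function names and the Cq values are hypothetical.

```python
def fit_shared_slope(curves):
    """curves maps sample name -> list of (log10 relative concentration, Cq).
    Fits Cq = intercept[sample] + slope * log10(concentration) with one
    slope shared by all samples (ordinary least squares)."""
    sxy = 0.0
    sxx = 0.0
    means = {}
    for sample, points in curves.items():
        x_mean = sum(x for x, _ in points) / len(points)
        y_mean = sum(y for _, y in points) / len(points)
        means[sample] = (x_mean, y_mean)
        sxy += sum((x - x_mean) * (y - y_mean) for x, y in points)
        sxx += sum((x - x_mean) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercepts = {s: y_mean - slope * x_mean for s, (x_mean, y_mean) in means.items()}
    return slope, intercepts


def amplification_efficiency(slope):
    """Efficiency implied by the slope; 1.0 means perfect doubling per cycle."""
    return 10 ** (-1.0 / slope) - 1.0


def relative_quantity(intercepts, sample, reference, slope):
    """Starting quantity of `sample` relative to `reference`."""
    return 10 ** ((intercepts[sample] - intercepts[reference]) / slope)


# Hypothetical 3-point curves: undiluted, 1:10 and 1:100 dilutions.
curves = {
    "treated": [(0.0, 20.0), (-1.0, 23.32), (-2.0, 26.64)],
    "control": [(0.0, 25.0), (-1.0, 28.32), (-2.0, 31.64)],
}
slope, intercepts = fit_shared_slope(curves)
fold_change = relative_quantity(intercepts, "treated", "control", slope)
```

Because each intercept is the fitted Cq of the undiluted sample, differences between intercepts divided by the shared slope give log10 ratios of starting quantities without a separate standard curve.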
{"title":"repDilPCR: a tool for automated analysis of qPCR assays by the dilution-replicate method.","authors":"Deyan Yordanov Yosifov, Michaela Reichenzeller, Stephan Stilgenbauer, Daniel Mertens","doi":"10.1186/s12859-024-05954-9","DOIUrl":"https://doi.org/10.1186/s12859-024-05954-9","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"331"},"PeriodicalIF":2.9,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476982/pdf/","platform":"Semanticscholar","paperid":"142485691","RegionNum":3,"RegionCategory":"Biology","PMCID":"OA","Total":0}
Pub Date: 2024-10-15 DOI: 10.1186/s12859-024-05866-8
Pritam Chakraborty, Anjan Bandyopadhyay, Preeti Padma Sahu, Aniket Burman, Saurav Mallik, Najah Alsubaie, Mohamed Abbas, Mohammed S Alqahtani, Ben Othman Soufiene
Stroke prediction remains a critical area of research in healthcare, aiming to enhance early intervention and patient care strategies. This study investigates the efficacy of machine learning techniques, particularly principal component analysis (PCA) and a stacking ensemble method, for predicting stroke occurrences based on demographic, clinical, and lifestyle factors. We systematically varied the number of PCA components and implemented a stacking model comprising random forest, decision tree, and K-nearest neighbors (KNN). Our findings demonstrate that setting the number of PCA components to 16 gave the best predictive accuracy, reaching 98.6% in stroke prediction. Evaluation metrics underscored the robustness of our approach in handling class imbalance and improving model performance. Comparative analyses against traditional machine learning algorithms such as SVM, logistic regression, and Naive Bayes highlighted the superiority of our proposed method.
{"title":"Predicting stroke occurrences: a stacked machine learning approach with feature selection and data preprocessing.","authors":"Pritam Chakraborty, Anjan Bandyopadhyay, Preeti Padma Sahu, Aniket Burman, Saurav Mallik, Najah Alsubaie, Mohamed Abbas, Mohammed S Alqahtani, Ben Othman Soufiene","doi":"10.1186/s12859-024-05866-8","DOIUrl":"https://doi.org/10.1186/s12859-024-05866-8","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"329"},"PeriodicalIF":2.9,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476080/pdf/","platform":"Semanticscholar","paperid":"142457215","RegionNum":3,"RegionCategory":"Biology","PMCID":"OA","Total":0}
Pub Date: 2024-10-15 DOI: 10.1186/s12859-024-05898-0
Lucas Schneider, Peter Minary
Base editing is an enhanced gene editing approach that enables the precise transformation of single nucleotides and has the potential to cure rare diseases. The design process of base editors is labour-intensive and outcomes are not easily predictable. For any clinical use, base editing has to be accurate and efficient. Thus, any bystander mutations have to be minimized. In recent years, computational models to predict base editing outcomes have been developed. However, the overall robustness and performance of those models are limited. One way to improve the performance is to train models on a diverse, feature-rich, and large dataset, which does not exist for the base editing field. Hence, we develop BE-dataHIVE, a MySQL database that covers over 460,000 gRNA target combinations. The current version of BE-dataHIVE consists of data from five studies and is enriched with melting temperatures and energy terms. Furthermore, multiple different data structures for machine learning were computed and are directly available. The database can be accessed via our website https://be-datahive.com/ or API and is therefore suitable for practitioners and machine learning researchers.
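To give a feel for the kind of sequence-derived feature mentioned above, here is a minimal sketch of a melting-temperature estimate. The abstract does not say how BE-dataHIVE computes its melting temperatures; the Wallace rule used here (2 °C per A/T plus 4 °C per G/C) is only a textbook approximation intended for short oligonucleotides, and the function name and example sequence are hypothetical.

```python
def wallace_tm(sequence):
    """Rough melting temperature in degrees Celsius by the Wallace rule:
    2 * (number of A and T) + 4 * (number of G and C).
    Illustrative only; not the method used by any particular database."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G and T")
    return 2 * at + 4 * gc


# Hypothetical 20-nt protospacer used only as example input.
example_tm = wallace_tm("GACCGTTAGCATCGGATACC")
```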
{"title":"Be-dataHIVE: a base editing database.","authors":"Lucas Schneider, Peter Minary","doi":"10.1186/s12859-024-05898-0","DOIUrl":"https://doi.org/10.1186/s12859-024-05898-0","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"330"},"PeriodicalIF":2.9,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476525/pdf/","platform":"Semanticscholar","paperid":"142457210","RegionNum":3,"RegionCategory":"Biology","PMCID":"OA","Total":0}
Background: Long non-coding RNAs (lncRNAs) can prevent, diagnose, and treat a variety of complex human diseases, and it is crucial to establish a method to efficiently predict lncRNA-disease associations.
Results: In this paper, we propose a prediction method for the lncRNA-disease association relationship, named LDAGM, which is based on the Graph Convolutional Autoencoder and Multilayer Perceptron model. The method first extracts the functional similarity and Gaussian interaction profile kernel similarity of lncRNAs and miRNAs, as well as the semantic similarity and Gaussian interaction profile kernel similarity of diseases. It then constructs six homogeneous networks and deeply fuses them using a deep topology feature extraction method. The fused networks facilitate feature complementation and deep mining of the original association relationships, capturing the deep connections between nodes. Next, by combining the obtained deep topological features with the similarity network of lncRNA, disease, and miRNA interactions, we construct a multi-view heterogeneous network model. The Graph Convolutional Autoencoder is employed for nonlinear feature extraction. Finally, the extracted nonlinear features are combined with the deep topological features of the multi-view heterogeneous network to obtain the final feature representation of the lncRNA-disease pair. Prediction of the lncRNA-disease association relationship is performed using the Multilayer Perceptron model. To enhance the performance and stability of the Multilayer Perceptron model, we introduce a hidden layer called the aggregation layer in the Multilayer Perceptron model. Through a gate mechanism, it controls the flow of information between each hidden layer in the Multilayer Perceptron model, aiming to achieve optimal feature extraction from each hidden layer.
Conclusions: Parameter analysis, ablation studies, and comparison experiments verified the effectiveness of this method, and case studies verified the accuracy of this method in predicting lncRNA-disease association relationships.
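The Gaussian interaction profile (GIP) kernel similarity mentioned in the Results can be sketched as below. The abstract gives no formula, so this assumes the commonly used definition: each entity is represented by its row of a binary association matrix, and similarity is exp(-gamma * squared distance), with gamma normalised by the mean squared norm of the profiles. The function name, the `gamma_prime` parameter and the toy matrix are hypothetical.

```python
from math import exp


def gip_kernel(profiles, gamma_prime=1.0):
    """profiles: list of interaction vectors, one per entity
    (for example one row per lncRNA, one column per disease).
    Returns the symmetric GIP similarity matrix as a list of lists."""
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("at least one profile must contain an association")
    gamma = gamma_prime / mean_sq_norm

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        [exp(-gamma * sq_dist(profiles[i], profiles[j])) for j in range(n)]
        for i in range(n)
    ]


# Toy association matrix (hypothetical): 3 lncRNAs x 3 diseases.
associations = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
similarity = gip_kernel(associations)
```

Entities with identical association profiles get similarity 1, and the value decays towards 0 as the profiles diverge.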
{"title":"LDAGM: prediction lncRNA-disease asociations by graph convolutional auto-encoder and multilayer perceptron based on multi-view heterogeneous networks.","authors":"Bing Zhang, Haoyu Wang, Chao Ma, Hai Huang, Zhou Fang, Jiaxing Qu","doi":"10.1186/s12859-024-05950-z","DOIUrl":"https://doi.org/10.1186/s12859-024-05950-z","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"332"},"PeriodicalIF":2.9,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11481433/pdf/","platform":"Semanticscholar","paperid":"142457213","RegionNum":3,"RegionCategory":"Biology","PMCID":"OA","Total":0}
Background: The rapid advancements in deep neural network models have significantly enhanced the ability to extract features from microbial sequence data, which is critical for addressing biological challenges. However, the scarcity and complexity of labeled microbial data pose substantial difficulties for supervised learning approaches. To address these issues, we propose DNASimCLR, an unsupervised framework designed for efficient gene sequence data feature extraction.
Results: DNASimCLR leverages convolutional neural networks and the SimCLR framework, based on contrastive learning, to extract intricate features from diverse microbial gene sequences. Pre-training was conducted on two classic large-scale unlabelled datasets encompassing metagenomes and viral gene sequences. Subsequent classification tasks were performed by fine-tuning the pretrained model. Our experiments demonstrate that DNASimCLR is at least comparable to state-of-the-art techniques for gene sequence classification. Among convolutional neural network-based approaches, DNASimCLR surpasses the latest existing methods, establishing its superiority over state-of-the-art CNN-based feature extraction techniques. Furthermore, the model exhibits superior performance across diverse tasks in analyzing biological sequence data, showcasing its robust adaptability.
Conclusions: DNASimCLR represents a robust and database-agnostic solution for gene sequence classification. Its versatility allows it to perform well in scenarios involving novel or previously unseen gene sequences, making it a valuable tool for diverse applications in genomics.
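The contrastive objective that SimCLR-style pre-training optimises can be sketched in a few lines. This is the standard normalised temperature-scaled cross-entropy (NT-Xent) loss written in plain Python for readability; it is not DNASimCLR's code, and the pairing convention, temperature value and toy embeddings are assumptions made for the example.

```python
from math import exp, log, sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def nt_xent(z, temperature=0.5):
    """NT-Xent loss over 2N embeddings, where z[2k] and z[2k + 1] are
    assumed to be two augmented views of the same sequence k."""
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the other view of the same sequence
        positive = exp(cosine(z[i], z[j]) / temperature)
        # Denominator runs over every other embedding in the batch.
        denominator = sum(
            exp(cosine(z[i], z[k]) / temperature) for k in range(n) if k != i
        )
        total += -log(positive / denominator)
    return total / n


# Toy embeddings (hypothetical): two sequences, two views each.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss = nt_xent(aligned)
```

The loss is low when the two views of each sequence map to nearby embeddings and high when a view is closer to other sequences, which is what lets the encoder learn from unlabelled data.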
{"title":"DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification.","authors":"Minghao Yang, Zehua Wang, Zizhuo Yan, Wenxiang Wang, Qian Zhu, Changlong Jin","doi":"10.1186/s12859-024-05955-8","DOIUrl":"https://doi.org/10.1186/s12859-024-05955-8","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"328"},"PeriodicalIF":2.9,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476100/pdf/","platform":"Semanticscholar","paperid":"142457212","RegionNum":3,"RegionCategory":"Biology","PMCID":"OA","Total":0}
Pub Date: 2024-10-10 DOI: 10.1186/s12859-024-05925-0
Samar Monem, Aboul Ella Hassanien, Alaa H Abdel-Hamid
Background: Drug combination treatments have proven to be a realistic technique for treating challenging diseases such as cancer by enhancing efficacy and mitigating side effects. To achieve the therapeutic goals of these combinations, it is essential to employ multi-targeted drug combinations, which maximize effectiveness and synergistic effects.
Results: This paper proposes 'MultiComb', a multi-task deep learning (MTDL) model designed to simultaneously predict the synergy and sensitivity of drug combinations. The model uses a graph convolution network to represent the Simplified Molecular-Input Line-Entry System (SMILES) strings of two drugs, generating their respective features. In addition, three fully connected subnetworks extract features of the cancer cell line. These drug and cell line features are then concatenated and processed through an attention mechanism, which outputs two optimized feature representations for the target tasks. The cross-stitch model learns the relationship between these tasks. Finally, each learned task feature is fed into fully connected subnetworks to predict the synergy and sensitivity scores. The proposed model is validated using the O'Neil benchmark dataset, which includes 38 unique drugs combined to form 17,901 drug combination pairs and tested across 37 unique cancer cells. The model's performance is evaluated using mean square error (MSE), mean absolute error (MAE), coefficient of determination (R²), and Spearman and Pearson correlation scores. For synergy, the proposed model's mean scores on these five metrics are 232.37, 9.59, 0.57, 0.76, and 0.73, respectively; for sensitivity, they are 15.59, 2.74, 0.90, 0.95, and 0.95.
Conclusion: This paper proposes an MTDL model to predict synergy and sensitivity scores for drug combinations targeting specific cancer cell lines. The MTDL model demonstrates superior performance compared to existing approaches, providing better results.
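The cross-stitch step named in the Results can be illustrated with a small sketch. In a generic cross-stitch unit, each task's output features are a learned linear mix of both tasks' input features. The version below uses one scalar weight per task pair and plain lists; the weight values, function name and feature vectors are hypothetical, and the abstract does not specify how MultiComb parameterises this unit.

```python
def cross_stitch(x_a, x_b, alpha):
    """Mix the feature vectors of two tasks.
    alpha is a 2x2 matrix [[a_aa, a_ab], [a_ba, a_bb]] of mixing weights
    (learned during training in a real model, fixed here)."""
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    mixed_a = [a_aa * xa + a_ab * xb for xa, xb in zip(x_a, x_b)]
    mixed_b = [a_ba * xa + a_bb * xb for xa, xb in zip(x_a, x_b)]
    return mixed_a, mixed_b


# Hypothetical task features for synergy and sensitivity.
synergy_features = [1.0, 2.0]
sensitivity_features = [3.0, 4.0]
alpha = [[0.9, 0.1], [0.2, 0.8]]
synergy_mixed, sensitivity_mixed = cross_stitch(
    synergy_features, sensitivity_features, alpha
)
```

With an identity matrix for `alpha` the two tasks stay independent; off-diagonal weights control how much each task borrows from the other.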
{"title":"A multi-task graph deep learning model to predict drugs combination of synergy and sensitivity scores.","authors":"Samar Monem, Aboul Ella Hassanien, Alaa H Abdel-Hamid","doi":"10.1186/s12859-024-05925-0","DOIUrl":"10.1186/s12859-024-05925-0","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"327"},"PeriodicalIF":2.9,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11468365/pdf/","platform":"Semanticscholar","paperid":"142399244","RegionNum":3,"RegionCategory":"Biology","PMCID":"OA","Total":0}
Pub Date: 2024-10-09 DOI: 10.1186/s12859-024-05896-2
Fei-Man Hsu, Paul Horton
Background: Some transcription factors, MYC for example, bind sites of potentially methylated DNA. This may increase binding specificity as such sites are (1) highly under-represented in the genome, and (2) offer additional, tissue specific information in the form of hypo- or hyper-methylation. Fortunately, bisulfite sequencing data can be used to investigate this phenomenon.
Method: We developed MethylSeqLogo, an extension of sequence logos that adds new elements to indicate DNA methylation and under-represented dimers at each position of a set of binding sites. Our method displays information from both DNA strands and takes into account the sequence context (CpG or other) and genome region (promoter versus whole genome) needed to properly assess the expected background dimer frequency and level of methylation. MethylSeqLogo preserves sequence logo semantics: the relative height of nucleotides within a column represents their proportion in the binding sites, the absolute height of each column represents information (relative entropy), and the height of all columns added together represents total information.
Results: We present figures illustrating the utility of MethylSeqLogo for summarizing data from several CpG-binding transcription factors. The logos show that unmethylated CpG binding sites are a feature of transcription factors such as MYC and ZBTB33, while some other CpG-binding transcription factors, such as CEBPB, appear methylation neutral.
Conclusions: Our software enables users to explore bisulfite and ChIP sequencing data sets and, in the process, obtain publication-quality figures.
MethylSeqLogo: DNA methylation smart sequence logos. Fei-Man Hsu, Paul Horton. BMC Bioinformatics 25(Suppl 2):326. Published 2024-10-09. DOI: 10.1186/s12859-024-05896-2. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11462690/pdf/
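The sequence logo semantics described in the MethylSeqLogo Method paragraph can be illustrated with a small worked example. The sketch below is a simplified illustration under stated assumptions, not MethylSeqLogo's implementation: it uses a uniform single-nucleotide background (the abstract describes context- and region-specific dimer backgrounds plus methylation), and the function names and the four toy binding sites are made up for this example.

```python
from math import log2

def column_information(probs, background):
    """Relative entropy (bits) of one column against the background."""
    return sum(p * log2(p / background[base])
               for base, p in probs.items() if p > 0)

def logo_heights(sites, background):
    """Per-column letter heights: proportion times column information."""
    columns = []
    for pos in range(len(sites[0])):
        counts = {base: 0 for base in background}
        for site in sites:
            counts[site[pos]] += 1
        probs = {base: n / len(sites) for base, n in counts.items()}
        info = column_information(probs, background)
        columns.append({base: p * info for base, p in probs.items()})
    return columns

# Assumption: uniform background; real backgrounds depend on genome region.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical aligned binding sites, for illustration only.
sites = ["CACGTG", "CACGTG", "CACGTG", "CATGTG"]

heights = logo_heights(sites, uniform)
# Absolute column height = information content of that column.
column_totals = [sum(col.values()) for col in heights]
# Height of all columns added together = total information.
total_information = sum(column_totals)

print(column_totals)
print(total_information)
```

In this toy alignment the fully conserved columns each carry 2 bits, while the third column (three C, one T) carries less, and within that column C is drawn three times as tall as T, matching the rule that relative letter height reflects proportion.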
Pub Date: 2024-10-08 DOI: 10.1186/s12859-024-05936-x
Xavier Bledsoe, Eric R Gamazon
Background: We present the NeuroimaGene resource as an R package designed to assist researchers in identifying genes and neurologic features relevant to psychiatric and neurological health. While recent studies have identified hundreds of genes as potential components of pathophysiology in neurologic and psychiatric disease, interpreting the physiological consequences of this variation is challenging. The integration of neuroimaging data with molecular findings is a step toward addressing this challenge. In addition to sharing associations with both molecular variation and clinical phenotypes, neuroimaging features are intrinsically informative of cognitive processes. NeuroimaGene provides a tool to understand how disease-associated genes relate to the intermediate structure of the brain.
Results: We created NeuroimaGene, a user-friendly, open access R package now available for public use. Its primary function is to identify neuroimaging-derived brain features that are impacted by genetically regulated expression of user-provided genes or gene sets. This resource can be used to (1) characterize individual genes or gene sets as relevant to the structure and function of the brain, (2) identify the region(s) of the brain or body in which expression of target gene(s) is neurologically relevant, (3) impute the brain features most impacted by user-defined gene sets such as those produced by cohort-level gene association studies, and (4) generate publication-quality, modifiable visual plots of significant findings. We demonstrate the utility of the resource by identifying neurologic correlates of stroke-associated genes derived from pre-existing analyses.
Conclusions: Integrating neurologic data as an intermediate phenotype in the pathway from genes to brain-based diagnostic phenotypes increases the interpretability of molecular studies and enriches our understanding of disease pathophysiology. The NeuroimaGene R package is designed to assist in this process and is publicly available for use.
NeuroimaGene: an R package for assessing the neurological correlates of genetically regulated gene expression. Xavier Bledsoe, Eric R Gamazon. BMC Bioinformatics 25(1):325. Published 2024-10-08. DOI: 10.1186/s12859-024-05936-x. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11463069/pdf/
Pub Date: 2024-10-08 DOI: 10.1186/s12859-024-05915-2
Bin Baek, Hyunju Lee
Background: Safe drug treatment requires an understanding of the potential side effects. Identifying the frequency of drug side effects can reduce the risks associated with drug use. However, existing computational methods for predicting drug side effect frequencies heavily depend on known drug side effect frequency information. Consequently, these methods face challenges when predicting the side effect frequencies of new drugs. Although a few methods can predict the side effect frequencies of new drugs, they exhibit unreliable performance owing to the exclusion of drug-side effect relationships.
Results: This study proposes CrossFeat, a model based on a convolutional neural network-transformer architecture with cross-feature learning that can predict the occurrence and frequency of drug side effects for new drugs, even in the absence of information regarding drug-side effect relationships. CrossFeat facilitates the concurrent learning of drug and side effect information within its transformer architecture. This simultaneous exchange of information enables drugs to learn about their associated side effects, while side effects concurrently acquire information about the respective drugs. Such bidirectional learning allows for the comprehensive integration of drug and side effect knowledge. Our five-fold cross-validation experiments demonstrated that CrossFeat outperforms existing methods in predicting side effect frequencies for new drugs without prior knowledge.
Conclusions: Our model offers a promising approach for predicting drug side effect frequencies, particularly for new drugs where prior information is limited. CrossFeat's superior performance in cross-validation experiments, along with evidence from case studies and ablation experiments, highlights its effectiveness.
Crossfeat: a transformer-based cross-feature learning model for predicting drug side effect frequency. Bin Baek, Hyunju Lee. BMC Bioinformatics 25(1):324. Published 2024-10-08. DOI: 10.1186/s12859-024-05915-2. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11459996/pdf/
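The CrossFeat abstract does not spell out how its five-fold cross-validation was partitioned. One common way to evaluate a new-drug setting, assumed here purely for illustration, is to split by drug so that every drug in a test fold is absent from the corresponding training folds. The sketch below shows that idea with the standard library only; the function name, the seed, and the toy drug and side effect labels are hypothetical and do not come from the paper.

```python
import random

def drug_wise_folds(drug_ids, n_folds=5, seed=0):
    """Split sample indices so that each drug lands in exactly one test fold."""
    drugs = sorted(set(drug_ids))
    random.Random(seed).shuffle(drugs)
    # Assign whole drugs (not individual pairs) to folds, round-robin.
    fold_of_drug = {drug: i % n_folds for i, drug in enumerate(drugs)}
    folds = []
    for k in range(n_folds):
        test_idx = [i for i, d in enumerate(drug_ids) if fold_of_drug[d] == k]
        train_idx = [i for i, d in enumerate(drug_ids) if fold_of_drug[d] != k]
        folds.append((train_idx, test_idx))
    return folds

# Toy data: 10 hypothetical drugs, each paired with 3 hypothetical side effects.
pairs = [(f"drug{d}", f"effect{e}") for d in range(10) for e in range(3)]
drug_ids = [drug for drug, _ in pairs]

folds = drug_wise_folds(drug_ids)
for train_idx, test_idx in folds:
    train_drugs = {drug_ids[i] for i in train_idx}
    test_drugs = {drug_ids[i] for i in test_idx}
    # No test drug is ever seen during training in this fold.
    print(len(train_idx), len(test_idx), train_drugs.isdisjoint(test_drugs))
```

Splitting on drug-side effect pairs at random would instead let the same drug appear in both training and test data, which measures a different and easier task than predicting for an unseen drug.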