Pub Date: 2024-10-26 DOI: 10.1186/s12859-024-05946-9
Duong H T Vo, Thomas Thorne
Background: Gene interaction networks are graphs in which nodes represent genes and edges represent functional interactions between them. These interactions can occur at multiple levels, for instance gene regulation, protein-protein interaction, or metabolic pathways. To analyse gene interaction networks at a large scale, gene co-expression network analysis is often applied to high-throughput gene expression data such as RNA sequencing data. With advances in sequencing technology, gene expression can now be measured in individual cells. Single-cell RNA sequencing (scRNAseq) provides insights into cellular development, differentiation and characteristics at the transcriptomic level. However, the high sparsity and high dimensionality of scRNAseq data pose challenges for analysis.
Results: In this study, a sparse inverse covariance matrix estimation framework for scRNAseq data is developed to capture direct functional interactions between genes. Comparative analyses on simulated scRNAseq data highlight the high performance and fast computation of Stein-type shrinkage in high-dimensional settings. Data transformation approaches also improve the performance of shrinkage methods on non-Gaussian data. Zero-inflated modelling of scRNAseq data based on a negative binomial distribution enhances shrinkage performance on zero-inflated data without affecting performance on count data that are not zero-inflated.
Conclusion: The proposed framework broadens the application of graphical models in scRNAseq analysis, offering flexibility with respect to the sparsity of count data caused by dropout events, high performance, and fast computation. The framework is implemented as a reproducible Snakemake workflow (https://github.com/calathea24/ZINBGraphicalModel) and as the R package ZINBStein (https://github.com/calathea24/ZINBStein).
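As an illustration of the shrinkage idea described in this abstract, the following minimal Python sketch shrinks a sample covariance matrix toward its diagonal and reads partial correlations off the inverse. It is not the ZINBStein implementation: the function names, the simulated Gaussian data, and the fixed shrinkage intensity `lam` are assumptions made for this example, whereas Stein-type estimators choose the intensity from the data.

```python
import numpy as np

def shrink_covariance(X, lam):
    """Shrink the sample covariance of X (cells x genes) toward its diagonal.

    lam is a hypothetical, user-supplied intensity in [0, 1]; a Stein-type
    estimator would estimate it from the data instead.
    """
    S = np.cov(X, rowvar=False)       # sample covariance, genes x genes
    target = np.diag(np.diag(S))      # diagonal target: variances only
    return (1.0 - lam) * S + lam * target

def partial_correlations(cov):
    """Partial correlations derived from the inverse covariance (precision) matrix."""
    precision = np.linalg.inv(cov)
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))         # 20 cells, 50 genes: more genes than cells
sample_cov = np.cov(X, rowvar=False)  # rank-deficient, so it cannot be inverted
sigma = shrink_covariance(X, lam=0.5)
pcor = partial_correlations(sigma)    # non-zero off-diagonal entries suggest direct links
```

With more genes than cells the sample covariance is singular, so the shrinkage step is what makes the inversion possible. Thresholding or testing the entries of `pcor` would then give a sparse network.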
Title: Shrinkage estimation of gene interaction networks in single-cell RNA sequencing data. Journal: BMC Bioinformatics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515282/pdf/
Pub Date: 2024-10-25 DOI: 10.1186/s12859-024-05953-w
Elisabeth Hellec, Flavia Nunes, Charlotte Corporeau, Alexandre Cormier
Background: Protein kinases are a diverse superfamily of proteins common to organisms across the tree of life. They are typically involved in signal transduction, allowing organisms to sense and respond to biotic or abiotic environmental factors. They have important roles in organismal physiology, including development, reproduction and acclimation to environmental stress, while their dysregulation can lead to disease, including several forms of cancer. Identifying the complement of protein kinases (the kinome) of any organism is useful for understanding its physiological capabilities, limitations and adaptations to environmental stress. The increasing availability of genomes now makes it possible to examine and compare kinomes across a broad diversity of organisms. Here we present a pipeline respecting the FAIR principles (findable, accessible, interoperable and reusable) that facilitates the search for and identification of protein kinases in a predicted proteome, and classifies them according to the groups of serine/threonine/tyrosine protein kinases present in eukaryotes.
Results: KiNext is a Nextflow pipeline that combines a number of existing bioinformatic tools to search for and classify the protein kinases of an organism in a reproducible manner, starting from a set of amino acid sequences. Conventional eukaryotic protein kinases (ePKs) and atypical protein kinases (aPKs) are identified using Hidden Markov Models (HMMs) generated from the catalytic domains of kinases. Furthermore, KiNext categorizes ePKs into the eight kinase groups by employing dedicated HMMs tailored to each group. The performance of the KiNext pipeline was validated against previously published kinomes obtained with other tools for two marine species, the Pacific oyster Crassostrea gigas and the unicellular green alga Ostreococcus tauri. KiNext outperformed previous results by finding previously unidentified kinases and by attributing a large proportion of previously unclassified kinases to a group in both species. These results demonstrate improvements in kinase identification and classification, while providing traceability and reproducibility of results in a FAIR pipeline. The default HMMs provided with KiNext are most suitable for eukaryotes, but the pipeline can be easily modified to include HMMs for other taxa of interest.
Conclusion: The KiNext pipeline enables efficient and reproducible identification of kinomes based on predicted amino acid sequences (i.e. proteomes). KiNext was designed to be easy to use, automated, portable and scalable.
Title: KiNext: a portable and scalable workflow for the identification and classification of protein kinases. Journal: BMC Bioinformatics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515245/pdf/
Pub Date: 2024-10-24 DOI: 10.1186/s12859-024-05924-1
Kyle Christian L Santiago, Anish M S Shrestha
Background: Conventional differential gene expression analysis pipelines for non-model organisms require computationally expensive transcriptome assembly. We recently proposed an alternative strategy of directly aligning RNA-seq reads to a protein database, and demonstrated drastic improvements in speed, memory usage, and accuracy in identifying differentially expressed genes.
Results: Here we report a further speed-up obtained by replacing DNA-protein alignment with quasi-mapping, making our pipeline more than 1000× faster than the assembly-based approach while remaining more accurate. We also compare quasi-mapping to other mapping techniques, and show that it is faster but at the cost of sensitivity.
Conclusion: We provide a quick-and-dirty differential gene expression analysis pipeline for non-model organisms without a reference transcriptome, which directly quasi-maps RNA-seq reads to a reference protein database, avoiding computationally expensive transcriptome assembly.
Title: DNA-protein quasi-mapping for rapid differential gene expression analysis in non-model organisms. Journal: BMC Bioinformatics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515663/pdf/
Pub Date: 2024-10-24 DOI: 10.1186/s12859-024-05959-4
Anhui Yin, Lei Chen, Bo Zhou, Yu-Dong Cai
Background: As noncoding RNAs, circular RNAs (circRNAs) can act as microRNA (miRNA) sponges due to their abundant miRNA binding sites, allowing them to regulate gene expression and influence disease development. Accurately identifying circRNA-miRNA associations (CMAs) helps in understanding complex disease mechanisms. Given that biological experiments are time-consuming and labor-intensive, alternative computational methods to predict CMAs are urgently needed.
Results: This study proposes a novel computational model named CMAGN, which incorporates several advanced computational methods, for predicting CMAs. First, similarity networks for circRNAs and miRNAs are constructed according to their sequences. A graph attention autoencoder is then applied to these networks to generate the first representations of circRNAs and miRNAs. The second representations of circRNAs and miRNAs are obtained from the CMA network via node2vec. The similarity networks of circRNAs and miRNAs are reconstructed on the basis of these new representations. Finally, network consistency projection is applied to the reconstructed similarity networks and the CMA network to generate a recommendation matrix.
Conclusion: Five-fold cross-validation of CMAGN shows that the areas under the ROC and PR curves exceed 0.96 on two widely used CMA datasets, outperforming several existing models. Additional tests support the reasonableness of the CMAGN architecture and uncover its strengths and weaknesses.
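The final step of this abstract, network consistency projection, can be sketched in a few lines of Python. The formulation below is one commonly cited version of the technique and is offered only as an assumption: the abstract does not state the exact formula CMAGN uses, and the function name, the toy association matrix `A`, and the toy similarity matrices `S_c` and `S_m` are all hypothetical.

```python
import numpy as np

def network_consistency_projection(A, S_c, S_m):
    """Score circRNA-miRNA pairs by projecting similarities onto known associations.

    A   : (n_circ, n_mir) binary association matrix
    S_c : (n_circ, n_circ) circRNA similarity matrix
    S_m : (n_mir, n_mir) miRNA similarity matrix
    """
    A = np.asarray(A, dtype=float)
    S_c = np.asarray(S_c, dtype=float)
    S_m = np.asarray(S_m, dtype=float)

    col_norm = np.linalg.norm(A, axis=0)   # one norm per miRNA column
    row_norm = np.linalg.norm(A, axis=1)   # one norm per circRNA row
    # An all-zero row or column yields a zero numerator, so dividing by 1 is safe.
    safe_col = np.where(col_norm > 0, col_norm, 1.0)
    safe_row = np.where(row_norm > 0, row_norm, 1.0)

    proj_c = (S_c @ A) / safe_col[None, :]   # circRNA-space projection
    proj_m = (A @ S_m) / safe_row[:, None]   # miRNA-space projection

    denom = (np.linalg.norm(S_c, axis=1)[:, None]
             + np.linalg.norm(S_m, axis=0)[None, :])
    return (proj_c + proj_m) / denom

# Toy data: circRNA 2 has no known associations but is similar to circRNA 0.
A = np.array([[1, 0],
              [1, 1],
              [0, 0]])
S_c = np.array([[1.0, 0.6, 0.8],
                [0.6, 1.0, 0.2],
                [0.8, 0.2, 1.0]])
S_m = np.array([[1.0, 0.3],
                [0.3, 1.0]])
scores = network_consistency_projection(A, S_c, S_m)
```

In this toy case the circRNA with no known partners still receives a non-zero score for miRNA 0, because it is similar to circRNAs that are associated with that miRNA. This is the behaviour a recommendation matrix is meant to provide.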
Title: CMAGN: circRNA-miRNA association prediction based on graph attention auto-encoder and network consistency projection. Journal: BMC Bioinformatics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515630/pdf/
Pub Date: 2024-10-24 DOI: 10.1186/s12859-024-05929-w
Yorgos M Psarellis, Seungjoon Lee, Tapomoy Bhattacharjee, Sujit S Datta, Juan M Bello-Rivas, Ioannis G Kevrekidis
Background: E. coli chemotactic motion in the presence of a chemonutrient field can be studied using, among other approaches, wet laboratory experiments or macroscale-level partial differential equations (PDEs). Bridging experimental measurements and chemotactic PDEs requires knowledge of the evolution of all underlying fields as well as initial and boundary conditions, and often necessitates strong assumptions. In this work, we propose machine learning approaches, along with ideas from the Whitney and Takens embedding theorems, to circumvent these challenges.
Results: Machine learning approaches for identifying underlying PDEs were (a) validated through the use of simulation data from established continuum models and (b) used to infer chemotactic PDEs from experimental data. Such data-driven models were surrogates either for the entire chemotactic PDE right-hand-side (black box models), or, in a more targeted fashion, just for the chemotactic term (gray box models). Furthermore, it was demonstrated that a short history of bacterial density may compensate for the missing measurements of the field of chemonutrient concentration. In fact, given reasonable conditions, such a short history of bacterial density measurements could even be used to infer chemonutrient concentration.
Conclusion: Data-driven PDEs are an important modeling tool when studying chemotaxis at the macroscale, as they can learn bacterial motility from various data sources, fidelities (here, computational models and experiments) or coordinate systems. The resulting data-driven PDEs can then be simulated to reproduce or predict computational or experimental bacterial density profile data independent of the coordinate system, approximate meaningful parameters or functional terms, and even possibly estimate the underlying (unmeasured) chemonutrient field evolution.
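The idea that a short history of bacterial density can stand in for the unmeasured chemonutrient field rests on delay embedding. The Python sketch below shows only that feature-construction step, under stated assumptions: the function name, the array layout (time along rows, spatial grid points along columns), and the toy data are hypothetical, and the learned model that would consume these features is not shown.

```python
import numpy as np

def delay_embed(snapshots, n_delays):
    """Stack each density snapshot with its n_delays successors in time.

    snapshots : (n_times, n_points) array of density profiles
    Returns an array of shape (n_times - n_delays, (n_delays + 1) * n_points),
    where each row holds n_delays + 1 consecutive snapshots, oldest first.
    """
    snapshots = np.asarray(snapshots, dtype=float)
    n_times = snapshots.shape[0]
    if not 0 <= n_delays < n_times:
        raise ValueError("n_delays must satisfy 0 <= n_delays < n_times")
    windows = [snapshots[k : n_times - n_delays + k] for k in range(n_delays + 1)]
    return np.hstack(windows)

# Toy data: 6 time steps on a 2-point spatial grid.
density = np.arange(12).reshape(6, 2)
features = delay_embed(density, n_delays=2)
```

Each row of `features` could then serve as input to a regression model for the time derivative of the density, in place of an explicit chemonutrient measurement.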
Title: Data-driven discovery of chemotactic migration of bacteria via coordinate-invariant machine learning. Journal: BMC Bioinformatics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515320/pdf/
Pub Date: 2024-10-22 DOI: 10.1186/s12859-024-05939-8
Candan Çelik, Pavol Bokes, Abhyudai Singh
Background: Stochastic modelling plays a crucial role in comprehending the dynamics of intracellular events in various biochemical systems, including gene-expression models. Cell-to-cell variability arises from the stochasticity or noise in the levels of gene products such as messenger RNA (mRNA) and protein. The sources of noise can stem from different factors, including structural elements. Recent studies have revealed that the mRNA structure can be more intricate than previously assumed.
Results: Here, we focus on the formation of stem-loops and present a reinterpretation of previous data, offering new insights. Our analysis demonstrates that stem-loops that restrict translation have the potential to reduce noise.
Conclusions: We investigate a structured/generalised version of a stochastic gene-expression model, wherein mRNA molecules can occupy one of a finite number of states and transition between them. By deriving non-trivial analytical expressions for the steady-state protein distribution, we provide two specific examples that can be readily obtained from the structured/generalised model, showcasing the model's practical applicability.
Title: Translation regulation by RNA stem-loops can reduce gene expression noise. Journal: BMC Bioinformatics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515661/pdf/
Pub Date: 2024-10-18 DOI: 10.1186/s12859-024-05951-y
Yaxun Jia, Haoyang Wang, Zhu Yuan, Lian Zhu, Zuo-Lin Xiang
Background: Relation extraction (RE) plays a crucial role in biomedical research as it is essential for uncovering complex semantic relationships between entities in textual data. Given the significance of RE in biomedical informatics and the increasing volume of literature, there is an urgent need for advanced computational models capable of accurately and efficiently extracting these relationships on a large scale.
Results: This paper proposes a novel approach, SARE, which combines stacking-based ensemble learning with attention mechanisms to enhance the performance of biomedical relation extraction. By leveraging multiple pre-trained models, SARE demonstrates improved adaptability and robustness across diverse domains. The attention mechanisms enable the model to capture and utilize key information in the text more accurately. SARE achieved performance improvements of 4.8, 8.7, and 0.8 percentage points on the PPI, DDI, and ChemProt datasets, respectively, compared to the original BERT variant and the domain-specific PubMedBERT model.
Conclusions: SARE offers a promising solution for improving the accuracy and efficiency of relation extraction tasks in biomedical research, facilitating advancements in biomedical informatics. The results suggest that combining ensemble learning with attention mechanisms is effective for extracting complex relationships from biomedical texts. Our code and data are publicly available at: https://github.com/GS233/Biomedical .
Title: Biomedical relation extraction method based on ensemble learning and attention mechanism. Journal: BMC Bioinformatics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11488084/pdf/
Pub Date: 2024-10-18 DOI: 10.1186/s12859-024-05948-7
Cezary Turek, Márton Ölbei, Tamás Stirling, Gergely Fekete, Ervin Tasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, Eszter Ari
Traditional gene set enrichment analyses are typically limited to a few ontologies and do not account for the interdependence of gene sets or terms, resulting in overcorrected p-values. To address these challenges, we introduce mulea, an R package offering comprehensive overrepresentation and functional enrichment analysis. mulea employs a progressive empirical false discovery rate (eFDR) method, specifically designed for interconnected biological data, to accurately identify significant terms within diverse ontologies. mulea expands beyond traditional tools by incorporating a wide range of ontologies, encompassing Gene Ontology, pathways, regulatory elements, genomic locations, and protein domains. This flexibility enables researchers to tailor enrichment analysis to their specific questions, such as identifying enriched transcriptional regulators in gene expression data or overrepresented protein domains in protein sets. To facilitate seamless analysis, mulea provides gene sets (in standardised GMT format) for 27 model organisms, covering 22 ontology types from 16 databases and various identifiers resulting in almost 900 files. Additionally, the muleaData ExperimentData Bioconductor package simplifies access to these pre-defined ontologies. Finally, mulea's architecture allows for easy integration of user-defined ontologies, or GMT files from external sources (e.g., MSigDB or Enrichr), expanding its applicability across diverse research areas. mulea is distributed as a CRAN R package downloadable from https://cran.r-project.org/web/packages/mulea/ and https://github.com/ELTEbioinformatics/mulea . It offers researchers a powerful and flexible toolkit for functional enrichment analysis, addressing limitations of traditional tools with its progressive eFDR and by supporting a variety of ontologies. Overall, mulea fosters the exploration of diverse biological questions across various model organisms.
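For readers unfamiliar with overrepresentation analysis, the Python sketch below computes the classical upper-tail hypergeometric p-value for a single ontology term. It illustrates only the per-term test that such analyses build on, not mulea's API or its eFDR procedure. The function name and the toy counts are assumptions made for this example.

```python
from math import comb

def overrepresentation_pvalue(universe_size, term_size, selected_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe_size : number of genes in the background
    term_size     : number of background genes annotated to the term
    selected_size : number of genes in the list under study
    overlap       : number of selected genes annotated to the term
    """
    total = comb(universe_size, selected_size)
    upper = min(term_size, selected_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Toy example: 10 background genes, 3 annotated to the term,
# 3 genes selected, all 3 of them annotated.
p_value = overrepresentation_pvalue(10, 3, 3, 3)   # 1 / C(10, 3) = 1/120
```

Because ontology terms share genes, per-term p-values like this one are not independent, which is the motivation the abstract gives for an empirical, resampling-based false discovery rate.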
mulea: An R package for enrichment analysis using multiple ontologies and empirical false discovery rate. BMC Bioinformatics. Published 2024-10-18. DOI: https://doi.org/10.1186/s12859-024-05948-7 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11490090/pdf/
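The overrepresentation analysis mentioned in the mulea abstract can be illustrated with a minimal Python sketch. It assumes the conventional tab-separated GMT layout (set name, description, member genes) and a one-sided hypergeometric test; all function names, gene identifiers, and set names are hypothetical. It is not mulea's API, and the eFDR procedure described in the abstract is not reproduced here: the p-values below are unadjusted.

```python
from math import comb


def parse_gmt(text):
    """Parse GMT text: one gene set per line, tab-separated as
    set name, description, then the member genes."""
    gene_sets = {}
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip lines without any member genes
        gene_sets[fields[0]] = set(fields[2:])
    return gene_sets


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for n genes drawn without replacement from N background
    genes, K of which belong to the gene set."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def overrepresentation(gene_sets, target, background):
    """Return an unadjusted p-value per gene set."""
    background = set(background)
    target = set(target) & background
    pvals = {}
    for name, members in gene_sets.items():
        members = members & background
        overlap = len(members & target)
        pvals[name] = hypergeom_upper_tail(
            overlap, len(members), len(target), len(background)
        )
    return pvals


# Hypothetical toy data: 20 background genes and two 5-gene sets.
background = [f"g{i}" for i in range(1, 21)]
gmt_text = (
    "SET_A\ttoy set A\tg1\tg2\tg3\tg4\tg5\n"
    "SET_B\ttoy set B\tg6\tg7\tg8\tg9\tg10\n"
)
target = ["g1", "g2", "g3", "g6"]

pvals = overrepresentation(parse_gmt(gmt_text), target, background)
for name, p in sorted(pvals.items()):
    print(f"{name}: p = {p:.4f}")
```

In this toy example SET_A shares three of its five genes with the four-gene target list and receives a small p-value (155/4845, about 0.032), whereas SET_B shares only one gene and does not. With hundreds of overlapping gene sets, such raw p-values would still need a multiple-testing correction, which is the step the eFDR method is described as addressing.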
Pub Date : 2024-10-15 DOI: 10.1186/s12859-024-05954-9
Deyan Yordanov Yosifov, Michaela Reichenzeller, Stephan Stilgenbauer, Daniel Mertens
Background: The dilution-replicate experimental design for qPCR assays is especially efficient. It is based on multiple linear regression of multiple 3-point standard curves that are derived from the experimental samples themselves and thus obviates the need for a separate standard curve produced by serial dilution of a standard. The method minimizes the total number of reactions and guarantees that Cq values are within the linear dynamic range of the dilution-replicate standard curves. However, the lack of specialized software has so far precluded the widespread use of the dilution-replicate approach.
Results: Here we present repDilPCR, the first tool that utilizes the dilution-replicate method and extends it to support multiple reference genes. repDilPCR offers extensive statistical and graphical functions that can also be used with preprocessed data (relative expression values) obtained by usual assay designs and evaluation methods. repDilPCR is designed to automate and speed up data analysis (typically less than a minute from Cq values to publication-ready plots) and automatically selects and performs appropriate statistical tests, at least for one-factor experimental designs. Nevertheless, the program also allows users to export intermediate data and perform more sophisticated analyses with external statistical software, e.g. if two-way ANOVA is necessary.
Conclusions: repDilPCR is a user-friendly tool that can contribute to more efficient planning of qPCR experiments and their robust analysis. A public web server is freely accessible at https://repdilpcr.eu without registration. The program can also be used as an R script or as a locally installed Shiny app, which can be downloaded from https://github.com/deyanyosifov/repDilPCR where the source code is also available.
repDilPCR: a tool for automated analysis of qPCR assays by the dilution-replicate method. BMC Bioinformatics. Published 2024-10-15. DOI: https://doi.org/10.1186/s12859-024-05954-9 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476982/pdf/
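The regression idea behind the dilution-replicate design can be sketched in a few lines of Python. The sketch assumes that all 3-point curves share one slope (one amplification efficiency) while each sample has its own intercept, which is one straightforward reading of "multiple linear regression of multiple 3-point standard curves". The sample names, dilution series, and noise-free Cq values are synthetic, and the code is not taken from repDilPCR.

```python
from math import log10


def fit_shared_slope(samples):
    """Least-squares fit of Cq = intercept[sample] + slope * x, where x is
    log10 of the relative concentration and the slope is shared by all
    samples. samples maps a name to a list of (x, Cq) pairs."""
    sxy = 0.0
    sxx = 0.0
    means = {}
    for name, points in samples.items():
        x_mean = sum(x for x, _ in points) / len(points)
        y_mean = sum(y for _, y in points) / len(points)
        means[name] = (x_mean, y_mean)
        # Pool the within-sample sums of squares and cross-products.
        sxy += sum((x - x_mean) * (y - y_mean) for x, y in points)
        sxx += sum((x - x_mean) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercepts = {name: ym - slope * xm for name, (xm, ym) in means.items()}
    return slope, intercepts


# Synthetic example: each sample measured undiluted, 1:4 and 1:16.
dilutions = [1.0, 0.25, 0.0625]
xs = [log10(d) for d in dilutions]
samples = {
    "control": list(zip(xs, [20.0, 22.0, 24.0])),
    "treated": list(zip(xs, [18.0, 20.0, 22.0])),
}

slope, intercepts = fit_shared_slope(samples)

# Amplification efficiency implied by the shared slope (2.0 = doubling per cycle).
efficiency = 10 ** (-1.0 / slope)

# Relative starting quantity of "treated" versus "control", read from the
# intercepts (the fitted Cq of each undiluted sample).
ratio = 10 ** ((intercepts["treated"] - intercepts["control"]) / slope)

print(f"slope = {slope:.3f}, efficiency = {efficiency:.2f}, ratio = {ratio:.2f}")
```

Because the standard curves come from the experimental samples themselves, the intercepts double as the quantification result: here a 2-cycle difference at an efficiency of 2 corresponds to a 4-fold difference in starting quantity. Real data would be noisy and would normally be normalised to one or more reference genes, which this sketch leaves out.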
Pub Date : 2024-10-15 DOI: 10.1186/s12859-024-05866-8
Pritam Chakraborty, Anjan Bandyopadhyay, Preeti Padma Sahu, Aniket Burman, Saurav Mallik, Najah Alsubaie, Mohamed Abbas, Mohammed S Alqahtani, Ben Othman Soufiene
Stroke prediction remains a critical area of research in healthcare, aiming to enhance early intervention and patient care strategies. This study investigates the efficacy of machine learning techniques, particularly principal component analysis (PCA) and a stacking ensemble method, for predicting stroke occurrences based on demographic, clinical, and lifestyle factors. We systematically varied PCA components and implemented a stacking model comprising random forest, decision tree, and K-nearest neighbors (KNN). Our findings demonstrate that setting PCA components to 16 optimally enhanced predictive accuracy, achieving a remarkable 98.6% accuracy in stroke prediction. Evaluation metrics underscored the robustness of our approach in handling class imbalance and improving model performance. Comparative analyses against traditional machine learning algorithms such as SVM, logistic regression, and Naive Bayes highlighted the superiority of our proposed method.
Predicting stroke occurrences: a stacked machine learning approach with feature selection and data preprocessing. BMC Bioinformatics. Published 2024-10-15. DOI: https://doi.org/10.1186/s12859-024-05866-8 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476080/pdf/
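The stroke abstract reports accuracy alongside unnamed evaluation metrics for class imbalance. As a small illustration of why such metrics matter, the Python sketch below computes accuracy, recall, specificity, and balanced accuracy from confusion-matrix counts. The counts are hypothetical, chosen only to show the effect of a rare positive class; they are not taken from the study, and the abstract does not state which metrics the authors used.

```python
def classification_metrics(tp, fp, tn, fn):
    """Summary metrics from confusion-matrix counts
    (positives = stroke cases, negatives = non-stroke cases)."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical cohort: 1000 patients, 50 of whom had a stroke.
# A classifier that always predicts "no stroke":
always_negative = classification_metrics(tp=0, fp=0, tn=950, fn=50)

# A hypothetical classifier that finds 40 of the 50 stroke cases
# at the cost of 30 false alarms:
detector = classification_metrics(tp=40, fp=30, tn=920, fn=10)

for label, m in [("always negative", always_negative), ("detector", detector)]:
    print(label, {k: round(v, 3) for k, v in m.items()})
```

With these counts the trivial classifier reaches 95% accuracy while detecting no stroke cases, and its balanced accuracy of 0.5 exposes this. The second classifier is only slightly more accurate (96%) but has a recall of 0.8. This is why a headline accuracy on imbalanced data is usually read together with recall or balanced accuracy.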