
BMC Bioinformatics: Latest Articles

Shrinkage estimation of gene interaction networks in single-cell RNA sequencing data.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-26 DOI: 10.1186/s12859-024-05946-9
Duong H T Vo, Thomas Thorne

Background: Gene interaction networks are graphs in which nodes represent genes and edges represent functional interactions between them. These interactions can occur at multiple levels, for instance gene regulation, protein-protein interaction, or metabolic pathways. To analyse gene interaction networks at a large scale, gene co-expression network analysis is often applied to high-throughput gene expression data such as RNA sequencing data. With advances in sequencing technology, gene expression can now be measured in individual cells. Single-cell RNA sequencing (scRNAseq) provides insights into cellular development, differentiation and characteristics at the transcriptomic level. However, the high sparsity and high dimensionality of scRNAseq data pose challenges for analysis.

Results: In this study, a sparse inverse covariance matrix estimation framework for scRNAseq data is developed to capture direct functional interactions between genes. Comparative analyses on simulated scRNAseq data highlight the high performance and fast computation of Stein-type shrinkage in high-dimensional settings. Data transformation approaches also improve the performance of shrinkage methods on non-Gaussian data. Zero-inflated modelling of scRNAseq data based on a negative binomial distribution enhances shrinkage performance on zero-inflated data without degrading it on count data that are not zero-inflated.

Conclusion: The proposed framework broadens the application of graphical models in scRNAseq analysis, offering flexibility with respect to the sparsity of count data caused by dropout events, high performance, and fast computation. The framework is implemented as a reproducible Snakemake workflow (https://github.com/calathea24/ZINBGraphicalModel) and an R package, ZINBStein (https://github.com/calathea24/ZINBStein).
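As a rough illustration of the shrinkage idea described in the Results, the following Python sketch pulls a sample covariance matrix toward its diagonal, inverts the result, and converts the precision matrix into partial correlations. It is not the authors' ZINBStein implementation: the shrinkage intensity `lam` is fixed by hand here, whereas Stein-type estimators choose it from the data, and the function names and the toy count matrix are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def covariance(data: Matrix) -> Matrix:
    """Unbiased sample covariance; rows are cells, columns are genes."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]


def shrink(cov: Matrix, lam: float) -> Matrix:
    """(1 - lam) * cov + lam * diag(cov): off-diagonals are scaled by (1 - lam)."""
    p = len(cov)
    return [[cov[i][j] if i == j else (1.0 - lam) * cov[i][j] for j in range(p)]
            for i in range(p)]


def invert(m: Matrix) -> Matrix:
    """Gauss-Jordan inversion with partial pivoting."""
    p = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(m)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def partial_correlations(precision: Matrix) -> Matrix:
    """rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) for i != j."""
    p = len(precision)
    return [[1.0 if i == j
             else -precision[i][j] / (precision[i][i] * precision[j][j]) ** 0.5
             for j in range(p)] for i in range(p)]


# Hypothetical toy data: 3 cells x 4 genes, so the raw covariance has rank < 4.
counts = [[0.0, 3.0, 1.0, 2.0],
          [4.0, 0.0, 2.0, 5.0],
          [1.0, 1.0, 0.0, 9.0]]
cov = covariance(counts)
shrunk = shrink(cov, lam=0.5)          # hand-picked intensity (assumption)
pcor = partial_correlations(invert(shrunk))
for row in pcor:
    print(["%.3f" % v for v in row])
```

With fewer cells than genes the raw sample covariance is rank-deficient and cannot be inverted directly; the shrunk matrix is positive definite as long as every gene has non-zero variance, which is why shrinkage is attractive in high-dimensional settings.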

Citations: 0
KiNext: a portable and scalable workflow for the identification and classification of protein kinases.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-25 DOI: 10.1186/s12859-024-05953-w
Elisabeth Hellec, Flavia Nunes, Charlotte Corporeau, Alexandre Cormier

Background: Protein kinases are a diverse superfamily of proteins common to organisms across the tree of life that are typically involved in signal transduction, allowing organisms to sense and respond to biotic or abiotic environmental factors. They have important roles in organismal physiology, including development, reproduction and acclimation to environmental stress, while their dysregulation can lead to disease, including several forms of cancer. Identifying the complement of protein kinases (the kinome) of any organism is useful for understanding its physiological capabilities, limitations and adaptations to environmental stress. The increasing availability of genomes now makes it possible to examine and compare kinomes across a broad diversity of organisms. Here we present a pipeline respecting the FAIR principles (findable, accessible, interoperable and reusable) that facilitates the search for and identification of protein kinases in a predicted proteome, and classifies them according to the groups of serine/threonine/tyrosine protein kinases present in eukaryotes.

Results: KiNext is a Nextflow pipeline that brings together a number of existing bioinformatic tools to search for and classify the protein kinases of an organism in a reproducible manner, starting from a set of amino acid sequences. Conventional eukaryotic protein kinases (ePKs) and atypical protein kinases (aPKs) are identified using Hidden Markov Models (HMMs) generated from the catalytic domains of kinases. Furthermore, KiNext categorizes ePKs into the eight kinase groups by employing dedicated HMMs tailored for each group. The performance of the KiNext pipeline was validated against previously published kinomes, obtained with other tools, for two marine species: the Pacific oyster Crassostrea gigas and the unicellular green alga Ostreococcus tauri. KiNext outperformed previous results by finding previously unidentified kinases and by attributing a large proportion of previously unclassified kinases to a group in both species. These results demonstrate improvements in kinase identification and classification, all while providing traceability and reproducibility of results in a FAIR pipeline. The default HMMs provided with KiNext are most suitable for eukaryotes, but the pipeline can be easily modified to include HMMs for other taxa of interest.

Conclusion: The KiNext pipeline enables efficient and reproducible identification of kinomes based on predicted amino acid sequences (i.e. proteomes). KiNext was designed to be easy to use, automated, portable and scalable.

Citations: 0
DNA-protein quasi-mapping for rapid differential gene expression analysis in non-model organisms.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-24 DOI: 10.1186/s12859-024-05924-1
Kyle Christian L Santiago, Anish M S Shrestha

Background: Conventional differential gene expression analysis pipelines for non-model organisms require computationally expensive transcriptome assembly. We recently proposed an alternative strategy of directly aligning RNA-seq reads to a protein database, and demonstrated drastic improvements in speed, memory usage, and accuracy in identifying differentially expressed genes.

Result: Here we report a further speed-up by replacing DNA-protein alignment with quasi-mapping, making our pipeline > 1000× faster than the assembly-based approach while remaining more accurate. We also compare quasi-mapping to other mapping techniques, and show that it is faster, at the cost of sensitivity.

Conclusion: We provide a quick-and-dirty differential gene expression analysis pipeline for non-model organisms without a reference transcriptome, which directly quasi-maps RNA-seq reads to a reference protein database, avoiding computationally expensive transcriptome assembly.

Citations: 0
CMAGN: circRNA-miRNA association prediction based on graph attention auto-encoder and network consistency projection.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-24 DOI: 10.1186/s12859-024-05959-4
Anhui Yin, Lei Chen, Bo Zhou, Yu-Dong Cai

Background: As noncoding RNAs, circular RNAs (circRNAs) can act as microRNA (miRNA) sponges due to their abundant miRNA binding sites, allowing them to regulate gene expression and influence disease development. Accurately identifying circRNA-miRNA associations (CMAs) helps in understanding complex disease mechanisms. Given that biological experiments are time-consuming and labor-intensive, alternative computational methods to predict CMAs are urgently needed.

Results: This study proposes a novel computational model named CMAGN, which incorporates several advanced computational methods, for predicting CMAs. First, similarity networks for circRNAs and miRNAs are constructed according to their sequences. A graph attention autoencoder is then applied to these networks to generate the first representations of circRNAs and miRNAs. The second representations of circRNAs and miRNAs are obtained from the CMA network via node2vec. The similarity networks of circRNAs and miRNAs are reconstructed on the basis of these new representations. Finally, network consistency projection is applied to the reconstructed similarity networks and the CMA network to generate a recommendation matrix.

Conclusion: Five-fold cross-validation of CMAGN shows that the areas under the ROC and PR curves exceed 0.96 on two widely used CMA datasets, outperforming several existing models. Additional tests support the soundness of the CMAGN architecture and uncover its strengths and weaknesses.
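The final step described in the Results, network consistency projection, can be sketched in Python as follows. This uses a commonly cited formulation of the projection score: each circRNA similarity row is projected onto the known association column, each association row is projected onto the miRNA similarity column, and the sum is normalised by the lengths of the two similarity vectors. The abstract does not give the formula, so the exact normalisation, the function names, and the toy matrices should be read as assumptions rather than as CMAGN's code.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def network_consistency_projection(sim_circ: Matrix,
                                   sim_mirna: Matrix,
                                   assoc: Matrix) -> Matrix:
    """Score every (circRNA i, miRNA j) pair from two similarity networks
    and a binary association matrix (rows: circRNAs, columns: miRNAs)."""
    n_circ, n_mirna = len(assoc), len(assoc[0])
    scores = [[0.0] * n_mirna for _ in range(n_circ)]
    for i in range(n_circ):
        row_i = assoc[i]
        for j in range(n_mirna):
            col_j = [assoc[k][j] for k in range(n_circ)]
            sim_col_j = [sim_mirna[k][j] for k in range(n_mirna)]
            # Projection in circRNA space; zero if miRNA j has no known partner.
            col_len = _norm(col_j)
            proj_circ = _dot(sim_circ[i], col_j) / col_len if col_len > 0 else 0.0
            # Projection in miRNA space; zero if circRNA i has no known partner.
            row_len = _norm(row_i)
            proj_mirna = _dot(row_i, sim_col_j) / row_len if row_len > 0 else 0.0
            denom = _norm(sim_circ[i]) + _norm(sim_col_j)
            scores[i][j] = (proj_circ + proj_mirna) / denom if denom > 0 else 0.0
    return scores


# Hypothetical toy example: 2 circRNAs, 2 miRNAs, one known association.
sim_circ = [[1.0, 0.5], [0.5, 1.0]]
sim_mirna = [[1.0, 0.0], [0.0, 1.0]]
assoc = [[1.0, 0.0], [0.0, 0.0]]
scores = network_consistency_projection(sim_circ, sim_mirna, assoc)
for row in scores:
    print(["%.3f" % s for s in row])
```

In this toy case the unlabelled pair (circRNA 1, miRNA 0) receives a non-zero score only because circRNA 1 is similar to circRNA 0, which is the behaviour a recommendation matrix is meant to capture.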

Citations: 0
Data-driven discovery of chemotactic migration of bacteria via coordinate-invariant machine learning.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-24 DOI: 10.1186/s12859-024-05929-w
Yorgos M Psarellis, Seungjoon Lee, Tapomoy Bhattacharjee, Sujit S Datta, Juan M Bello-Rivas, Ioannis G Kevrekidis

Background: E. coli chemotactic motion in the presence of a chemonutrient field can be studied using wet laboratory experiments or macroscale partial differential equations (PDEs), among other approaches. Bridging experimental measurements and chemotactic PDEs requires knowledge of the evolution of all underlying fields and of the initial and boundary conditions, and often necessitates strong assumptions. In this work, we propose machine learning approaches, along with ideas from the Whitney and Takens embedding theorems, to circumvent these challenges.

Results: Machine learning approaches for identifying underlying PDEs were (a) validated using simulation data from established continuum models and (b) used to infer chemotactic PDEs from experimental data. Such data-driven models were surrogates either for the entire right-hand side of the chemotactic PDE (black box models) or, in a more targeted fashion, just for the chemotactic term (gray box models). Furthermore, it was demonstrated that a short history of bacterial density may compensate for missing measurements of the chemonutrient concentration field. In fact, given reasonable conditions, such a short history of bacterial density measurements could even be used to infer chemonutrient concentration.

Conclusion: Data-driven PDEs are an important modeling tool when studying chemotaxis at the macroscale, as they can learn bacterial motility from various data sources, fidelities (here, computational models and experiments) or coordinate systems. The resulting data-driven PDEs can then be simulated to reproduce or predict computational or experimental bacterial density profiles independently of the coordinate system, to approximate meaningful parameters or functional terms, and possibly even to estimate the evolution of the underlying (unmeasured) chemonutrient field.
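The Results mention that a short history of bacterial density can stand in for the unmeasured chemonutrient field, an idea that rests on delay embedding in the spirit of the Takens theorem. A minimal Python sketch of building such delay vectors from a density time series at a single spatial location is shown below. The function name, the lag, and the sample values are hypothetical, and the learned model that would consume these vectors is not shown.

```python
from typing import List, Sequence


def delay_embed(series: Sequence[float], dim: int, lag: int = 1) -> List[List[float]]:
    """Return delay vectors [x(t - (dim-1)*lag), ..., x(t - lag), x(t)].

    One vector is produced for every time index t that has a full history;
    a series shorter than the required history yields an empty list.
    """
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    span = (dim - 1) * lag
    return [[series[t - span + k * lag] for k in range(dim)]
            for t in range(span, len(series))]


# Hypothetical bacterial density readings at one grid point over five time steps.
density = [0.10, 0.12, 0.15, 0.19, 0.24]
history = delay_embed(density, dim=3)
for vec in history:
    print(vec)
```

Each delay vector could then serve as the input features of a surrogate model for the PDE right-hand side in place of a direct chemonutrient measurement.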

Citations: 0
Translation regulation by RNA stem-loops can reduce gene expression noise.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-22 DOI: 10.1186/s12859-024-05939-8
Candan Çelik, Pavol Bokes, Abhyudai Singh

Background: Stochastic modelling plays a crucial role in comprehending the dynamics of intracellular events in various biochemical systems, including gene-expression models. Cell-to-cell variability arises from the stochasticity or noise in the levels of gene products such as messenger RNA (mRNA) and protein. The sources of noise can stem from different factors, including structural elements. Recent studies have revealed that the mRNA structure can be more intricate than previously assumed.

Results: Here, we focus on the formation of stem-loops and present a reinterpretation of previous data, offering new insights. Our analysis demonstrates that stem-loops that restrict translation have the potential to reduce noise.

Conclusions: We investigate a structured/generalised version of a stochastic gene-expression model, wherein each mRNA molecule occupies one of a finite number of states and can transition between them. By characterising the steady-state protein distribution and deriving non-trivial analytical expressions for it, we provide two specific examples that can be readily obtained from the structured/generalised model, showcasing the model's practical applicability.

Citations: 0
Biomedical relation extraction method based on ensemble learning and attention mechanism.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-18 DOI: 10.1186/s12859-024-05951-y
Yaxun Jia, Haoyang Wang, Zhu Yuan, Lian Zhu, Zuo-Lin Xiang

Background: Relation extraction (RE) plays a crucial role in biomedical research as it is essential for uncovering complex semantic relationships between entities in textual data. Given the significance of RE in biomedical informatics and the increasing volume of literature, there is an urgent need for advanced computational models capable of accurately and efficiently extracting these relationships on a large scale.

Results: This paper proposes a novel approach, SARE, which combines stacking-based ensemble learning with attention mechanisms to enhance the performance of biomedical relation extraction. By leveraging multiple pre-trained models, SARE demonstrates improved adaptability and robustness across diverse domains. The attention mechanisms enable the model to capture and utilize key information in the text more accurately. SARE achieved performance improvements of 4.8, 8.7, and 0.8 percentage points on the PPI, DDI, and ChemProt datasets, respectively, compared to the original BERT variant and the domain-specific PubMedBERT model.

Conclusions: SARE offers a promising solution for improving the accuracy and efficiency of relation extraction tasks in biomedical research, facilitating advancements in biomedical informatics. The results suggest that combining ensemble learning with attention mechanisms is effective for extracting complex relationships from biomedical texts. Our code and data are publicly available at: https://github.com/GS233/Biomedical .

mulea: An R package for enrichment analysis using multiple ontologies and empirical false discovery rate.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-10-18 DOI: 10.1186/s12859-024-05948-7
Cezary Turek, Márton Ölbei, Tamás Stirling, Gergely Fekete, Ervin Tasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, Eszter Ari

Traditional gene set enrichment analyses are typically limited to a few ontologies and do not account for the interdependence of gene sets or terms, resulting in overcorrected p-values. To address these challenges, we introduce mulea, an R package offering comprehensive overrepresentation and functional enrichment analysis. mulea employs a progressive empirical false discovery rate (eFDR) method, specifically designed for interconnected biological data, to accurately identify significant terms within diverse ontologies. mulea expands beyond traditional tools by incorporating a wide range of ontologies, encompassing Gene Ontology, pathways, regulatory elements, genomic locations, and protein domains. This flexibility enables researchers to tailor enrichment analysis to their specific questions, such as identifying enriched transcriptional regulators in gene expression data or overrepresented protein domains in protein sets. To facilitate seamless analysis, mulea provides gene sets (in standardised GMT format) for 27 model organisms, covering 22 ontology types from 16 databases and various identifiers resulting in almost 900 files. Additionally, the muleaData ExperimentData Bioconductor package simplifies access to these pre-defined ontologies. Finally, mulea's architecture allows for easy integration of user-defined ontologies, or GMT files from external sources (e.g., MSigDB or Enrichr), expanding its applicability across diverse research areas. mulea is distributed as a CRAN R package downloadable from https://cran.r-project.org/web/packages/mulea/ and https://github.com/ELTEbioinformatics/mulea . It offers researchers a powerful and flexible toolkit for functional enrichment analysis, addressing limitations of traditional tools with its progressive eFDR and by supporting a variety of ontologies. Overall, mulea fosters the exploration of diverse biological questions across various model organisms.
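
The overrepresentation analysis mentioned above is commonly based on a one-sided hypergeometric test applied to gene sets read from GMT files. The following is a minimal Python sketch of that general idea, not mulea's own API or its eFDR procedure: the function names `parse_gmt` and `ora_p_value`, the term names, and the gene identifiers are all hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def parse_gmt(text):
    """Parse GMT text: one gene set per line, tab-separated as
    term ID, description, then the member genes."""
    gene_sets = {}
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip lines that carry no genes
        gene_sets[fields[0]] = set(fields[2:])
    return gene_sets


def ora_p_value(gene_set, targets, background):
    """One-sided hypergeometric p-value P(X >= k), where k is the number
    of target genes that fall inside the gene set."""
    gene_set = gene_set & background
    targets = targets & background
    n_background = len(background)
    n_set = len(gene_set)
    n_targets = len(targets)
    k = len(gene_set & targets)
    upper = min(n_set, n_targets)
    favourable = sum(
        comb(n_set, i) * comb(n_background - n_set, n_targets - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_background, n_targets)


# Hypothetical toy data: 20 background genes, two made-up terms.
gmt_text = "\n".join([
    "\t".join(["TERM_A", "hypothetical term A", "g1", "g2", "g3", "g4", "g5"]),
    "\t".join(["TERM_B", "hypothetical term B", "g6", "g7", "g8", "g9", "g10"]),
])
background = {f"g{i}" for i in range(1, 21)}
targets = {"g1", "g2", "g3", "g4", "g10"}

p_values = {
    term: ora_p_value(genes, targets, background)
    for term, genes in parse_gmt(gmt_text).items()
}
for term, p in p_values.items():
    print(term, round(p, 4))
```

In this toy input, four of the five target genes belong to TERM_A, so its p-value is small, while TERM_B shares only one gene with the targets. In a real analysis the raw p-values would still need correction across all tested terms, which is the step the eFDR method addresses.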

repDilPCR: a tool for automated analysis of qPCR assays by the dilution-replicate method.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-10-15 DOI: 10.1186/s12859-024-05954-9
Deyan Yordanov Yosifov, Michaela Reichenzeller, Stephan Stilgenbauer, Daniel Mertens

Background: The dilution-replicate experimental design for qPCR assays is especially efficient. It is based on multiple linear regression of multiple 3-point standard curves that are derived from the experimental samples themselves and thus obviates the need for a separate standard curve produced by serial dilution of a standard. The method minimizes the total number of reactions and guarantees that Cq values are within the linear dynamic range of the dilution-replicate standard curves. However, the lack of specialized software has so far precluded the widespread use of the dilution-replicate approach.

Results: Here we present repDilPCR, the first tool that utilizes the dilution-replicate method and extends it by adding the possibility to use multiple reference genes. repDilPCR offers extensive statistical and graphical functions that can also be used with preprocessed data (relative expression values) obtained by usual assay designs and evaluation methods. repDilPCR has been designed to automate and speed up data analysis (typically less than a minute from Cq values to publication-ready plots), and it automatically selects and performs appropriate statistical tests, at least in the case of one-factor experimental designs. Nevertheless, the program also allows users to export intermediate data and perform more sophisticated analyses with external statistical software, e.g. if two-way ANOVA is necessary.

Conclusions: repDilPCR is a user-friendly tool that can contribute to more efficient planning of qPCR experiments and their robust analysis. A public web server is freely accessible at https://repdilpcr.eu without registration. The program can also be used as an R script or as a locally installed Shiny app, which can be downloaded from https://github.com/deyanyosifov/repDilPCR where the source code is also available.
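
The regression idea described in the Background can be sketched in a few lines. The Python example below is an illustration under stated assumptions, not repDilPCR's implementation: it fits one shared slope and a separate intercept per sample to 3-point dilution series by ordinary least squares, then converts the intercept difference into a relative quantity. The function names, the sample names, the 5-fold dilution steps, and the noise-free synthetic Cq values are all hypothetical.

```python
from math import log10


def fit_shared_slope(curves):
    """Least-squares fit of Cq = intercept[sample] + slope * log10(dilution),
    with one slope shared by all samples.

    curves maps a sample name to a list of (log10_dilution, Cq) points."""
    sxy = 0.0
    sxx = 0.0
    means = {}
    for sample, points in curves.items():
        mean_x = sum(x for x, _ in points) / len(points)
        mean_y = sum(y for _, y in points) / len(points)
        means[sample] = (mean_x, mean_y)
        # Centre each sample on its own mean so intercepts drop out.
        sxy += sum((x - mean_x) * (y - mean_y) for x, y in points)
        sxx += sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercepts = {s: my - slope * mx for s, (mx, my) in means.items()}
    return slope, intercepts


def relative_quantity(intercepts, slope, sample, calibrator):
    """Template amount of `sample` relative to `calibrator`, derived from
    the difference of their intercepts along the shared slope."""
    return 10 ** ((intercepts[sample] - intercepts[calibrator]) / slope)


# Synthetic, noise-free data assuming perfect doubling per cycle.
true_slope = -1 / log10(2)          # about -3.32 cycles per 10-fold dilution
dilutions = [1, 1 / 5, 1 / 25]      # hypothetical 3-point series


def simulate(intercept):
    return [(log10(d), intercept + true_slope * log10(d)) for d in dilutions]


curves = {"control": simulate(20.0), "treated": simulate(22.0)}
slope, intercepts = fit_shared_slope(curves)
efficiency = 10 ** (-1 / slope)     # amplification factor per cycle
ratio = relative_quantity(intercepts, slope, "treated", "control")
print(round(slope, 3), round(efficiency, 3), round(ratio, 3))
```

Because the synthetic "treated" curve sits two cycles later than the "control" curve at perfect efficiency, the recovered ratio is one quarter. With real data the points carry noise, and sharing the slope across samples is what lets each 3-point series contribute to the efficiency estimate.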

Predicting stroke occurrences: a stacked machine learning approach with feature selection and data preprocessing.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-10-15 DOI: 10.1186/s12859-024-05866-8
Pritam Chakraborty, Anjan Bandyopadhyay, Preeti Padma Sahu, Aniket Burman, Saurav Mallik, Najah Alsubaie, Mohamed Abbas, Mohammed S Alqahtani, Ben Othman Soufiene

Stroke prediction remains a critical area of research in healthcare, aiming to enhance early intervention and patient care strategies. This study investigates the efficacy of machine learning techniques, particularly principal component analysis (PCA) and a stacking ensemble method, for predicting stroke occurrences based on demographic, clinical, and lifestyle factors. We systematically varied the number of PCA components and implemented a stacking model comprising random forest, decision tree, and K-nearest neighbors (KNN). Our findings demonstrate that setting the number of PCA components to 16 gave the best predictive performance, achieving 98.6% accuracy in stroke prediction. Evaluation metrics underscored the robustness of our approach in handling class imbalance and improving model performance. Comparative analyses against traditional machine learning algorithms such as SVM, logistic regression, and Naive Bayes highlighted the superiority of our proposed method.
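
Since the abstract refers to evaluation metrics under class imbalance, a small worked example may help show how such metrics relate to accuracy. The Python sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are invented for illustration and are not taken from the study, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix
    counts (true/false positives and negatives)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented counts: 50 positive cases among 1000 records (5% prevalence).
metrics = classification_metrics(tp=40, fp=20, fn=10, tn=930)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts the accuracy is 0.970 while recall is 0.800 and F1 is about 0.727, which illustrates why minority-class metrics are usually reported alongside accuracy when positive cases are rare.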
