
BMC Bioinformatics: Latest Articles

The evaluation of transcription factor binding site prediction tools in human and Arabidopsis genomes.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-12-02, DOI: 10.1186/s12859-024-05995-0
Dinithi V Wanniarachchi, Sameera Viswakula, Anushka M Wickramasuriya

Background: The precise prediction of transcription factor binding sites (TFBSs) is pivotal for unraveling the gene regulatory networks underlying biological processes. While numerous tools have emerged for in silico TFBS prediction in recent years, the evolving landscape of computational biology necessitates thorough assessments of tool performance to ensure accuracy and reliability. Only a limited number of studies have been conducted to evaluate the performance of TFBS prediction tools comprehensively. Thus, the present study focused on assessing twelve widely used TFBS prediction tools and four de novo motif discovery tools using a benchmark dataset comprising real, generic, Markov, and negative sequences. TFBSs of Arabidopsis thaliana and Homo sapiens genomes downloaded from the JASPAR database were implanted in these sequences and the performance of tools was evaluated using several statistical parameters at different overlap percentages between the lengths of known and predicted binding sites.

Results: Overall, the Multiple Cluster Alignment and Search Tool (MCAST) emerged as the best TFBS prediction tool, followed by Find Individual Motif Occurrences (FIMO) and MOtif Occurrence Detection Suite (MOODS). In addition, MotEvo and Dinucleotide Weight Tensor Toolbox (DWT-toolbox) demonstrated the highest sensitivity in identifying TFBSs at 90% and 80% overlap. Further, MCAST and DWT-toolbox showed the highest sensitivity across all three data types: real, generic, and Markov. Among the de novo motif discovery tools, the Multiple Em for Motif Elicitation (MEME) emerged as the best performer. An analysis of the promoter regions of genes involved in the anthocyanin biosynthesis pathway in plants and the pentose phosphate pathway in humans, using the three best-performing tools, revealed considerable variation among the top 20 motifs identified by these tools.

Conclusion: The findings of this study lay a robust groundwork for selecting optimal TFBS prediction tools for future research. Given the variability observed in tool performance, employing multiple tools for identifying TFBSs in a set of sequences is highly recommended. In addition, further studies are recommended to develop an integrated toolbox that incorporates TFBS prediction or motif discovery tools, aiming to streamline result precision and accuracy.
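The abstract above scores tools at different overlap percentages between known and predicted binding sites. As a rough illustration of what such a criterion can look like, the sketch below counts a known site as recovered when some prediction covers at least a given fraction of its length. The interval representation, the function names, and the choice to normalize by the known site's length are assumptions made for this example, not details taken from the study.

```python
from typing import List, Tuple

Site = Tuple[int, int]  # half-open interval [start, end) on one sequence (assumed layout)


def overlap_fraction(known: Site, predicted: Site) -> float:
    """Fraction of the known site's length that the predicted site covers."""
    shared = min(known[1], predicted[1]) - max(known[0], predicted[0])
    return max(shared, 0) / (known[1] - known[0])


def sensitivity(known_sites: List[Site], predicted_sites: List[Site],
                min_overlap: float) -> float:
    """Share of known sites recovered by at least one prediction."""
    hits = sum(
        any(overlap_fraction(k, p) >= min_overlap for p in predicted_sites)
        for k in known_sites
    )
    return hits / len(known_sites)


# Hypothetical toy data: two implanted sites and two predictions.
known = [(100, 110), (200, 212)]
predicted = [(101, 111), (205, 215)]
print(sensitivity(known, predicted, 0.9))  # 0.5: only the first site is 90% covered
print(sensitivity(known, predicted, 0.5))  # 1.0: both sites are at least 50% covered
```

Tightening the threshold from 50% to 90% halves the sensitivity on this toy input, which is why results reported at one overlap level are not directly comparable with those at another.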

Citations: 0
Single-character insertion-deletion model preserves long indels in ancestral sequence reconstruction.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-12-02, DOI: 10.1186/s12859-024-05986-1
Gholamhossein Jowkar, Jūlija Pečerska, Manuel Gil, Maria Anisimova

Insertions and deletions (indels) play a significant role in genome evolution across species. Realistic modelling of indel evolution is challenging and is still an open research question. Several attempts have been made to explicitly model multi-character (long) indels, such as TKF92, by relaxing the site independence assumption and introducing fragments. However, these methods are computationally expensive. On the other hand, the Poisson Indel Process (PIP) assumes site independence but allows one to infer single-character indels on the phylogenetic tree, distinguishing insertions from deletions. PIP's marginal likelihood computation has linear time complexity, enabling ancestral sequence reconstruction (ASR) with indels in linear time. Recently, we developed ARPIP, an ASR method using PIP, capable of inferring indel events with explicit evolutionary interpretations. Here, we investigate the effect of the single-character indel assumption on reconstructed ancestral sequences on mammalian protein orthologs and on simulated data. We show that ARPIP's ancestral estimates preserve the gap length distribution observed in the input alignment. In mammalian proteins the lengths of inserted segments appear to be substantially longer compared to deleted segments. Further, we confirm the well-established deletion bias observed in real data. To date, ARPIP is the only ancestral reconstruction method that explicitly models insertion and deletion events over time. Given a good quality input alignment, it can capture ancestral long indel events on the phylogeny.
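The claim that ancestral estimates preserve the gap length distribution of the input alignment presupposes a way to tabulate gap lengths. A minimal sketch of that tabulation follows; it simply counts runs of consecutive gap characters per row. The function name, the `-` gap symbol, and the toy alignment are assumptions for illustration, and this is not ARPIP code.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List


def gap_length_distribution(alignment: List[str], gap: str = "-") -> Dict[int, int]:
    """Count runs of consecutive gap characters, keyed by run length."""
    counts: Counter = Counter()
    for row in alignment:
        # groupby splits each row into maximal runs of gap / non-gap characters.
        for is_gap, run in groupby(row, key=lambda ch: ch == gap):
            if is_gap:
                counts[len(list(run))] += 1
    return dict(counts)


# Hypothetical three-row alignment.
toy = ["MK--LV---A", "MKTTLV-GQA", "M---LVAGQA"]
print(gap_length_distribution(toy))
```

Comparing this table for the input alignment with the same table for the reconstructed ancestral rows is one simple way to check whether long gaps survive reconstruction.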

Citations: 0
A novel phenotype imputation method with copula model.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-11-30, DOI: 10.1186/s12859-024-05990-5
Jianjun Zhang, Jane Zizhen Zhao, Samantha Gonzales, Xuexia Wang, Qiuying Sha

Background: Jointly analyzing multiple phenotypes/traits may increase power in genetic association studies by aggregating weak genetic effects. The chance that at least one phenotype is missing increases exponentially as the number of phenotypes increases, especially for real datasets. It is a common practice to discard individuals with missing phenotypes or phenotypes with a large proportion of missing values. Such a discarding method may lead to a loss of power or even an insufficient sample size for analysis. To our knowledge, many existing phenotype imputing methods are built on multivariate normal assumptions for analysis. Violation of these assumptions may lead to inflated type I errors or even loss of power in some cases. To overcome these limitations, we propose a novel phenotype imputation method based on a new Gaussian copula model with three different loss functions to address the issue of missing phenotypes.

Results: In a variety of simulations and a real genetic association study for lung function, we show that our method outperforms existing methods and can also increase the power of the association test when compared to other comparable phenotype imputation methods. The proposed method is implemented in an R package available at https://github.com/jane-zizhen-zhao/CopulaPhenoImpute1.0.

Conclusions: We propose a novel phenotype imputation method with a new Gaussian copula model based on three loss functions. Results of the simulation studies and real data analyses illustrate that the proposed method outperforms comparable methods.
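A Gaussian copula separates each variable's marginal distribution from the dependence between variables, which is why it can relax the multivariate normal assumption mentioned above. One common first step in such models is to map every phenotype to normal scores through its ranks. The sketch below shows only that generic step; the function name and toy values are hypothetical, and it does not reproduce the authors' model, loss functions, or R package.

```python
from statistics import NormalDist
from typing import List, Optional


def normal_scores(values: List[Optional[float]]) -> List[Optional[float]]:
    """Map observed values to standard-normal scores via their ranks.

    Missing entries (None) stay None. Ties are not handled specially.
    """
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(observed)
    scores: List[Optional[float]] = [None] * len(values)
    for rank, (_, index) in enumerate(observed, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1).
        scores[index] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


# Hypothetical skewed phenotype with one missing measurement.
skewed = [0.1, None, 25.0, 0.4]
print(normal_scores(skewed))
```

After this transform the strongly skewed value 25.0 sits at the same distance from the centre as 0.1, so a downstream model that assumes normality sees well-behaved margins while the missing entry remains available for imputation.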

Citations: 0
Enhanced prediction of hemolytic activity in antimicrobial peptides using deep learning-based sequence analysis.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-11-27, DOI: 10.1186/s12859-024-05983-4
Ibrahim Abdelbaky, Mohamed Elhakeem, Hilal Tayara, Elsayed Badr, Mustafa Abdul Salam

Antimicrobial peptides (AMPs) are a promising class of antimicrobial drugs due to their broad-spectrum activity against microorganisms. However, their clinical application is limited by their potential to cause hemolysis, the destruction of red blood cells. To address this issue, we propose a deep learning model based on convolutional neural networks (CNNs) for predicting the hemolytic activity of AMPs. Peptide sequences are represented using one-hot encoding, and the CNN architecture consists of multiple convolutional and fully connected layers. The model was trained on six different datasets: HemoPI-1, HemoPI-2, HemoPI-3, RNN-Hem, Hlppredfuse, and AMP-Combined, achieving Matthews correlation coefficients of 0.9274, 0.5614, 0.6051, 0.6142, 0.8799, and 0.7484, respectively. Our model outperforms previously reported methods and can facilitate the development of novel AMPs with reduced hemolytic activity, which is crucial for their therapeutic use in treating bacterial infections.
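Two ingredients of the abstract above are easy to make concrete: one-hot encoding of a peptide and the Matthews correlation coefficient used to report performance. The sketch below is a plain-Python illustration under assumed conventions (alphabet order, zero padding to a fixed length, and the function names are all choices made here); the CNN itself is not shown.

```python
from math import sqrt
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues; column order is an assumption


def one_hot_encode(peptide: str, max_len: int) -> List[List[int]]:
    """Return a max_len x 20 matrix; rows past the peptide end stay all-zero."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for position, residue in enumerate(peptide[:max_len].upper()):
        column = AMINO_ACIDS.find(residue)
        if column >= 0:  # non-standard residues are left as all-zero rows
            matrix[position][column] = 1
    return matrix


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


encoded = one_hot_encode("GIGK", max_len=6)
print([sum(row) for row in encoded])               # [1, 1, 1, 1, 0, 0]
print(matthews_corrcoef(tp=4, tn=4, fp=1, fn=1))   # 0.6
```

MCC ranges from -1 to 1 and stays informative on imbalanced data, which helps explain why the same architecture can score very differently across the six datasets.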

Citations: 0
TreeWave: command line tool for alignment-free phylogeny reconstruction based on graphical representation of DNA sequences and genomic signal processing.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-11-27, DOI: 10.1186/s12859-024-05992-3
Nasma Boumajdi, Houda Bendani, Lahcen Belyamani, Azeddine Ibrahimi

Background: Genomic sequence similarity comparison is a crucial research area in bioinformatics. Multiple Sequence Alignment (MSA) is the basic technique used to identify regions of similarity between sequences. Although MSA tools are widely used and highly accurate, they are often limited by computational complexity and by inaccuracies when handling highly divergent sequences, which has led to the development of alignment-free (AF) algorithms.

Results: This paper presents TreeWave, a novel AF approach based on frequency chaos game representation and discrete wavelet transform of sequences for phylogeny inference. We validate our method on various genomic datasets such as complete virus genome sequences, bacteria genome sequences, human mitochondrial genome sequences, and rRNA gene sequences. Compared to classical methods, our tool demonstrates a significant reduction in running time, especially when analyzing large datasets. The resulting phylogenetic trees show that TreeWave has similar classification accuracy to the classical MSA methods based on the normalized Robinson-Foulds distances and Baker's Gamma coefficients.

Conclusions: TreeWave is an open source and user-friendly command line tool for phylogeny reconstruction. It is a faster and more scalable tool that prioritizes computational efficiency while maintaining accuracy. TreeWave is freely available at https://github.com/nasmaB/TreeWave.
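A frequency chaos game representation (FCGR) turns a DNA sequence into a fixed-size grid of k-mer counts, which is what makes alignment-free comparison possible. The sketch below illustrates the general idea only: the corner assignment, the bit ordering within a k-mer, and the function name are assumptions for this example and may differ from TreeWave's implementation, and the subsequent discrete wavelet transform is not shown.

```python
from typing import List

# Corner assignment for the chaos game square (one common convention; an assumption here).
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence: str, k: int) -> List[List[int]]:
    """Count the k-mers of a sequence into a 2^k x 2^k grid."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in CORNER for base in kmer):
            continue  # skip k-mers containing N or other ambiguity codes
        x = y = 0
        for base in kmer:
            bit_x, bit_y = CORNER[base]
            # Each base contributes one binary digit to each coordinate.
            x = 2 * x + bit_x
            y = 2 * y + bit_y
        grid[y][x] += 1
    return grid


for row in fcgr("ACGTAC", k=2):
    print(row)
```

Because every sequence maps to a grid of the same shape regardless of its length, two genomes can be compared cell by cell (or after a signal transform) without ever aligning them.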

Citations: 0
Human limits in machine learning: prediction of potato yield and disease using soil microbiome data.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-11-26, DOI: 10.1186/s12859-024-05977-2
Rosa Aghdam, Xudong Tang, Shan Shan, Richard Lankau, Claudia Solís-Lemus

Background: The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide one of the first comprehensive investigations into the predictive potential of machine learning models for understanding the connections between soil and biological phenotypes. We investigate an integrative framework performing accurate machine learning-based prediction of plant performance from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network.

Results: Prediction improves when we add environmental features, such as soil properties and microbial density, along with microbiome data. Different preprocessing strategies show that human decisions significantly impact predictive performance. We show that the naive total sum scaling normalization that is commonly used in microbiome research is one of the optimal strategies to maximize predictive power. Also, we find that accurately defined labels are more important than normalization, taxonomic level, or model characteristics. ML performance is limited when humans cannot classify samples accurately. Lastly, we provide domain scientists with a full model selection decision tree to identify the human choices that optimize model prediction power.

Conclusions: Our study highlights the importance of incorporating diverse environmental features and careful data preprocessing in enhancing the predictive power of machine learning models for soil and biological phenotype connections. This approach can significantly contribute to advancing agricultural practices and soil health management.
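Total sum scaling, the normalization singled out in the results above, divides each sample's taxon counts by that sample's total so that every sample becomes a vector of relative abundances. A minimal sketch follows; the function name, the samples-by-taxa layout, and the handling of all-zero samples are assumptions made for this example.

```python
from typing import List


def total_sum_scaling(counts: List[List[float]]) -> List[List[float]]:
    """Divide each sample's taxon counts by that sample's total."""
    normalized = []
    for sample in counts:
        total = sum(sample)
        # An all-zero sample has no defined proportions; keep it as zeros.
        normalized.append([c / total if total else 0.0 for c in sample])
    return normalized


# Hypothetical count table: rows are samples, columns are taxa.
otu_counts = [[10, 30, 60], [0, 5, 15], [0, 0, 0]]
print(total_sum_scaling(otu_counts))
# [[0.1, 0.3, 0.6], [0.0, 0.25, 0.75], [0.0, 0.0, 0.0]]
```

The transform removes differences in sequencing depth between samples, at the cost of making the features compositional (each non-empty row sums to one).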

Citations: 0
Plaseval: a framework for comparing and evaluating plasmid detection tools.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-11-26, DOI: 10.1186/s12859-024-05941-0
Aniket Mane, Haley Sanderson, Aaron P White, Rahat Zaheer, Robert Beiko, Cédric Chauve

Background: Plasmids play a major role in the transfer of antimicrobial resistance (AMR) genes among bacteria via horizontal gene transfer. The identification of plasmids in short-read assemblies is a challenging problem and a very active research area. Plasmid binning aims at detecting, in a draft genome assembly, groups (bins) of contigs likely to originate from the same plasmid. Several methods for plasmid binning have been developed recently, such as PlasBin-flow, HyAsP, gplas, MOB-suite, and plasmidSPAdes. This motivates the problem of evaluating the performances of plasmid binning methods, either against a given ground truth or between them.

Results: We describe PlasEval, a novel method aimed at comparing the results of plasmid binning tools. PlasEval computes a dissimilarity measure between two sets of plasmid bins, which can originate either from two plasmid binning tools or from a plasmid binning tool and a ground truth set of plasmid bins. The PlasEval dissimilarity accounts for the contig content of plasmid bins and the length of contigs, and is repeat-aware. Moreover, the dissimilarity score computed by PlasEval is broken down into several parts, which makes it possible to understand qualitative differences between the compared sets of plasmid bins. We illustrate the use of PlasEval by benchmarking four recently developed plasmid binning tools (PlasBin-flow, HyAsP, gplas, and MOB-recon) on a data set of 53 E. coli bacterial genomes.

Conclusion: Analysis of the results of plasmid binning methods using PlasEval shows that their behaviour varies significantly. PlasEval can be used to decide which specific plasmid binning method should be used for a specific dataset. The disagreement between different methods also suggests that the problem of plasmid binning on short-read contigs requires further research. We believe that PlasEval can prove to be an effective tool in this regard. PlasEval is publicly available at https://github.com/acme92/PlasEval.

BMC Bioinformatics 25(1):365, published 2024-11-26. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11590284/pdf/
MoAGL-SA: a multi-omics adaptive integration method with graph learning and self attention for cancer subtype classification.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-23 DOI: 10.1186/s12859-024-05989-y
Lei Cheng, Qian Huang, Zhengqun Zhu, Yanan Li, Shuguang Ge, Longzhen Zhang, Ping Gong

Background: The integration of multi-omics data through deep learning has greatly improved cancer subtype classification, particularly in feature learning and multi-omics data integration. However, key challenges remain in embedding sample structure information into the feature space and designing flexible integration strategies.

Results: We propose MoAGL-SA, an adaptive multi-omics integration method based on graph learning and self-attention, to address these challenges. First, patient relationship graphs are generated from each omics dataset using graph learning. Next, three-layer graph convolutional networks are employed to extract omic-specific graph embeddings. Self-attention is then used to focus on the most relevant omics, adaptively assigning weights to different graph embeddings for multi-omics integration. Finally, cancer subtypes are classified using a softmax classifier.
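The adaptive weighting step can be illustrated with a small Python sketch. This is a simplified attention pooling over per-omics embeddings and is not the MoAGL-SA architecture: the single query vector, the scaling by the square root of the dimension, and the function names are assumptions made for illustration, and in a trained model the query would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Weight each omics embedding by its scaled dot product with `query`.

    embeddings: one vector per omics layer for a single patient.
    query: hypothetical query vector of the same dimension.
    Returns (weights, fused_embedding).
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(e * q for e, q in zip(emb, query)) / scale for emb in embeddings
    ]
    weights = softmax(scores)
    fused = [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(len(query))
    ]
    return weights, fused


# Three omics layers (for example mRNA, methylation, miRNA), toy 2-d embeddings.
omics = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = attention_fuse(omics, query=[2.0, 0.0])
print(weights, fused)
```

The same `softmax` helper could turn the class scores of a final linear layer into subtype probabilities, which is the role a softmax classifier plays at the end of such a pipeline.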

Conclusions: Experimental results show that MoAGL-SA outperforms several popular algorithms on datasets for breast invasive carcinoma, kidney renal papillary cell carcinoma, and kidney renal clear cell carcinoma. Additionally, MoAGL-SA successfully identifies key biomarkers for breast invasive carcinoma.

BMC Bioinformatics 25(1):364, published 2024-11-23. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11585958/pdf/
PIPETS: a statistically informed, gene-annotation agnostic analysis method to study bacterial termination using 3'-end sequencing.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-23 DOI: 10.1186/s12859-024-05982-5
Quinlan Furumo, Michelle M Meyer

Background: Over the last decade, the drop in short-read sequencing costs has allowed experimental techniques that use sequencing to address specific biological questions to proliferate, often outpacing standardized or effective analysis approaches for the data generated. There are growing amounts of bacterial 3'-end sequencing data, yet there is currently no commonly accepted analysis methodology for this data type. Most data analysis approaches are somewhat ad hoc and, despite the presence of substantial signal within annotated genes, focus on genomic regions outside the annotated genes (e.g. 3' or 5' UTRs). Furthermore, the lack of consistent, systematic analysis approaches and the absence of genome-wide ground truth data make it impossible to compare conclusions generated by different labs using different organisms.

Results: We present PIPETS (Poisson Identification of PEaks from Term-Seq data), an R package available on Bioconductor that provides a novel analysis method for 3'-end sequencing data. PIPETS is a statistically informed, gene-annotation agnostic methodology. Across two different datasets from two different organisms, PIPETS identified significant 3'-end termination signal across a wider range of annotated genomic contexts than existing analysis approaches, suggesting that existing approaches may miss biologically relevant signal. Furthermore, assessment of the previously called 3'-end positions not captured by PIPETS showed that they uniformly had very low coverage.
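As a rough illustration of what a Poisson-based, annotation-free peak call can look like, the Python sketch below flags positions whose 3'-end read count is unexpectedly high relative to neighbouring positions. It is not the PIPETS algorithm or its R interface: the window size, the significance threshold, the use of the neighbourhood mean as the Poisson rate, and the function names are all assumptions made for illustration.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), by direct summation."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    # Direct summation loses precision for extremely small tails;
    # that is acceptable for a sketch but not for production use.
    return max(0.0, 1.0 - cdf)


def call_peaks(counts, half_window=2, alpha=0.01):
    """Return positions whose count is significant against local background."""
    peaks = []
    for pos, k in enumerate(counts):
        lo = max(0, pos - half_window)
        hi = min(len(counts), pos + half_window + 1)
        neighbours = counts[lo:pos] + counts[pos + 1:hi]
        if not neighbours or k == 0:
            continue
        lam = sum(neighbours) / len(neighbours)  # local background rate
        if poisson_sf(k, lam) < alpha:
            peaks.append(pos)
    return peaks


# Toy 3'-end counts per genomic position (made-up values).
coverage = [1, 1, 1, 1, 20, 1, 1, 1, 1]
print(call_peaks(coverage))  # [4]
```

Because the test uses only the read counts themselves, no gene annotation is needed to decide where a termination signal is present.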

Conclusions: PIPETS provides a broadly applicable platform to explore and analyze 3'-end sequencing data sets from different organisms. It requires only the 3'-end sequencing data and is accessible to non-expert users.

BMC Bioinformatics 25(1):363, published 2024-11-23. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11585934/pdf/
ClassifieR 2.0: expanding interactive gene expression-based stratification to prostate and high-grade serous ovarian cancer.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-21 DOI: 10.1186/s12859-024-05981-6
Aideen McCabe, Gerard P Quinn, Suneil Jain, Micheál Ó Dálaigh, Kellie Dean, Ross G Murphy, Simon S McDade

Background: Advances in transcriptional profiling methods have enabled the discovery of molecular subtypes within and across traditional tissue-based cancer classifications. Such molecular subgroups hold potential for improving patient outcomes by guiding treatment decisions and revealing physiological distinctions and targetable pathways. Computational methods for stratifying transcriptomic data into molecular subgroups are increasingly abundant. However, assigning samples to these subtypes and other transcriptionally inferred predictions is time-consuming and requires significant bioinformatics expertise. To address this need, we recently reported "ClassifieR," a flexible, interactive cloud application for the functional annotation of colorectal and breast cancer transcriptomes. Here, we report "ClassifieR 2.0" which introduces additional modules for the molecular subtyping of prostate and high-grade serous ovarian cancer (HGSOC).

Results: ClassifieR 2.0 introduces ClassifieRp and ClassifieRov, two specialised modules specifically designed to address the challenges of prostate and HGSOC molecular classification. ClassifieRp includes sigInfer, a method we developed to infer commercial prognostic prostate gene expression signatures from publicly available gene-lists or indeed any user-uploaded gene-list. ClassifieRov utilizes consensus molecular subtyping methods for HGSOC, including tools like consensusOV, for accurate ovarian cancer stratification. Both modules include functionalities present in the original ClassifieR framework for estimating cellular composition, predicting transcription factor (TF) activity and single sample gene set enrichment analysis (ssGSEA).

Conclusions: ClassifieR 2.0 combines molecular subtyping of prostate cancer and HGSOC with commonly used sample annotation tools in a single, user-friendly platform, allowing scientists without bioinformatics training to explore prostate and HGSOC transcriptional data without manually handling data or operating multiple packages. Our sigInfer method within ClassifieRp enables the inference of commercially available gene signatures for prostate cancer, while ClassifieRov incorporates consensus molecular subtyping for HGSOC. Overall, ClassifieR 2.0 aims to make molecular subtyping more accessible to the wider research community. This is crucial for improving our understanding of the molecular heterogeneity of these cancers and for developing personalised treatment strategies.

BMC Bioinformatics 25(1):362, published 2024-11-21. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11580654/pdf/