Pub Date : 2024-12-02 DOI: 10.1186/s12859-024-05995-0
Dinithi V Wanniarachchi, Sameera Viswakula, Anushka M Wickramasuriya
Background: The precise prediction of transcription factor binding sites (TFBSs) is pivotal for unraveling the gene regulatory networks underlying biological processes. While numerous tools have emerged for in silico TFBS prediction in recent years, the evolving landscape of computational biology necessitates thorough assessments of tool performance to ensure accuracy and reliability. Only a limited number of studies have been conducted to evaluate the performance of TFBS prediction tools comprehensively. Thus, the present study focused on assessing twelve widely used TFBS prediction tools and four de novo motif discovery tools using a benchmark dataset comprising real, generic, Markov, and negative sequences. TFBSs of Arabidopsis thaliana and Homo sapiens genomes downloaded from the JASPAR database were implanted in these sequences and the performance of tools was evaluated using several statistical parameters at different overlap percentages between the lengths of known and predicted binding sites.
Results: Overall, the Multiple Cluster Alignment and Search Tool (MCAST) emerged as the best TFBS prediction tool, followed by Find Individual Motif Occurrences (FIMO) and MOtif Occurrence Detection Suite (MOODS). In addition, MotEvo and Dinucleotide Weight Tensor Toolbox (DWT-toolbox) demonstrated the highest sensitivity in identifying TFBSs at 90% and 80% overlap. Further, MCAST and DWT-toolbox showed the highest sensitivity across all three data types: real, generic, and Markov. Among the de novo motif discovery tools, Multiple Em for Motif Elicitation (MEME) emerged as the best performer. An analysis of the promoter regions of genes involved in the anthocyanin biosynthesis pathway in plants and the pentose phosphate pathway in humans, using the three best-performing tools, revealed considerable variation among the top 20 motifs identified by these tools.
Conclusion: The findings of this study lay a robust groundwork for selecting optimal TFBS prediction tools for future research. Given the variability observed in tool performance, employing multiple tools to identify TFBSs in a set of sequences is highly recommended. In addition, further studies are recommended to develop an integrated toolbox that incorporates TFBS prediction or motif discovery tools, with the aim of improving the precision and accuracy of results.
The evaluation of transcription factor binding site prediction tools in human and Arabidopsis genomes. BMC Bioinformatics 25(1), 371 (2024-12-02). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11613939/pdf/
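The overlap-based evaluation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that overlap is measured as the fraction of the known site covered by a prediction and that sites are half-open (start, end) intervals; the study's exact definition may differ, and the function names and coordinates here are hypothetical.

```python
def overlap_fraction(known, predicted):
    """Fraction of the known site (start, end; end exclusive) covered by the prediction."""
    k_start, k_end = known
    p_start, p_end = predicted
    shared = max(0, min(k_end, p_end) - max(k_start, p_start))
    return shared / (k_end - k_start)


def sensitivity(known_sites, predicted_sites, min_overlap):
    """TP / (TP + FN): share of known sites matched by at least one prediction."""
    if not known_sites:
        return 0.0
    hits = sum(
        any(overlap_fraction(k, p) >= min_overlap for p in predicted_sites)
        for k in known_sites
    )
    return hits / len(known_sites)


# Hypothetical implanted sites and tool predictions on one sequence.
known = [(100, 110), (200, 212)]
predicted = [(101, 111), (150, 160), (204, 216)]

for threshold in (0.9, 0.6):
    print(threshold, sensitivity(known, predicted, threshold))
```

Raising the threshold makes a match harder to earn, which is why a tool's sensitivity can differ between, say, 90% and 80% overlap.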
Pub Date : 2024-12-02 DOI: 10.1186/s12859-024-05986-1
Gholamhossein Jowkar, Jūlija Pečerska, Manuel Gil, Maria Anisimova
Insertions and deletions (indels) play a significant role in genome evolution across species. Realistic modelling of indel evolution is challenging and is still an open research question. Several attempts have been made to explicitly model multi-character (long) indels, such as TKF92, by relaxing the site independence assumption and introducing fragments. However, these methods are computationally expensive. On the other hand, the Poisson Indel Process (PIP) assumes site independence but allows one to infer single-character indels on the phylogenetic tree, distinguishing insertions from deletions. PIP's marginal likelihood computation has linear time complexity, enabling ancestral sequence reconstruction (ASR) with indels in linear time. Recently, we developed ARPIP, an ASR method using PIP, capable of inferring indel events with explicit evolutionary interpretations. Here, we investigate the effect of the single-character indel assumption on reconstructed ancestral sequences for mammalian protein orthologs and for simulated data. We show that ARPIP's ancestral estimates preserve the gap length distribution observed in the input alignment. In mammalian proteins, inserted segments appear to be substantially longer than deleted segments. Further, we confirm the well-established deletion bias observed in real data. To date, ARPIP is the only ancestral reconstruction method that explicitly models insertion and deletion events over time. Given a good-quality input alignment, it can capture ancestral long indel events on the phylogeny.
Single-character insertion-deletion model preserves long indels in ancestral sequence reconstruction. BMC Bioinformatics 25(1), 370 (2024-12-02). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610121/pdf/
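The gap length distribution referred to in the abstract above can be tabulated from an alignment with a few lines of code. This is only a sketch of that summary statistic, assuming gaps are written as "-" and counted per row; it is not ARPIP itself, and the toy alignment is hypothetical.

```python
from collections import Counter
from itertools import groupby


def gap_length_distribution(alignment):
    """Count runs of consecutive '-' characters by length, over all aligned rows."""
    counts = Counter()
    for row in alignment:
        for char, run in groupby(row):
            if char == "-":
                counts[len(list(run))] += 1
    return counts


# Hypothetical toy alignment (three rows of equal length).
alignment = [
    "MKT--LLV-A",
    "MKTAGLLVQA",
    "M----LLV-A",
]

print(sorted(gap_length_distribution(alignment).items()))
```

Comparing this distribution between the input alignment and the reconstructed ancestral sequences is one simple way to check whether long gaps survive reconstruction.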
Pub Date : 2024-11-30 DOI: 10.1186/s12859-024-05990-5
Jianjun Zhang, Jane Zizhen Zhao, Samantha Gonzales, Xuexia Wang, Qiuying Sha
Background: Jointly analyzing multiple phenotypes/traits may increase power in genetic association studies by aggregating weak genetic effects. The chance that at least one phenotype is missing increases exponentially as the number of phenotypes increases, especially for a real dataset. It is common practice to discard individuals with missing phenotypes, or phenotypes with a large proportion of missing values. Such discarding may lead to a loss of power or even an insufficient sample size for analysis. To our knowledge, many existing phenotype imputation methods are built on multivariate normal assumptions. Violation of these assumptions may lead to inflated type I errors or even loss of power in some cases. To overcome these limitations, we propose a novel phenotype imputation method based on a new Gaussian copula model with three different loss functions to address the issue of missing phenotypes.
Results: In a variety of simulations and a real genetic association study for lung function, we show that our method outperforms existing methods and can also increase the power of the association test when compared to other comparable phenotype imputation methods. The proposed method is implemented in an R package available at https://github.com/jane-zizhen-zhao/CopulaPhenoImpute1.0
Conclusions: We propose a novel phenotype imputation method with a new Gaussian copula model based on three loss functions. Results of the simulation studies and real data analyses illustrate that the proposed method outperforms comparable methods.
A novel phenotype imputation method with copula model. BMC Bioinformatics 25(1), 369 (2024-11-30). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11607873/pdf/
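To illustrate why a Gaussian copula relaxes the multivariate normal assumption, the sketch below shows only the marginal step that such models commonly rely on: observed values are mapped to standard-normal scores through their ranks, so skewed traits become comparable on a normal scale. It is not the authors' imputation procedure or any of their loss functions; the function name, the rank/(n+1) convention, and the data are assumptions, and ties are not treated specially.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map observed values to standard-normal scores via their ranks; None stays None."""
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(observed)
    scores = [None] * len(values)
    std_normal = NormalDist()
    for rank, (_, i) in enumerate(observed, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1).
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores


# Hypothetical skewed trait with one missing measurement.
trait = [2.1, None, 15.0, 3.3, 2.9]
print(normal_scores(trait))
```

The outlying value 15.0 ends up as an ordinary upper-tail score, and the missing entry is left as None for a downstream imputation step to fill.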
Pub Date : 2024-11-27 DOI: 10.1186/s12859-024-05983-4
Ibrahim Abdelbaky, Mohamed Elhakeem, Hilal Tayara, Elsayed Badr, Mustafa Abdul Salam
Antimicrobial peptides (AMPs) are a promising class of antimicrobial drugs due to their broad-spectrum activity against microorganisms. However, their clinical application is limited by their potential to cause hemolysis, the destruction of red blood cells. To address this issue, we propose a deep learning model based on convolutional neural networks (CNNs) for predicting the hemolytic activity of AMPs. Peptide sequences are represented using one-hot encoding, and the CNN architecture consists of multiple convolutional and fully connected layers. The model was trained on six different datasets: HemoPI-1, HemoPI-2, HemoPI-3, RNN-Hem, Hlppredfuse, and AMP-Combined, achieving Matthews correlation coefficients of 0.9274, 0.5614, 0.6051, 0.6142, 0.8799, and 0.7484, respectively. Our model outperforms previously reported methods and can facilitate the development of novel AMPs with reduced hemolytic activity, which is crucial for their therapeutic use in treating bacterial infections.
Enhanced prediction of hemolytic activity in antimicrobial peptides using deep learning-based sequence analysis. BMC Bioinformatics 25(1), 368 (2024-11-27). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11603801/pdf/
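Two ingredients named in the abstract above, one-hot encoding of peptides and the Matthews correlation coefficient (MCC), can be sketched as follows. The padding length, the handling of non-standard residues (left as all-zero rows), and the example labels are assumptions for illustration; the CNN itself is not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(peptide, max_len):
    """Encode a peptide as a max_len x 20 binary matrix, zero-padded or truncated."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(peptide[:max_len]):
        if aa in index:  # unknown residues stay as all-zero rows
            matrix[pos][index[aa]] = 1
    return matrix


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = hemolytic)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


encoded = one_hot("AC", max_len=3)
print(len(encoded), len(encoded[0]))
print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC uses all four cells of the confusion matrix, which makes it a common choice when hemolytic and non-hemolytic peptides are unevenly represented.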
Pub Date : 2024-11-27 DOI: 10.1186/s12859-024-05992-3
Nasma Boumajdi, Houda Bendani, Lahcen Belyamani, Azeddine Ibrahimi
Background: Genomic sequence similarity comparison is a crucial research area in bioinformatics. Multiple Sequence Alignment (MSA) is the basic technique used to identify regions of similarity between sequences. Although MSA tools are widely used and highly accurate, they are often limited by computational complexity and by inaccuracies when handling highly divergent sequences, which has led to the development of alignment-free (AF) algorithms.
Results: This paper presents TreeWave, a novel AF approach based on frequency chaos game representation and discrete wavelet transform of sequences for phylogeny inference. We validate our method on various genomic datasets such as complete virus genome sequences, bacteria genome sequences, human mitochondrial genome sequences, and rRNA gene sequences. Compared to classical methods, our tool demonstrates a significant reduction in running time, especially when analyzing large datasets. The resulting phylogenetic trees show that TreeWave has similar classification accuracy to the classical MSA methods based on the normalized Robinson-Foulds distances and Baker's Gamma coefficients.
Conclusions: TreeWave is an open source and user-friendly command line tool for phylogeny reconstruction. It is a faster and more scalable tool that prioritizes computational efficiency while maintaining accuracy. TreeWave is freely available at https://github.com/nasmaB/TreeWave.
TreeWave: command line tool for alignment-free phylogeny reconstruction based on graphical representation of DNA sequences and genomic signal processing. BMC Bioinformatics 25(1), 367 (2024-11-27). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11600722/pdf/
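The frequency chaos game representation (FCGR) named in the abstract above turns a DNA sequence into a fixed-size matrix of k-mer frequencies. The sketch below is a generic version, not TreeWave's code: the corner assignment (A, C, G, T), the order in which bases refine the cell, and the skipping of k-mers with ambiguous bases are all assumptions, and the wavelet transform step is not shown.

```python
# Assumed corner convention: each base contributes one (x, y) bit pair.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Return a 2^k x 2^k matrix of k-mer frequencies (rows indexed by y, columns by x)."""
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    total = 0
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing N or other ambiguity codes
        x = y = 0
        for base in kmer:
            bx, by = CORNERS[base]
            x, y = 2 * x + bx, 2 * y + by
        grid[y][x] += 1
        total += 1
    if total:
        grid = [[cell / total for cell in row] for row in grid]
    return grid


matrix = fcgr("ACGTACGT", k=2)
for row in matrix:
    print([round(cell, 3) for cell in row])
```

Because every sequence maps to a matrix of the same shape regardless of its length, such matrices can be compared directly without aligning the sequences.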
Pub Date : 2024-11-26 DOI: 10.1186/s12859-024-05977-2
Rosa Aghdam, Xudong Tang, Shan Shan, Richard Lankau, Claudia Solís-Lemus
Background: The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide one of the first comprehensive investigations into the predictive potential of machine learning models for understanding the connections between soil and biological phenotypes. We investigate an integrative framework performing accurate machine learning-based prediction of plant performance from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network.
Results: Prediction improves when we add environmental features, such as soil properties and microbial density, along with microbiome data. Different preprocessing strategies show that human decisions significantly impact predictive performance. We show that the naive total sum scaling normalization that is commonly used in microbiome research is one of the optimal strategies to maximize predictive power. Also, we find that accurately defined labels are more important than normalization, taxonomic level, or model characteristics. ML performance is limited when humans cannot classify samples accurately. Lastly, we provide domain scientists with a full model selection decision tree to identify the human choices that optimize model prediction power.
Conclusions: Our study highlights the importance of incorporating diverse environmental features and careful data preprocessing in enhancing the predictive power of machine learning models for soil and biological phenotype connections. This approach can significantly contribute to advancing agricultural practices and soil health management.
Human limits in machine learning: prediction of potato yield and disease using soil microbiome data. BMC Bioinformatics 25(1), 366 (2024-11-26). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11600749/pdf/
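Total sum scaling, the normalization highlighted in the abstract above, divides each sample's taxon counts by that sample's total so that every sample sums to one. A minimal sketch, with a hypothetical count table and an assumed convention of returning zeros for empty samples:

```python
def total_sum_scaling(counts):
    """Convert per-sample taxon counts (rows = samples) to relative abundances."""
    scaled = []
    for sample in counts:
        total = sum(sample)
        # An all-zero sample has no defined proportions; zeros are returned by assumption.
        scaled.append([c / total if total else 0.0 for c in sample])
    return scaled


# Hypothetical count table: two soil samples, three taxa.
counts = [
    [10, 30, 60],
    [0, 0, 0],
]
print(total_sum_scaling(counts))
```

The transformation removes differences in sequencing depth between samples, which is why it is a common default before feeding microbiome data to a model.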
Pub Date : 2024-11-26 DOI: 10.1186/s12859-024-05941-0
Aniket Mane, Haley Sanderson, Aaron P White, Rahat Zaheer, Robert Beiko, Cédric Chauve
Background: Plasmids play a major role in the transfer of antimicrobial resistance (AMR) genes among bacteria via horizontal gene transfer. The identification of plasmids in short-read assemblies is a challenging problem and a very active research area. Plasmid binning aims at detecting, in a draft genome assembly, groups (bins) of contigs likely to originate from the same plasmid. Several methods for plasmid binning have been developed recently, such as PlasBin-flow, HyAsP, gplas, MOB-suite, and plasmidSPAdes. This motivates the problem of evaluating the performance of plasmid binning methods, either against a given ground truth or against one another.
Results: We describe PlasEval, a novel method aimed at comparing the results of plasmid binning tools. PlasEval computes a dissimilarity measure between two sets of plasmid bins, which can originate either from two plasmid binning tools or from a plasmid binning tool and a ground truth set of plasmid bins. The PlasEval dissimilarity accounts for the contig content of plasmid bins and the length of contigs, and is repeat-aware. Moreover, the dissimilarity score computed by PlasEval is broken down into several parts, which makes it possible to understand qualitative differences between the compared sets of plasmid bins. We illustrate the use of PlasEval by benchmarking four recently developed plasmid binning tools (PlasBin-flow, HyAsP, gplas, and MOB-recon) on a data set of 53 E. coli bacterial genomes.
Conclusion: Analysis of the results of plasmid binning methods using PlasEval shows that their behaviour varies significantly. PlasEval can be used to decide which specific plasmid binning method should be used for a specific dataset. The disagreement between different methods also suggests that the problem of plasmid binning on short-read contigs requires further research. We believe that PlasEval can prove to be an effective tool in this regard. PlasEval is publicly available at https://github.com/acme92/PlasEval.
Plaseval: a framework for comparing and evaluating plasmid detection tools. BMC Bioinformatics 25(1), 365 (2024-11-26). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11590284/pdf/
Pub Date : 2024-11-23 DOI: 10.1186/s12859-024-05989-y
Lei Cheng, Qian Huang, Zhengqun Zhu, Yanan Li, Shuguang Ge, Longzhen Zhang, Ping Gong
Background: The integration of multi-omics data through deep learning has greatly improved cancer subtype classification, particularly in feature learning and multi-omics data integration. However, key challenges remain in embedding sample structure information into the feature space and designing flexible integration strategies.
Results: We propose MoAGL-SA, an adaptive multi-omics integration method based on graph learning and self-attention, to address these challenges. First, patient relationship graphs are generated from each omics dataset using graph learning. Next, three-layer graph convolutional networks are employed to extract omic-specific graph embeddings. Self-attention is then used to focus on the most relevant omics, adaptively assigning weights to different graph embeddings for multi-omics integration. Finally, cancer subtypes are classified using a softmax classifier.
Conclusions: Experimental results show that MoAGL-SA outperforms several popular algorithms on datasets for breast invasive carcinoma, kidney renal papillary cell carcinoma, and kidney renal clear cell carcinoma. Additionally, MoAGL-SA successfully identifies key biomarkers for breast invasive carcinoma.
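The fusion and classification steps described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the graph learning and graph convolutional layers are omitted, self-attention is reduced to a single scoring vector (`query`), and the omics names, example vectors, and function names (`fuse_omics`, `classify`) are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fuse_omics(embeddings, query):
    """Attention-weighted sum of per-omics embeddings for one patient.

    embeddings: dict mapping an omics name to its embedding vector
    query: hypothetical learned vector that scores each omics' relevance
    """
    names = list(embeddings)
    scale = math.sqrt(len(query))
    # One relevance score per omics, normalised into weights that sum to 1.
    weights = softmax([dot(embeddings[n], query) / scale for n in names])
    fused = [
        sum(w * embeddings[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


def classify(fused, class_weights):
    """Softmax classifier with one hypothetical weight vector per subtype."""
    return softmax([dot(fused, w) for w in class_weights])


# Toy input: three omics layers with 2-dimensional embeddings (illustrative only).
embeddings = {
    "mRNA": [1.0, 0.0],
    "methylation": [0.0, 1.0],
    "miRNA": [0.5, 0.5],
}
weights, fused = fuse_omics(embeddings, query=[1.0, 0.0])
probs = classify(fused, class_weights=[[1.0, 0.0], [0.0, 1.0]])
print(weights, probs)
```

In this toy case the omics layer whose embedding aligns best with `query` receives the largest weight, which is the adaptive weighting behaviour the abstract describes in general terms.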
MoAGL-SA: a multi-omics adaptive integration method with graph learning and self attention for cancer subtype classification. Lei Cheng, Qian Huang, Zhengqun Zhu, Yanan Li, Shuguang Ge, Longzhen Zhang, Ping Gong. BMC Bioinformatics 25(1):364, published 2024-11-23. DOI: 10.1186/s12859-024-05989-y. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11585958/pdf/
Pub Date: 2024-11-23 DOI: 10.1186/s12859-024-05982-5
Quinlan Furumo, Michelle M Meyer
Background: Over the last decade, the drop in short-read sequencing costs has allowed experimental techniques that use sequencing to address specific biological questions to proliferate, often outpacing the development of standardized or effective analysis approaches for the data generated. There are growing amounts of bacterial 3'-end sequencing data, yet there is currently no commonly accepted analysis methodology for this data type. Most data analysis approaches are somewhat ad hoc and, despite the presence of substantial signal within annotated genes, focus on genomic regions outside the annotated genes (e.g. 3' or 5' UTRs). Furthermore, the lack of consistent, systematic analysis approaches, together with the absence of genome-wide ground truth data, makes it impossible to compare conclusions generated by different labs using different organisms.
Results: We present PIPETS (Poisson Identification of PEaks from Term-Seq data), an R package available on Bioconductor that provides a novel analysis method for 3'-end sequencing data. PIPETS is a statistically informed, gene-annotation agnostic methodology. Across two datasets from two different organisms, PIPETS identified significant 3'-end termination signal across a wider range of annotated genomic contexts than existing analysis approaches, suggesting that existing approaches may miss biologically relevant signal. Furthermore, assessment of the previously called 3'-end positions not captured by PIPETS showed that they uniformly had very low coverage.
Conclusions: PIPETS provides a broadly applicable platform to explore and analyze 3'-end sequencing data sets from different organisms. It requires only the 3'-end sequencing data and is broadly accessible to non-expert users.
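The general idea of Poisson-based peak identification can be sketched in a few lines of Python. This is an illustration only and not the PIPETS algorithm (PIPETS is an R package, and the abstract does not give its procedure): the sliding-window background estimate, the `window`, `alpha`, and `min_rate` parameters, and the function names are all assumptions made for this example.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    Adequate for a sketch; precision is limited for extremely small tails.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # pmf at 0
    cdf = term
    for i in range(1, k):
        term *= lam / i  # pmf(i) from pmf(i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(counts, window=5, alpha=0.001, min_rate=0.5):
    """Return positions whose 3'-end read count exceeds the local background.

    counts: list of 3'-end read counts, one per genomic position
    window: hypothetical number of flanking positions used as background
    alpha: hypothetical significance threshold on the Poisson upper tail
    min_rate: floor on the background rate so isolated single reads
              in empty regions are not called as peaks
    """
    peaks = []
    n = len(counts)
    for i, c in enumerate(counts):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        background = [counts[j] for j in range(lo, hi) if j != i]
        if not background:
            continue
        lam = max(sum(background) / len(background), min_rate)
        if poisson_sf(c, lam) < alpha:
            peaks.append(i)
    return peaks


# Toy coverage track with one sharp termination signal at index 4.
counts = [2, 3, 1, 2, 40, 2, 1, 3, 2, 2, 1]
print(call_peaks(counts))
```

Because the test uses only read counts and their local neighbourhood, a sketch like this needs no gene annotation, which matches the annotation-agnostic framing in the abstract.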
PIPETS: a statistically informed, gene-annotation agnostic analysis method to study bacterial termination using 3'-end sequencing. Quinlan Furumo, Michelle M Meyer. BMC Bioinformatics 25(1):363, published 2024-11-23. DOI: 10.1186/s12859-024-05982-5. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11585934/pdf/
Pub Date: 2024-11-21 DOI: 10.1186/s12859-024-05981-6
Aideen McCabe, Gerard P Quinn, Suneil Jain, Micheál Ó Dálaigh, Kellie Dean, Ross G Murphy, Simon S McDade
Background: Advances in transcriptional profiling methods have enabled the discovery of molecular subtypes within and across traditional tissue-based cancer classifications. Such molecular subgroups hold potential for improving patient outcomes by guiding treatment decisions and revealing physiological distinctions and targetable pathways. Computational methods for stratifying transcriptomic data into molecular subgroups are increasingly abundant. However, assigning samples to these subtypes and other transcriptionally inferred predictions is time-consuming and requires significant bioinformatics expertise. To address this need, we recently reported "ClassifieR," a flexible, interactive cloud application for the functional annotation of colorectal and breast cancer transcriptomes. Here, we report "ClassifieR 2.0" which introduces additional modules for the molecular subtyping of prostate and high-grade serous ovarian cancer (HGSOC).
Results: ClassifieR 2.0 introduces ClassifieRp and ClassifieRov, two specialised modules designed to address the challenges of prostate and HGSOC molecular classification. ClassifieRp includes sigInfer, a method we developed to infer commercial prognostic prostate gene expression signatures from publicly available gene lists or any user-uploaded gene list. ClassifieRov uses consensus molecular subtyping methods for HGSOC, including tools such as consensusOV, for accurate ovarian cancer stratification. Both modules include functionalities present in the original ClassifieR framework for estimating cellular composition, predicting transcription factor (TF) activity, and performing single-sample gene set enrichment analysis (ssGSEA).
Conclusions: ClassifieR 2.0 combines molecular subtyping of prostate cancer and HGSOC with commonly used sample annotation tools in a single, user-friendly platform, allowing scientists without bioinformatics training to explore prostate and HGSOC transcriptional data without manually handling data across multiple packages. Our sigInfer method within ClassifieRp enables the inference of commercially available gene signatures for prostate cancer, while ClassifieRov incorporates consensus molecular subtyping for HGSOC. Overall, ClassifieR 2.0 aims to make molecular subtyping more accessible to the wider research community. This is crucial for improving understanding of the molecular heterogeneity of these cancers and for developing personalised treatment strategies.
ClassifieR 2.0: expanding interactive gene expression-based stratification to prostate and high-grade serous ovarian cancer. Aideen McCabe, Gerard P Quinn, Suneil Jain, Micheál Ó Dálaigh, Kellie Dean, Ross G Murphy, Simon S McDade. BMC Bioinformatics 25(1):362, published 2024-11-21. DOI: 10.1186/s12859-024-05981-6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11580654/pdf/