Pub Date: 2024-11-12; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae177
Stefanie Lück, Uwe Scholz, Dimitar Douchkov
Motivation: Advances in genomics have created a pressing need for accessible tools that simplify complex genetic data analysis, enabling researchers across fields to harness the power of genome-wide association studies and genomic prediction. GWAStic was developed to bridge this gap, providing an intuitive platform that combines artificial intelligence with traditional statistical methods, making sophisticated genomic analysis accessible without requiring deep expertise in statistical software.
Results: We present GWAStic, an intuitive, cross-platform desktop application designed to streamline genome-wide association studies and genomic prediction for biological and medical researchers. With a user-friendly graphical interface, GWAStic integrates machine learning and traditional statistical approaches to support genetic analysis. The application accepts standard text-based Variant Call Format (VCF) files and PLINK binary files as input and generates clear graphical outputs, including Manhattan plots, quantile-quantile plots, and genomic prediction correlation plots, to enhance data visualization and analysis.
Availability and implementation: Project page: https://github.com/snowformatics/gwastic_desktop; GWAStic documentation: https://snowformatics.gitbook.io/product-docs; PyPI: https://pypi.org/project/gwastic-desktop/.
Title: Introducing GWAStic: a user-friendly, cross-platform solution for genome-wide association studies and genomic prediction. Journal: Bioinformatics advances 4(1), vbae177. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11643344/pdf/
Pub Date: 2024-11-09; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae176
Xiangnan Li, Yaqi Huang, Shuming Wang, Meng Hao, Yi Li, Hui Zhang, Zixin Hu
Motivation: The UK Biobank data holds immense potential for human health research. However, the complex data preparation and interpretation processes often act as barriers for researchers, diverting them from their core research questions.
Results: We developed LUKB, an R Shiny-based web tool that simplifies UK Biobank data preparation by automating these preprocessing tasks. LUKB reduces preprocessing time and integrates functions for initial data exploration, allowing researchers to dedicate more time to their scientific endeavors. Detailed deployment and usage instructions can be found in the Supplementary Data.
Availability and implementation: LUKB is freely available at https://github.com/HaiGenBuShang/LUKB.
Title: LUKB: preparing local UK Biobank data for analysis. Journal: Bioinformatics advances 4(1), vbae176. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11580680/pdf/
Pub Date: 2024-11-07; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae158
Chaoyue Sun, Yanjun Li, Simone Marini, Alberto Riva, Dapeng Oliver Wu, Ruogu Fang, Marco Salemi, Brittany Rife Magalis
Motivation: In the midst of an outbreak, identifying groups of individuals that pose a risk of transmitting the pathogen under investigation is critical to public health efforts. Dynamic transmission patterns within these clusters, whether they result from changes at the level of the virus (e.g. infectivity) or the host (e.g. vaccination), are critical in strategizing public health interventions, particularly when resources are limited. Phylogenetic trees are widely used to detect transmission clusters, and the topological shape of their branches can also be a useful source of information about the dynamics of the represented population.
Results: We evaluated the limitations of existing tree shape metrics when dealing with dynamic transmission clusters and propose instead a phylogeny-based deep learning system, DeepDynaTree, for dynamic classification. Comprehensive experiments carried out on a variety of simulated epidemic growth models and HIV epidemic data indicate that this graph deep learning approach is effective, robust, and informative for cluster dynamic prediction. Our results confirm that DeepDynaTree is a promising tool for transmission cluster characterization that can be modified to address the existing limitations and deficiencies in knowledge regarding the dynamics of transmission trajectories for groups at risk of pathogen infection.
Availability and implementation: DeepDynaTree is available under an MIT Licence at https://github.com/salemilab/DeepDynaTree.
Title: Phylogenetic-informed graph deep learning to classify dynamic transmission clusters in infectious disease epidemics. Journal: Bioinformatics advances 4(1), vbae158. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11552518/pdf/
Pub Date: 2024-11-05; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae172
Stephen Chapman, Theo Brunet, Arnaud Mourier, Bianca H Habermann
Motivation: Mitochondria are essential for cellular metabolism and are inherently flexible to allow correct function in a wide range of tissues. Consequently, dysregulated mitochondrial metabolism affects different tissues in different ways, leading to challenges in understanding the pathology of mitochondrial diseases. System-level metabolic modelling is useful in studying tissue-specific mitochondrial metabolism, yet despite the mouse being a common model organism in research, no mouse-specific mitochondrial metabolic model is currently available.
Results: Building upon the similarity between human and mouse mitochondrial metabolism, we present mitoMammal, a genome-scale metabolic model that contains human- and mouse-specific gene-product reaction rules. MitoMammal can model mouse and human mitochondrial metabolism. To demonstrate this, using an adapted E-Flux algorithm, we integrated proteomic data from mitochondria of isolated mouse cardiomyocytes and mouse brown adipocyte tissue, as well as transcriptomic data from in vitro differentiated human brown adipocytes, and modelled the context-specific metabolism using flux balance analysis. In all three simulations, mitoMammal made mostly accurate, and some novel, predictions relating to energy metabolism in the context of cardiomyocytes and brown adipocytes. This demonstrates its usefulness for research into cardiac disease and diabetes in both mouse and human contexts.
Availability and implementation: The MitoMammal Jupyter Notebook is available at: https://gitlab.com/habermann_lab/mitomammal.
Title: MitoMAMMAL: a genome scale model of mammalian mitochondria predicts cardiac and BAT metabolism. Journal: Bioinformatics advances 5(1), vbae172. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11696703/pdf/
Pub Date: 2024-11-05; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae171
Saish Jaiswal, Hema A Murthy, Manikandan Narayanan
Motivation: Genomic signal processing (GSP), which transforms biomolecular sequences into discrete signals for spectral analysis, has provided valuable insights into DNA sequence, structure, and evolution. However, challenges persist with spectral representations of variable-length sequences for tasks like species classification and in interpreting these spectra to identify discriminative DNA regions.
Results: We introduce SpecGMM, a novel framework that integrates sliding window-based Spectral analysis with a Gaussian Mixture Model to transform variable-length DNA sequences into fixed-dimensional spectral representations for taxonomic classification. SpecGMM's hyperparameters were selected using a dataset of plant sequences and applied unchanged across diverse datasets, including mitochondrial DNA, viral and bacterial genomes, and 16S rRNA sequences. Across these datasets, SpecGMM outperformed a baseline method, with 9.45% average and 35.55% maximum improvement in test accuracies for a Linear Discriminant classifier. Regarding interpretability, SpecGMM revealed discriminative hypervariable regions in 16S rRNA sequences (particularly V3/V4 for discriminating higher taxa and V2/V3 for lower taxa), corroborating their known classification relevance. SpecGMM's spectrogram video analysis helped visualize species-specific DNA signatures. SpecGMM thus provides a robust and interpretable method for spectral DNA analysis, opening new avenues in GSP research.
Availability and implementation: SpecGMM's source code is available at https://github.com/BIRDSgroup/SpecGMM.
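The sliding-window spectral step described for SpecGMM above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the purine/pyrimidine numeric mapping, the window and step sizes, and the function names are assumptions chosen for illustration, and the Gaussian Mixture Model stage is not reproduced here.

```python
import cmath
import math

# Assumed numeric mapping (purines +1, pyrimidines -1); SpecGMM's own mapping may differ.
MAPPING = {"A": 1.0, "G": 1.0, "C": -1.0, "T": -1.0}


def magnitude_spectrum(signal):
    """Return DFT magnitudes for frequency bins 0..N//2 of a real-valued signal."""
    n = len(signal)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def window_spectra(seq, window=8, step=4):
    """Slide a fixed-width window over the sequence and compute one spectrum per window."""
    # Unknown symbols (e.g. N) are mapped to 0.0 in this sketch.
    signal = [MAPPING.get(base, 0.0) for base in seq.upper()]
    return [
        magnitude_spectrum(signal[i:i + window])
        for i in range(0, len(signal) - window + 1, step)
    ]


# Toy usage: a strictly alternating purine/pyrimidine sequence concentrates
# its energy in the highest frequency bin of each window.
spectra = window_spectra("ACACACACACAC", window=8, step=4)
```

Sequences of different lengths yield different numbers of windows, while each window contributes a spectrum with the same number of bins. According to the abstract, it is the combination with a Gaussian Mixture Model, omitted in this sketch, that yields a fixed-dimensional representation.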
Title: SpecGMM: Integrating Spectral analysis and Gaussian Mixture Models for taxonomic classification and identification of discriminative DNA regions. Journal: Bioinformatics advances 4(1), vbae171. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11631429/pdf/
Motivation: Protein function prediction is crucial in bioinformatics, driven by the growth of protein sequence data from high-throughput technologies. Traditional methods are costly and slow, underscoring the need for computational solutions. While deep learning offers powerful tools, many models are not optimized for brain development datasets, which are critical for neurodevelopmental disorder research. To address this, we developed RecGOBD (Recognition of Gene Ontology-related Brain Development protein function), a model tailored to predict protein functions essential to brain development.
Results: RecGOBD targets 10 key gene ontology (GO) terms for brain development, embedding protein sequences associated with these terms. Leveraging advanced pre-trained models, it captures both sequence and structure data, aligning them with GO terms through attention mechanisms. The category attention layer enhances prediction accuracy. RecGOBD surpassed five benchmark models in AUROC, AUPR, and Fmax metrics and was further used to predict autism-related protein functions and assess mutation impacts on GO terms. These findings highlight RecGOBD's potential in advancing protein function prediction for neurodevelopmental disorders.
Availability and implementation: All Python code associated with this study is available at https://github.com/ZL-Xia/RECGOBD.git.
Title: RecGOBD: accurate recognition of gene ontology related brain development protein functions through multi-feature fusion and attention mechanisms. Authors: Zhiliang Xia, Shiqiang Ma, Jiawei Li, Yan Guo, Limin Jiang, Jijun Tang. Pub Date: 2024-11-04; DOI: 10.1093/bioadv/vbae163. Journal: Bioinformatics advances 4(1), vbae163. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11639192/pdf/
Pub Date: 2024-10-30; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae165
Stephan Breimann, Dmitrij Frishman
Summary: Amino acid scales are crucial for sequence-based protein prediction tasks, yet no gold standard scale set or simple scale selection methods exist. We developed AAclust, a wrapper for clustering models that require a pre-defined number of clusters k, such as k-means. AAclust obtains redundancy-reduced scale sets by clustering and selecting one representative scale per cluster, where k can either be optimized by AAclust or defined by the user. The utility of AAclust scale selections was assessed by applying machine learning models to 24 protein benchmark datasets. We found that top-performing scale sets were different for each benchmark dataset and significantly outperformed scale sets used in previous studies. Noteworthy is the strong dependence of the model performance on the scale set size. AAclust enables a systematic optimization of scale-based feature engineering in machine learning applications.
Availability and implementation: The AAclust algorithm is part of AAanalysis, a Python-based framework for interpretable sequence-based protein prediction, which is documented and accessible at https://aaanalysis.readthedocs.io/en/latest and https://github.com/breimanntools/aaanalysis.
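As a rough illustration of the idea in the AAclust summary above (cluster the scales, then keep one representative per cluster), the following Python sketch uses a plain k-means with a user-defined k. It is a toy under stated assumptions, not the AAanalysis API: the function names, the toy scale values, the deterministic initialisation, and the choice of the scale closest to the cluster centre as representative are all hypothetical, and AAclust's own optimisation of k is not reproduced.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, n_iter=50):
    """Plain k-means; the first k vectors serve as initial centres (an assumption)."""
    centers = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: euclid(v, centers[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = mean_vector(members)
    return labels, centers


def select_representatives(scales, k):
    """Cluster the scales and return one scale name per non-empty cluster."""
    names = list(scales)
    vectors = [scales[name] for name in names]
    labels, centers = kmeans(vectors, k)
    representatives = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            # Hypothetical rule: keep the member closest to the cluster centre.
            best = min(members, key=lambda i: euclid(vectors[i], centers[c]))
            representatives.append(names[best])
    return representatives


# Toy scales over four residues (made-up values, for illustration only).
toy_scales = {
    "hydro_a": [1.0, 0.9, 0.1, 0.0],
    "size_a": [0.0, 0.1, 1.0, 0.9],
    "hydro_b": [0.9, 1.0, 0.0, 0.1],
    "size_b": [0.1, 0.0, 0.9, 1.0],
}
reps = select_representatives(toy_scales, k=2)
```

With k = 2, the four toy scales collapse to two representatives, one from the "hydro" group and one from the "size" group, which is the redundancy reduction the summary describes.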
Title: AAclust: k-optimized clustering for selecting redundancy-reduced sets of amino acid scales. Journal: Bioinformatics advances 4(1), vbae165. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562964/pdf/
Pub Date: 2024-10-29; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae157
Paras Verma, Deeksha Thakur, Shashi B Pandit
Motivation: Gene transcripts are distinguished by the composition of their exons, and differences in exon composition may contribute to proteome complexity. Despite the availability of alternative splicing information documented in various databases, readily associating exonic variations with the protein sequence remains a mammoth task.
Results: To associate exonic variation(s) with the protein systematically, we designed the Exon Nomenclature and Classification of Transcripts (ENACT) framework, which uniquely annotates exons by tracking their loci in the context of gene architecture while encapsulating variations in splice site(s) and amino acid coding status. After ENACT annotation, predicted protein features (secondary structure/disorder/Pfam domains) are mapped to exon attributes. Thus, ENACTdb provides a trackable association of exonic variation(s) with isoform(s) and protein features, enabling the assessment of functional variation due to changes in exon composition. Such analyses can be readily performed through multiple views supported by the server. The exon-centric visualizations of ENACT-annotated isoforms could provide insights into the functional repertoire of genes arising from alternative splicing and its related processes and can serve as an important resource for the research community.
Availability and implementation: The database is publicly available at https://www.iscbglab.in/enactdb/. It contains protein-coding genes and isoforms for Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus, and Homo sapiens.
Title: Exon nomenclature and classification of transcripts database (ENACTdb): a resource for analyzing alternative splicing mediated proteome diversity. Journal: Bioinformatics advances 4(1), vbae157. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11576355/pdf/
Motivation: The human microbiome comprises complex associations and communication networks among microbial communities, which are crucial for maintaining health. The construction of microbial networks is vital for elucidating these associations. However, existing microbial network inference methods cannot handle zero-inflation and non-linear associations. Novel methods are therefore needed to improve the accuracy of microbial network inference.
Results: In this study, we introduce the Microbial Network based on Mutual Information and Markov Random Fields (MicroNet-MIMRF) as a novel approach for inferring microbial networks. Microbial abundance data are modeled with the zero-inflated Poisson distribution, and a discrete matrix is estimated for further calculation. Markov random fields based on mutual information are then used to construct accurate microbial networks. MicroNet-MIMRF excels at estimating pairwise associations between microbes, effectively addressing zero-inflation and non-linear associations in microbial abundance data. It outperforms commonly used techniques in simulation experiments, achieving area under the curve values exceeding 0.75 for all parameters. A case study on inflammatory bowel disease data further demonstrates the method's ability to identify insightful associations. In conclusion, MicroNet-MIMRF is a powerful tool for microbial network inference that handles the biases caused by zero-inflation and overestimation of associations.
Availability and implementation: MicroNet-MIMRF is available at https://github.com/Fionabiostats/MicroNet-MIMRF.
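The two statistical ingredients named in this abstract, the zero-inflated Poisson distribution and mutual information, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `zip_pmf` and `mutual_information`, the parameters `pi` and `lam`, and the assumption that abundances have already been discretized into levels are all illustrative choices, and the Markov random field step is not shown.

```python
import math
from collections import Counter


def zip_pmf(k, pi, lam):
    """Zero-inflated Poisson probability of observing count k.

    pi  : probability of a structural (excess) zero  (hypothetical name)
    lam : mean of the Poisson component              (hypothetical name)
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero arises either as an excess zero or from the Poisson part.
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson


def mutual_information(x, y):
    """Plug-in mutual information estimate (in nats) between two
    equal-length sequences of discrete levels."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum over observed pairs of p(a,b) * log(p(a,b) / (p(a) * p(b))).
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )
```

Mutual information is zero when two discretized abundance profiles are independent and grows with any form of dependence, linear or not, which is why it suits the non-linear associations mentioned in the abstract.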
{"title":"MicroNet-MIMRF: a microbial network inference approach based on mutual information and Markov random fields.","authors":"Chenqionglu Feng, Huiqun Jia, Hui Wang, Jiaojiao Wang, Mengxuan Lin, Xiaoyan Hu, Chenjing Yu, Hongbin Song, Ligui Wang","doi":"10.1093/bioadv/vbae167","DOIUrl":"https://doi.org/10.1093/bioadv/vbae167","url":null,"abstract":"<p><strong>Motivation: </strong>The human microbiome, comprises complex associations and communication networks among microbial communities, which are crucial for maintaining health. The construction of microbial networks is vital for elucidating these associations. However, existing microbial networks inference methods cannot solve the issues of zero-inflation and non-linear associations. Therefore, necessitating novel methods to improve the accuracy of microbial networks inference.</p><p><strong>Results: </strong>In this study, we introduce the Microbial Network based on Mutual Information and Markov Random Fields (MicroNet-MIMRF) as a novel approach for inferring microbial networks. Abundance data of microbes are modeled through the zero-inflated Poisson distribution, and the discrete matrix is estimated for further calculation. Markov random fields based on mutual information are used to construct accurate microbial networks. MicroNet-MIMRF excels at estimating pairwise associations between microbes, effectively addressing zero-inflation and non-linear associations in microbial abundance data. It outperforms commonly used techniques in simulation experiments, achieving area under the curve values exceeding 0.75 for all parameters. A case study on inflammatory bowel disease data further demonstrates the method's ability to identify insightful associations. 
Conclusively, MicroNet-MIMRF is a powerful tool for microbial network inference that handles the biases caused by zero-inflation and overestimation of associations.</p><p><strong>Availability and implementation: </strong>The MicroNet-MIMRF is provided at https://github.com/Fionabiostats/MicroNet-MIMRF.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae167"},"PeriodicalIF":2.4,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549015/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142633755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae142
Ruey Leng Loo, Javier Osorio Mosquera, Michael Zasso, Jacqueline Mathews, Desmond G Johnston, Jeremy K Nicholson, Luc Patiny, Elaine Holmes, Julien Wist
Motivation: Metabolic phenotyping, using high-resolution spectroscopic molecular fingerprints of biological samples, has demonstrated diagnostic, prognostic, and mechanistic value in clinical studies. However, clinical translation is hindered by the lack of viable workflows and challenges in converting spectral data into usable information.
Results: MetaboScope is an analytical and statistical workflow for learning, designing, and analyzing clinically relevant 1H nuclear magnetic resonance data. It features modular preprocessing pipelines, multivariate modeling tools, including Principal Components Analysis (PCA) and Orthogonal-Projection to Latent Structure Discriminant Analysis (OPLS-DA), and biomarker discovery tools (multiblock PCA and statistical spectroscopy). A simulation tool is also provided, allowing users to create synthetic spectra for hypothesis testing and power calculations.
Availability and implementation: MetaboScope is built as a pipeline where each module accepts the output generated by the previous one. This provides flexibility and simplicity of use, while being straightforward to maintain. The system and its libraries were developed in JavaScript and run as a web app; therefore, all the operations are performed on the local computer, circumventing the need to upload data. The MetaboScope tool is available at https://www.cheminfo.org/flavor/metabolomics/index.html. The code is open-source and can be deployed locally if necessary. Module notes, video tutorials, and clinical spectral datasets are provided for modeling.
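The pipeline design described here, where each module accepts the output of the previous one, can be sketched as plain function composition. The example below is in Python rather than the JavaScript used by MetaboScope, and the module names (`normalize_total_area`, `mean_center`), the helper `run_pipeline`, and the toy data are hypothetical stand-ins, not MetaboScope's actual modules or API.

```python
from functools import reduce


def normalize_total_area(spectra):
    """Scale each spectrum (row) so that its intensities sum to 1."""
    return [[v / sum(row) for v in row] for row in spectra]


def mean_center(spectra):
    """Subtract the per-variable (column) mean from every spectrum."""
    n = len(spectra)
    means = [sum(col) / n for col in zip(*spectra)]
    return [[v - m for v, m in zip(row, means)] for row in spectra]


def run_pipeline(data, modules):
    """Feed the output of each module into the next one, in order."""
    return reduce(lambda acc, module: module(acc), modules, data)


# Two toy "spectra" with two intensity values each.
spectra = [[1.0, 3.0], [2.0, 2.0]]
result = run_pipeline(spectra, [normalize_total_area, mean_center])
```

Because every module takes and returns the same kind of data, steps can be reordered, removed, or replaced without touching the others, which is one plausible reading of the flexibility and maintainability the abstract mentions.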
{"title":"MetaboScope: a statistical toolbox for analyzing <sup>1</sup>H nuclear magnetic resonance spectra from human clinical studies.","authors":"Ruey Leng Loo, Javier Osorio Mosquera, Michael Zasso, Jacqueline Mathews, Desmond G Johnston, Jeremy K Nicholson, Luc Patiny, Elaine Holmes, Julien Wist","doi":"10.1093/bioadv/vbae142","DOIUrl":"10.1093/bioadv/vbae142","url":null,"abstract":"<p><strong>Motivation: </strong>Metabolic phenotyping, using high-resolution spectroscopic molecular fingerprints of biological samples, has demonstrated diagnostic, prognostic, and mechanistic value in clinical studies. However, clinical translation is hindered by the lack of viable workflows and challenges in converting spectral data into usable information.</p><p><strong>Results: </strong>MetaboScope is an analytical and statistical workflow for learning, designing and analyzing clinically relevant <sup>1</sup>H nuclear magnetic resonance data. It features modular preprocessing pipelines, multivariate modeling tools including Principal Components Analysis (PCA), Orthogonal-Projection to Latent Structure Discriminant Analysis (OPLS-DA), and biomarker discovery tools (multiblock PCA and statistical spectroscopy). A simulation tool is also provided, allowing users to create synthetic spectra for hypothesis testing and power calculations.</p><p><strong>Availability and implementation: </strong>MetaboScope is built as a pipeline where each module accepts the output generated by the previous one. This provides flexibility and simplicity of use, while being straightforward to maintain. The system and its libraries were developed in JavaScript and run as a web app; therefore, all the operations are performed on the local computer, circumventing the need to upload data. The MetaboScope tool is available at https://www.cheminfo.org/flavor/metabolomics/index.html. The code is open-source and can be deployed locally if necessary. 
Module notes, video tutorials, and clinical spectral datasets are provided for modeling.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae142"},"PeriodicalIF":2.4,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11576352/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142683385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}