Bioinformatics Advances: latest articles, page 4

Introducing GWAStic: a user-friendly, cross-platform solution for genome-wide association studies and genomic prediction.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-12 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae177

Stefanie Lück, Uwe Scholz, Dimitar Douchkov

Motivation: Advances in genomics have created an insistent need for accessible tools that simplify complex genetic data analysis, enabling researchers across fields to harness the power of genome-wide association studies and genomic prediction. GWAStic was developed to bridge this gap, providing an intuitive platform that combines artificial intelligence with traditional statistical methods, making sophisticated genomic analysis accessible without requiring deep expertise in statistical software.

Results: We present GWAStic, an intuitive, cross-platform desktop application designed to streamline genome-wide association studies and genomic prediction for biological and medical researchers. With a user-friendly graphical interface, GWAStic integrates machine learning and traditional statistical approaches to support genetic analysis. The application accepts inputs from standard text-based Variant Call Formats and PLINK binary files, generating clear graphical outputs, including Manhattan plots, quantile-quantile plots, and genomic prediction correlation plots to enhance data visualization and analysis.

Availability and implementation: Project page: https://github.com/snowformatics/gwastic_desktop; GWAStic documentation: https://snowformatics.gitbook.io/product-docs; PyPI: https://pypi.org/project/gwastic-desktop/.
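The quantile-quantile plot mentioned in the GWAStic abstract compares observed association p-values with the values expected under the null hypothesis. The following is a minimal sketch of how the plotted points are commonly derived; it is not GWAStic's own code, and the function name `qq_points` and the toy p-values are illustrative assumptions.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a quantile-quantile plot.

    Assumes every p-value is strictly greater than 0.
    """
    n = len(p_values)
    # Observed values, sorted from least to most significant.
    observed = sorted(-math.log10(p) for p in p_values)
    # Under the null, the i-th smallest of n p-values is expected near i / (n + 1).
    expected = sorted(-math.log10(i / (n + 1)) for i in range(1, n + 1))
    return list(zip(expected, observed))

# Toy input: four hypothetical p-values.
points = qq_points([0.5, 0.01, 0.2, 0.9])
```

Points that rise well above the diagonal at the upper end suggest markers with stronger association than chance alone would produce.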

Citations: 0

LUKB: preparing local UK Biobank data for analysis.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-09 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae176

Xiangnan Li, Yaqi Huang, Shuming Wang, Meng Hao, Yi Li, Hui Zhang, Zixin Hu

Motivation: The UK Biobank data holds immense potential for human health research. However, the complex data preparation and interpretation processes often act as barriers for researchers, diverting them from their core research questions.

Results: We developed LUKB, an R Shiny-based web tool that simplifies UK Biobank data preparation by automating these preprocessing tasks. LUKB reduces preprocessing time and integrates functions for initial data exploration, allowing researchers to dedicate more time to their scientific endeavors. Detailed deployment and usage can be found in the Supplementary Data.

Availability and implementation: LUKB is freely available at https://github.com/HaiGenBuShang/LUKB.

Citations: 0

Phylogenetic-informed graph deep learning to classify dynamic transmission clusters in infectious disease epidemics.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-07 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae158

Chaoyue Sun, Yanjun Li, Simone Marini, Alberto Riva, Dapeng Oliver Wu, Ruogu Fang, Marco Salemi, Brittany Rife Magalis

Motivation: In the midst of an outbreak, identification of groups of individuals that represent risk for transmission of the pathogen under investigation is critical to public health efforts. Dynamic transmission patterns within these clusters, whether it be the result of changes at the level of the virus (e.g. infectivity) or host (e.g. vaccination), are critical in strategizing public health interventions, particularly when resources are limited. Phylogenetic trees are widely used not only in the detection of transmission clusters, but the topological shape of the branches within can be useful sources of information regarding the dynamics of the represented population.

Results: We evaluated the limitations of existing tree shape metrics when dealing with dynamic transmission clusters and propose instead a phylogeny-based deep learning system, DeepDynaTree, for dynamic classification. Comprehensive experiments carried out on a variety of simulated epidemic growth models and HIV epidemic data indicate that this graph deep learning approach is effective, robust, and informative for cluster dynamic prediction. Our results confirm that DeepDynaTree is a promising tool for transmission cluster characterization that can be modified to address the existing limitations and deficiencies in knowledge regarding the dynamics of transmission trajectories for groups at risk of pathogen infection.

Availability and implementation: DeepDynaTree is available under an MIT Licence at https://github.com/salemilab/DeepDynaTree.
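To make "tree shape metric" concrete, the sketch below computes the Sackin index, a classic imbalance statistic defined as the sum of leaf depths. It is shown only as an example of the kind of summary statistic the abstract refers to; the abstract does not say which metrics the authors evaluated, and the nested-tuple tree encoding is an assumption made for this illustration.

```python
def sackin_index(tree, depth=0):
    """Sum of leaf depths for a tree given as nested tuples.

    Any non-tuple value is treated as a leaf; a tuple is an internal node
    whose elements are its children.
    """
    if not isinstance(tree, tuple):
        return depth
    return sum(sackin_index(child, depth + 1) for child in tree)

# Two hypothetical four-tip trees: fully balanced and fully ladder-like.
balanced = (("a", "b"), ("c", "d"))
ladder = ((("a", "b"), "c"), "d")

balanced_score = sackin_index(balanced)
ladder_score = sackin_index(ladder)
```

The ladder-like tree scores higher than the balanced one, which is how such a metric summarizes an entire topology in a single number.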

Citations: 0

MitoMAMMAL: a genome scale model of mammalian mitochondria predicts cardiac and BAT metabolism.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-05 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbae172

Stephen Chapman, Theo Brunet, Arnaud Mourier, Bianca H Habermann

Motivation: Mitochondria are essential for cellular metabolism and are inherently flexible to allow correct function in a wide range of tissues. Consequently, dysregulated mitochondrial metabolism affects different tissues in different ways leading to challenges in understanding the pathology of mitochondrial diseases. System-level metabolic modelling is useful in studying tissue-specific mitochondrial metabolism, yet despite the mouse being a common model organism in research, no mouse specific mitochondrial metabolic model is currently available.

Results: Building upon the similarity between human and mouse mitochondrial metabolism, we present mitoMammal, a genome-scale metabolic model that contains human and mouse specific gene-product reaction rules. MitoMammal is able to model mouse and human mitochondrial metabolism. To demonstrate this, using an adapted E-Flux algorithm, we integrated proteomic data from mitochondria of isolated mouse cardiomyocytes and mouse brown adipocyte tissue, as well as transcriptomic data from in vitro differentiated human brown adipocytes and modelled the context specific metabolism using flux balance analysis. In all three simulations, mitoMammal made mostly accurate, and some novel predictions relating to energy metabolism in the context of cardiomyocytes and brown adipocytes. This demonstrates its usefulness in research in cardiac disease and diabetes in both mouse and human contexts.

Availability and implementation: The MitoMammal Jupyter Notebook is available at: https://gitlab.com/habermann_lab/mitomammal.
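The general idea behind E-Flux is to tighten each reaction's maximum flux in proportion to the measured expression of its associated genes or proteins before running flux balance analysis. The sketch below shows only that bound-scaling step under simplifying assumptions: one expression value per reaction, normalization to the highest value, and a hypothetical default bound of 1000. The reaction identifiers are made up, and the authors' adapted algorithm is not described in the abstract, so it may differ.

```python
def eflux_upper_bounds(expression, default_bound=1000.0):
    """Scale each reaction's upper flux bound by its relative expression.

    expression: dict mapping a reaction id to a non-negative expression level,
    with at least one value greater than zero.
    """
    peak = max(expression.values())
    return {rxn: default_bound * level / peak for rxn, level in expression.items()}

# Hypothetical reaction ids and expression levels.
bounds = eflux_upper_bounds({"R1": 50.0, "R2": 100.0, "R3": 0.0})
```

The resulting bounds would then be passed to a linear programming solver as flux constraints; that optimization step is omitted here.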

Citations: 0

SpecGMM: Integrating Spectral analysis and Gaussian Mixture Models for taxonomic classification and identification of discriminative DNA regions.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-05 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae171

Saish Jaiswal, Hema A Murthy, Manikandan Narayanan

Motivation: Genomic signal processing (GSP), which transforms biomolecular sequences into discrete signals for spectral analysis, has provided valuable insights into DNA sequence, structure, and evolution. However, challenges persist with spectral representations of variable-length sequences for tasks like species classification and in interpreting these spectra to identify discriminative DNA regions.

Results: We introduce SpecGMM, a novel framework that integrates sliding window-based Spectral analysis with a Gaussian Mixture Model to transform variable-length DNA sequences into fixed-dimensional spectral representations for taxonomic classification. SpecGMM's hyperparameters were selected using a dataset of plant sequences, and applied unchanged across diverse datasets, including mitochondrial DNA, viral and bacterial genomes, and 16S rRNA sequences. Across these datasets, SpecGMM outperformed a baseline method, with 9.45% average and 35.55% maximum improvement in test accuracies for a Linear Discriminant classifier. Regarding interpretability, SpecGMM revealed discriminative hypervariable regions in 16S rRNA sequences, particularly V3/V4 for discriminating higher taxa and V2/V3 for lower taxa, corroborating their known classification relevance. SpecGMM's spectrogram video analysis helped visualize species-specific DNA signatures. SpecGMM thus provides a robust and interpretable method for spectral DNA analysis, opening new avenues in GSP research.

Availability and implementation: SpecGMM's source code is available at https://github.com/BIRDSgroup/SpecGMM.
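The sliding-window spectral step described in the SpecGMM abstract can be illustrated with a small sketch: encode the sequence as numbers, cut it into overlapping windows, and take the magnitude spectrum of each window. The purine/pyrimidine encoding, the window and step sizes, and all function names below are assumptions for illustration; the Gaussian Mixture Model stage is omitted, and the actual SpecGMM pipeline may differ in every one of these choices.

```python
import cmath

# Hypothetical numeric encoding (purine = +1, pyrimidine = -1).
ENCODING = {"A": 1.0, "G": 1.0, "C": -1.0, "T": -1.0}

def dft_magnitudes(signal):
    """Magnitudes of the non-redundant DFT bins (0 .. n // 2) of a real signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]

def sliding_spectra(sequence, window=8, step=4):
    """One magnitude spectrum per window; longer sequences yield more windows."""
    signal = [ENCODING[base] for base in sequence.upper()]
    return [
        dft_magnitudes(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, step)
    ]

spectra = sliding_spectra("ACGTACGTACGTACGTACGT")
```

Because the number of windows grows with sequence length, some pooling step is needed to obtain a fixed-dimensional representation; in the abstract that role is played by the Gaussian Mixture Model.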

Citations: 0

RecGOBD: accurate recognition of gene ontology related brain development protein functions through multi-feature fusion and attention mechanisms.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-04 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae163

Zhiliang Xia, Shiqiang Ma, Jiawei Li, Yan Guo, Limin Jiang, Jijun Tang

Motivation: Protein function prediction is crucial in bioinformatics, driven by the growth of protein sequence data from high-throughput technologies. Traditional methods are costly and slow, underscoring the need for computational solutions. While deep learning offers powerful tools, many models lack optimization for brain development datasets, critical for neurodevelopmental disorder research. To address this, we developed RecGOBD (Recognition of Gene Ontology-related Brain Development protein function), a model tailored to predict protein functions essential to brain development.

Result: RecGOBD targets 10 key gene ontology (GO) terms for brain development, embedding protein sequences associated with these terms. Leveraging advanced pre-trained models, it captures both sequence and structure data, aligning them with GO terms through attention mechanisms. The category attention layer enhances prediction accuracy. RecGOBD surpassed five benchmark models in AUROC, AUPR, and Fmax metrics and was further used to predict autism-related protein functions and assess mutation impacts on GO terms. These findings highlight RecGOBD's potential in advancing protein function prediction for neurodevelopmental disorders.

Availability and implementation: All Python codes associated with this study are available at https://github.com/ZL-Xia/RECGOBD.git.
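Fmax, one of the metrics named in the RecGOBD abstract, is the best F-measure obtained over all score thresholds. The sketch below follows the protein-centric definition commonly used in function prediction benchmarks (precision averaged over proteins with at least one prediction, recall averaged over proteins with at least one true term). Whether the authors computed it exactly this way is an assumption, and the term names in the example are placeholders.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax.

    y_true: list of sets of true terms, one set per protein.
    y_score: list of dicts mapping term -> predicted score, one dict per protein.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:  # precision only counts proteins with a prediction
                precisions.append(hits / len(predicted))
            if truth:  # recall only counts proteins with annotations
                recalls.append(hits / len(truth))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best

# Two hypothetical proteins and two placeholder terms.
truth = [{"term_a"}, {"term_b"}]
scores = [{"term_a": 0.9, "term_b": 0.1}, {"term_a": 0.2, "term_b": 0.8}]
score = fmax(truth, scores)
```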

Citations: 0

AAclust: k-optimized clustering for selecting redundancy-reduced sets of amino acid scales.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-10-30 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae165

Stephan Breimann, Dmitrij Frishman

Summary: Amino acid scales are crucial for sequence-based protein prediction tasks, yet no gold standard scale set or simple scale selection methods exist. We developed AAclust, a wrapper for clustering models that require a pre-defined number of clusters k, such as k-means. AAclust obtains redundancy-reduced scale sets by clustering and selecting one representative scale per cluster, where k can either be optimized by AAclust or defined by the user. The utility of AAclust scale selections was assessed by applying machine learning models to 24 protein benchmark datasets. We found that top-performing scale sets were different for each benchmark dataset and significantly outperformed scale sets used in previous studies. Noteworthy is the strong dependence of the model performance on the scale set size. AAclust enables a systematic optimization of scale-based feature engineering in machine learning applications.

Availability and implementation: The AAclust algorithm is part of AAanalysis, a Python-based framework for interpretable sequence-based protein prediction, which is documented and accessible at https://aaanalysis.readthedocs.io/en/latest and https://github.com/breimanntools/aaanalysis.
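The step of "selecting one representative scale per cluster" can be sketched as follows, assuming cluster labels have already been produced by a model such as k-means. The selection rule used here (the member with the highest Pearson correlation to the cluster mean) is an assumption for illustration and is not taken from the AAanalysis API; the scale names and values are toy data.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def select_representatives(scales, labels):
    """Pick, per cluster, the scale most correlated with the cluster mean.

    scales: dict mapping scale name -> list of values (one per amino acid).
    labels: dict mapping scale name -> cluster label.
    """
    clusters = {}
    for name, label in labels.items():
        clusters.setdefault(label, []).append(name)
    representatives = {}
    for label, names in clusters.items():
        center = [mean(values) for values in zip(*(scales[n] for n in names))]
        representatives[label] = max(names, key=lambda n: pearson(scales[n], center))
    return representatives

# Toy scales over four residues (real scales cover all 20 amino acids).
scales = {
    "scale_a": [1, 2, 3, 4],
    "scale_b": [1, 2, 3, 10],
    "scale_c": [1, 2, 3, 6],
    "scale_d": [4, 3, 2, 1],
}
labels = {"scale_a": 0, "scale_b": 0, "scale_c": 0, "scale_d": 1}
reps = select_representatives(scales, labels)
```

With k clusters this yields a redundancy-reduced set of k scales, one per group of mutually similar scales.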

Citations: 0

Exon nomenclature and classification of transcripts database (ENACTdb): a resource for analyzing alternative splicing mediated proteome diversity.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-10-29 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae157

Paras Verma, Deeksha Thakur, Shashi B Pandit

Motivation: Gene transcripts are distinguished by the composition of their exons, and this different exon composition may contribute to advancing proteome complexity. Despite the availability of alternative splicing information documented in various databases, a ready association of exonic variations to the protein sequence remains a mammoth task.

Results: To associate exonic variation(s) with the protein systematically, we designed the Exon Nomenclature and Classification of Transcripts (ENACT) framework for uniquely annotating exons that tracks their loci in gene architecture context with encapsulating variations in splice site(s) and amino acid coding status. After ENACT annotation, predicted protein features (secondary structure/disorder/Pfam domains) are mapped to exon attributes. Thus, ENACTdb provides trackable exonic variation(s) association to isoform(s) and protein features, enabling the assessment of functional variation due to changes in exon composition. Such analyses can be readily performed through multiple views supported by the server. The exon-centric visualizations of ENACT annotated isoforms could provide insights on the functional repertoire of genes due to alternative splicing and its related processes and can serve as an important resource for the research community.

Availability and implementation: The database is publicly available at https://www.iscbglab.in/enactdb/. It contains protein-coding genes and isoforms for Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus, and Homo sapiens.

Citations: 0

MicroNet-MIMRF: a microbial network inference approach based on mutual information and Markov random fields.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-10-28 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae167

Chenqionglu Feng, Huiqun Jia, Hui Wang, Jiaojiao Wang, Mengxuan Lin, Xiaoyan Hu, Chenjing Yu, Hongbin Song, Ligui Wang

Motivation: The human microbiome comprises complex associations and communication networks among microbial communities, which are crucial for maintaining health. Constructing microbial networks is vital for elucidating these associations. However, existing microbial network inference methods cannot handle zero-inflation and non-linear associations. Novel methods are therefore needed to improve the accuracy of microbial network inference.

Results: In this study, we introduce the Microbial Network based on Mutual Information and Markov Random Fields (MicroNet-MIMRF) as a novel approach for inferring microbial networks. Microbial abundance data are modeled with a zero-inflated Poisson distribution, and a discretized matrix is estimated for subsequent calculation. Markov random fields based on mutual information are then used to construct accurate microbial networks. MicroNet-MIMRF excels at estimating pairwise associations between microbes, effectively addressing zero-inflation and non-linear associations in microbial abundance data. It outperforms commonly used techniques in simulation experiments, achieving area under the curve values exceeding 0.75 for all parameters. A case study on inflammatory bowel disease data further demonstrates the method's ability to identify insightful associations. In conclusion, MicroNet-MIMRF is a powerful tool for microbial network inference that handles the biases caused by zero-inflation and overestimation of associations.

Availability and implementation: MicroNet-MIMRF is available at https://github.com/Fionabiostats/MicroNet-MIMRF.
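To make the role of mutual information in the abstract above more concrete, the following is a minimal sketch of estimating empirical mutual information between two discretized abundance vectors. It is not the authors' implementation: the function names `discretize` and `mutual_information`, the three-state coding, and the toy counts are all assumptions for illustration, and the simple median-based coding stands in for the zero-inflated Poisson step described in the abstract.

```python
from collections import Counter
from math import log


def discretize(counts):
    """Hypothetical three-state coding of raw counts (an assumption, not the ZIP model):
    0 = absent, 1 = low (at or below the upper median of non-zero counts), 2 = high."""
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        return [0] * len(counts)
    median = nonzero[len(nonzero) // 2]  # upper median when the length is even
    return [0 if c == 0 else (1 if c <= median else 2) for c in counts]


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two equal-length discrete sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written in terms of raw counts
        mi += (c / n) * log(c * n / (px[a] * py[b]))
    return mi


# Toy abundance counts for two taxa across eight samples (made-up numbers).
taxon_a = [0, 0, 5, 12, 0, 30, 7, 0]
taxon_b = [0, 0, 40, 3, 0, 9, 18, 0]
score = mutual_information(discretize(taxon_a), discretize(taxon_b))
print(f"mutual information: {score:.3f} nats")
```

Because mutual information depends only on the joint distribution of states, it can register associations that a linear correlation coefficient would miss, which is one reason it is a natural choice for edge weights in this kind of network model.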



Citations: 0

MetaboScope: a statistical toolbox for analyzing ¹H nuclear magnetic resonance spectra from human clinical studies.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-10-28 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae142

Ruey Leng Loo, Javier Osorio Mosquera, Michael Zasso, Jacqueline Mathews, Desmond G Johnston, Jeremy K Nicholson, Luc Patiny, Elaine Holmes, Julien Wist

Motivation: Metabolic phenotyping, using high-resolution spectroscopic molecular fingerprints of biological samples, has demonstrated diagnostic, prognostic, and mechanistic value in clinical studies. However, clinical translation is hindered by the lack of viable workflows and challenges in converting spectral data into usable information.

Results: MetaboScope is an analytical and statistical workflow for learning, designing, and analyzing clinically relevant ¹H nuclear magnetic resonance data. It features modular preprocessing pipelines; multivariate modeling tools, including Principal Components Analysis (PCA) and Orthogonal-Projection to Latent Structure Discriminant Analysis (OPLS-DA); and biomarker discovery tools (multiblock PCA and statistical spectroscopy). A simulation tool is also provided, allowing users to create synthetic spectra for hypothesis testing and power calculations.

Availability and implementation: MetaboScope is built as a pipeline where each module accepts the output generated by the previous one. This provides flexibility and simplicity of use, while being straightforward to maintain. The system and its libraries were developed in JavaScript and run as a web app; therefore, all the operations are performed on the local computer, circumventing the need to upload data. The MetaboScope tool is available at https://www.cheminfo.org/flavor/metabolomics/index.html. The code is open-source and can be deployed locally if necessary. Module notes, video tutorials, and clinical spectral datasets are provided for modeling.
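The pipeline design described above, in which each module accepts the output of the previous one, can be sketched as a chain of functions. This is only an illustration under stated assumptions: MetaboScope itself is written in JavaScript, and the module names below (`subtract_baseline`, `normalize_total_area`, `mean_center`), the `run_pipeline` helper, and the toy spectra are hypothetical, not part of the tool's actual API.

```python
from functools import reduce


def subtract_baseline(spectra):
    """Shift each spectrum so its minimum intensity is zero (a crude baseline stand-in)."""
    shifted = []
    for spectrum in spectra:
        floor = min(spectrum)
        shifted.append([v - floor for v in spectrum])
    return shifted


def normalize_total_area(spectra):
    """Scale each spectrum so its intensities sum to 1; all-zero spectra are left as-is."""
    scaled = []
    for spectrum in spectra:
        total = sum(spectrum)
        scaled.append([v / total for v in spectrum] if total else list(spectrum))
    return scaled


def mean_center(spectra):
    """Subtract the per-variable mean across samples, a usual step before PCA."""
    n = len(spectra)
    means = [sum(column) / n for column in zip(*spectra)]
    return [[v - m for v, m in zip(spectrum, means)] for spectrum in spectra]


def run_pipeline(data, modules):
    """Feed the output of each module into the next, in order."""
    return reduce(lambda acc, module: module(acc), modules, data)


# Two toy "spectra" (rows = samples, columns = spectral bins); made-up numbers.
raw = [[1.0, 3.0, 2.0], [2.0, 6.0, 4.0]]
processed = run_pipeline(raw, [subtract_baseline, normalize_total_area, mean_center])
print(processed)
```

Keeping every module to the same input and output shape is what makes the steps reorderable and individually replaceable, which matches the flexibility and maintainability the abstract attributes to the pipeline layout.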



Citations: 0