Bioinformatics Advances: latest articles (page 5)

Efficient genome monomer higher-order structure annotation and identification using the GRMhor algorithm.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-28 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae191

Matko Glunčić, Domjan Barić, Vladimir Paar

Motivation: Tandem monomeric units, integral components of eukaryotic genomes, form higher-order repeat (HOR) structures that play crucial roles in maintaining chromosome integrity and regulating gene expression and protein abundance. Given their significant influence on processes such as evolution, chromosome segregation, and disease, developing a sensitive and automated tool for identifying HORs across diverse genomic sequences is essential.

Results: In this study, we applied the GRMhor (Global Repeat Map hor) algorithm to analyse the centromeric region of chromosome 20 in three individual human genomes, as well as in the centromeric regions of three higher primates. In all three human genomes, we identified six distinct HOR arrays, which revealed significantly greater differences in the number of canonical and variant copies, as well as in their overall structure, than would be expected given the 99.9% genetic similarity among humans. Furthermore, our analysis of higher primate genomes, which revealed entirely different HOR sequences, indicates a much larger genomic divergence between humans and higher primates than previously recognized. These results underscore the suitability of the GRMhor algorithm for studying specificities in individual genomes, particularly those involving repetitive monomers in centromere structure, which is essential for proper chromosome segregation during cell division, while also highlighting its utility in exploring centromere evolution and other repetitive genomic regions.

Availability and implementation: Source code and example binaries are freely available for download at github.com/gluncic/GRM2023.


Citations: 0

Protomix: a Python package for ¹H-NMR metabolomics data preprocessing.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-27 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbae192

Mohammed Zniber, Youssef Fatihi, Tan-Phat Huynh

Motivation: NMR-based metabolomics is a field driven by technological advancements, necessitating the use of advanced preprocessing tools. Despite this need, there is a remarkable scarcity of comprehensive and user-friendly preprocessing tools in Python. To bridge this gap, we have developed Protomix, a Python package designed for metabolomics research. Protomix offers a set of automated, efficient, and user-friendly signal-preprocessing steps, tailored to streamline and enhance the preprocessing phase in metabolomics studies.

Results: This package presents a comprehensive preprocessing pipeline compatible with various data analysis tools. It encompasses a suite of functionalities for data extraction, preprocessing, and interactive visualization. Additionally, it includes a tutorial in the form of a Python Jupyter notebook, specifically designed for the analysis of 1D ¹H-NMR metabolomics data related to prostate cancer and benign prostatic hyperplasia.

Availability and implementation: Protomix can be accessed at https://github.com/mzniber/protomix and https://protomix.readthedocs.io/en/latest/index.html.


Citations: 0

wQFM-DISCO: DISCO-enabled wQFM improves phylogenomic analyses despite the presence of paralogs.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-27 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae189

Sheikh Azizul Hakim, Md Rownok Zahan Ratul, Md Shamsuzzoha Bayzid

Motivation: Gene trees often differ from the species trees that contain them due to various factors, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Several highly accurate species tree estimation methods have been introduced to explicitly address ILS, including ASTRAL, a widely used statistically consistent method, and wQFM, a quartet amalgamation approach experimentally shown to be more accurate than ASTRAL. Two recent advancements, ASTRAL-Pro and DISCO, have emerged in phylogenomics to consider GDL. ASTRAL-Pro introduces a refined quartet similarity measure, accounting for orthology and paralogy. On the other hand, DISCO offers a general strategy to decompose multi-copy gene trees into a collection of single-copy trees, allowing the utilization of methods previously designed for species tree inference in the context of single-copy gene trees.

Results: In this study, we first introduce some variants of DISCO to examine its underlying hypotheses and present analytical results on the statistical guarantees of DISCO. In particular, we introduce DISCO-R, a variant of DISCO with a refined and improved pruning strategy that provides more accurate and robust results. We then demonstrate with extensive evaluation studies on a collection of simulated and real data sets that wQFM paired with DISCO variants consistently matches or outperforms ASTRAL-Pro and other competing methods.

Availability and implementation: DISCO-R and other variants are freely available at https://github.com/skhakim/DISCO-variants.


Citations: 0

Keeping it in the family: using protein family templates to rescue low confidence AlphaFold2 models.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-25 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae188

Francesco Costa, Matthias Blum, Alex Bateman

Motivation: High-confidence structure prediction models have become available for nearly all protein sequences. More than 200 million AlphaFold2 models are now publicly available. We observe that there can be significant variability in prediction confidence, as judged by pLDDT scores, across a protein family. We have explored whether the predictions with lower pLDDT in a family can be improved by using higher-pLDDT models from the same family as template structures in AlphaFold2.

Results: Our work shows that about one-third of the time, structures with a low pLDDT can be "rescued", that is, moved from low to reasonable confidence. We also find that, surprisingly, in many cases we get a higher-pLDDT model when we switch off the multiple sequence alignment (MSA) option in AlphaFold2 and rely solely on a high-quality template. However, we find the best overall strategy is to make predictions both with and without the MSA information and select the model with the highest average pLDDT. We also find that using high-pLDDT models as templates can increase the speed of AlphaFold2 as implemented in ColabFold. Additionally, we try to demonstrate that, as well as having increased overall pLDDT, the models are likely to have higher-quality structures as judged by two metrics.

Availability and implementation: We have implemented our pipeline in Nextflow and it is available on GitHub: https://github.com/FranceCosta/AF2Fix.
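The selection step described in the Results (predict with and without MSA information, then keep the model with the highest average pLDDT) can be sketched as follows. This is a minimal illustration and not the AF2Fix pipeline itself: the function name `pick_best_model`, the candidate labels, and the per-residue scores are all hypothetical.

```python
from statistics import mean

def pick_best_model(candidates):
    """Return (name, mean pLDDT) for the candidate with the highest mean per-residue pLDDT.

    `candidates` maps a hypothetical run label to a list of per-residue pLDDT values (0-100).
    """
    scored = {name: mean(plddt) for name, plddt in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Made-up per-residue pLDDT values for one target, predicted twice with the
# same family template: once with the MSA and once without it.
candidates = {
    "template_with_msa": [62.0, 70.5, 68.0, 74.5],
    "template_no_msa": [81.0, 77.5, 85.0, 79.5],
}

best_name, best_score = pick_best_model(candidates)
print(best_name, best_score)
```

In practice the per-residue scores would be read from the prediction outputs rather than typed in by hand; how those are stored depends on the tool used and is not specified in the abstract.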


Citations: 0

Dissecting AlphaFold2's capabilities with limited sequence information.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-25 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbae187

Jannik Adrian Gut, Thomas Lemmin

Summary: Protein structure prediction aims to infer a protein's three-dimensional (3D) structure from its amino acid sequence. Protein structure is pivotal for elucidating protein functions and interactions and for driving biotechnological innovation. The deep learning model AlphaFold2 has revolutionized this field by leveraging phylogenetic information from multiple sequence alignments (MSAs) to achieve remarkable accuracy in protein structure prediction. However, a key question remains: how well does AlphaFold2 understand protein structures? This study investigates AlphaFold2's capabilities when relying primarily on high-quality template structures, without the additional information provided by MSAs. By designing experiments that probe local and global structural understanding, we aimed to dissect its dependence on specific features and its ability to handle missing information. Our findings revealed AlphaFold2's reliance on sterically valid Cβ atoms for correctly interpreting structural templates. Additionally, we observed its remarkable ability to recover 3D structures from certain perturbations and the negligible impact of the previous structure in recycling. Collectively, these results support the hypothesis that AlphaFold2 has learned an accurate biophysical energy function. However, this function seems most effective for local interactions. Our work advances understanding of how deep learning models predict protein structures and provides guidance for researchers aiming to overcome limitations in these models.

Availability and implementation: Data and implementation are available at https://github.com/ibmm-unibe-ch/template-analysis.


Citations: 0

Predicting the genetic component of gene expression using gene regulatory networks.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-23 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae180

Gutama Ibrahim Mohammad, Tom Michoel

Motivation: Gene expression prediction plays a vital role in transcriptome-wide association studies. Traditional models rely on genetic variants in close genomic proximity to the gene of interest to predict the genetic component of gene expression. Here, we propose a novel approach incorporating distal genetic variants acting through gene regulatory networks, in line with the omnigenic model of complex traits.

Results: Using causal and coexpression Bayesian networks reconstructed from genomic and transcriptomic data, inference of gene expression from genotypic data is achieved through a two-step process. Initially, the expression level of each gene is predicted using its local genetic variants. The residual differences between the observed and predicted expression levels are then modeled using the genotype information of parent and/or grandparent nodes in the network. The final predicted expression level is obtained by summing the predictions from both models, effectively incorporating both local and distal genetic influences. Using regularized regression techniques for parameter estimation, we found that gene regulatory network-based gene expression prediction outperformed the traditional approach on simulated data and real data from yeast and humans. This study provides important insights into the challenge of gene expression prediction for transcriptome-wide association studies.

Availability and implementation: The code is available on GitHub at github.com/guutama/GRN-TI.
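The two-step process described in the Results can be illustrated with a deliberately reduced sketch: one local variant, one distal variant (standing in for a variant of a parent gene in the network), and a one-dimensional ridge estimate in each step. Everything here is an assumption made for illustration, including the function names, the penalty `lam`, and the toy genotype and expression values; the published method uses reconstructed Bayesian networks and regularized regression over many variants.

```python
def centre(values):
    """Subtract the mean so that no intercept term is needed."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def ridge_1d(x, y, lam=1.0):
    """Ridge estimate of b in y ~ b * x, for centred x and y."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

def mse(y, pred):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

def two_step_predict(local, distal, expr, lam=1.0):
    """Return (local-only prediction, local + distal prediction), both centred."""
    g_local, g_distal, y = centre(local), centre(distal), centre(expr)
    # Step 1: predict expression from the gene's local variant.
    b_local = ridge_1d(g_local, y, lam)
    pred_local = [b_local * g for g in g_local]
    # Step 2: model what step 1 left unexplained using the distal variant.
    residual = [yi - pi for yi, pi in zip(y, pred_local)]
    b_distal = ridge_1d(g_distal, residual, lam)
    # Final prediction: sum of the two models.
    pred_final = [p + b_distal * g for p, g in zip(pred_local, g_distal)]
    return pred_local, pred_final

# Hypothetical toy data: genotypes coded 0/1/2 for eight samples.
local_variant = [0, 1, 2, 0, 1, 2, 0, 1]
parent_variant = [0, 0, 1, 1, 2, 2, 1, 0]
expression = [0.1, 0.9, 2.6, 0.4, 2.1, 3.0, 0.6, 1.0]

pred_local, pred_final = two_step_predict(local_variant, parent_variant, expression)
y_centred = centre(expression)
mse_local = mse(y_centred, pred_local)
mse_two_step = mse(y_centred, pred_final)
print(f"local only: {mse_local:.3f}, local + distal: {mse_two_step:.3f}")
```

The errors printed here are in-sample, so they only show the mechanics of summing the two models; assessing whether the distal step helps on new samples would require held-out data.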


Citations: 0

ICoN: integration using co-attention across biological networks.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbae182

Nure Tasnina, T M Murali

Motivation: Molecular interaction networks are powerful tools for studying cellular functions. Integrating diverse types of networks enhances performance in downstream tasks such as gene module detection and protein function prediction. The challenge lies in extracting meaningful protein feature representations due to varying levels of sparsity and noise across these heterogeneous networks.

Results: We propose ICoN, a novel unsupervised graph neural network model that takes multiple protein-protein association networks as inputs and generates a feature representation for each protein that integrates the topological information from all the networks. A key contribution of ICoN is exploiting a mechanism called "co-attention" that enables cross-network communication during training. The model also incorporates a denoising training technique, introducing perturbations to each input network and training the model to reconstruct the original network from its corrupted version. Our experimental results demonstrate that ICoN surpasses individual networks across three downstream tasks: gene module detection, gene coannotation prediction, and protein function prediction. Compared to existing unsupervised network integration models, ICoN exhibits superior performance across the majority of downstream tasks and shows enhanced robustness against noise. This work introduces a promising approach for effectively integrating diverse protein-protein association networks, aiming to achieve a biologically meaningful representation of proteins.

Availability and implementation: The ICoN software is available under the GNU Public License v3 at https://github.com/Murali-group/ICoN.
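The denoising idea mentioned in the Results (perturb each input network, then train the model to reconstruct the original) needs a corrupted copy of every network. The sketch below shows one possible corruption scheme, replacing a fraction of edges with random non-edges. The abstract does not say which perturbation ICoN uses, so the scheme, the function name `corrupt_edges`, and the `noise_rate` parameter are assumptions for illustration only.

```python
import random

def corrupt_edges(edges, nodes, noise_rate=0.2, seed=0):
    """Return a copy of an undirected edge set with a fraction of edges swapped out.

    A share `noise_rate` of the original edges is dropped and the same number
    of node pairs that were not connected is added in their place.
    """
    rng = random.Random(seed)
    nodes = sorted(nodes)
    original = {tuple(sorted(e)) for e in edges}
    n_swap = int(round(noise_rate * len(original)))
    kept = set(rng.sample(sorted(original), len(original) - n_swap))
    non_edges = [
        (u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if (u, v) not in original
    ]
    added = set(rng.sample(non_edges, min(n_swap, len(non_edges))))
    return kept | added

# Hypothetical five-protein association network.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "E")]

original_edges = {tuple(sorted(e)) for e in edges}
corrupted = corrupt_edges(edges, nodes, noise_rate=0.2, seed=0)

# A denoising model would take `corrupted` as input and be trained to
# reproduce `original_edges` as its target.
print(sorted(corrupted))
```

Fixing the seed keeps the example reproducible; during training one would normally draw a fresh corruption for each pass over the data.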


Citations: 0

The combined focal loss and dice loss function improves the segmentation of beta-sheets in medium-resolution cryo-electron-microscopy density maps.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-22 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae169

Yongcheng Mu, Thu Nguyen, Bryan Hawickhorst, Willy Wriggers, Jiangwen Sun, Jing He

Summary: Although multiple neural networks have been proposed for detecting secondary structures from medium-resolution (5-10 Å) cryo-electron microscopy (cryo-EM) maps, the loss functions used in existing deep learning networks are primarily based on cross-entropy loss, which is known to be sensitive to class imbalance. We investigated five loss functions: cross-entropy, Focal loss, Dice loss, and two combined loss functions. Using a U-Net architecture in our DeepSSETracer method and a dataset of 1355 box-cropped atomic-structure/density-map pairs, we found that a newly designed loss function that combines Focal loss and Dice loss provides the best overall detection accuracy for secondary structures. For β-sheet voxels, which are generally much harder to detect than helix voxels, the combined loss function achieved a significant improvement (an 8.8% increase in the F₁ score) over the cross-entropy loss function and a noticeable improvement over the Dice loss function. This study demonstrates the potential for designing more effective loss functions for hard cases in the segmentation of secondary structures. The newly trained model was incorporated into DeepSSETracer 1.1 for the segmentation of protein secondary structures in medium-resolution cryo-EM map components. DeepSSETracer can be integrated into ChimeraX, a popular molecular visualization program.

Availability and implementation: https://www.cs.odu.edu/~bioinfo/B2I_Tools/.
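
The idea of combining Focal loss and Dice loss can be illustrated with a minimal sketch for binary voxel labels. This is not the DeepSSETracer implementation: the abstract does not state how the two terms are combined, so the weighted sum, the `weight` parameter, and the default `gamma`, `alpha`, and `smooth` values below are assumptions chosen for illustration only.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over voxels.

    probs   -- predicted probabilities of the positive class (e.g. beta-sheet)
    targets -- ground-truth labels, 1 for positive voxels and 0 otherwise
    gamma, alpha -- illustrative defaults, not values taken from the paper
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_t = p if t == 1 else 1.0 - p           # probability of the true class
        a_t = alpha if t == 1 else 1.0 - alpha   # class-balancing factor
        # (1 - p_t)^gamma down-weights voxels that are already well classified
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum of sizes + smooth)."""
    overlap = sum(p * t for p, t in zip(probs, targets))
    sizes = sum(probs) + sum(targets)
    return 1.0 - (2.0 * overlap + smooth) / (sizes + smooth)


def combined_loss(probs, targets, weight=0.5):
    """Hypothetical combination: a weighted sum of the two losses."""
    return weight * focal_loss(probs, targets) + (1.0 - weight) * dice_loss(probs, targets)


if __name__ == "__main__":
    labels = [1, 0, 0, 0, 0, 0, 0, 1]            # imbalanced: few positive voxels
    good = [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9]
    bad = [0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1]
    print(combined_loss(good, labels), combined_loss(bad, labels))
```

The focal term concentrates the gradient on hard voxels, while the Dice term scores overlap for the positive class directly, which is one reason such combinations are often tried on imbalanced segmentation problems.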



Citations: 0

blast2galaxy: a CLI and Python API for BLAST+ and DIAMOND searches on Galaxy servers.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-22 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae185

Patrick König, Anne Fiebig, Thomas Münch, Björn Grüning, Uwe Scholz

Motivation: The Galaxy workflow system is an open-source platform supporting data-intensive research in the life sciences, featuring a user-friendly web interface for complex analyses without extensive programming. It also offers a REST (representational state transfer) API, enabling remote execution of specific tools. Galaxy supports similarity searches for nucleotide and amino acid sequences, with integrated tools like NCBI BLAST+ and DIAMOND. However, no specialized software currently exists for convenient use of NCBI BLAST+ and DIAMOND via the Galaxy API.

Results: blast2galaxy is a Python package that uses the Galaxy API to run sequence alignments with NCBI BLAST+ and DIAMOND as Galaxy-wrapped tools on compatible servers. It includes a command-line interface that mirrors the CLI of BLAST+ and DIAMOND and a high-level Python API for direct alignments from Python applications. The package relies on bioblend for communication with the Galaxy API.

Availability and implementation: blast2galaxy is available as open-source software under the MIT license. The source code is available on GitHub: https://github.com/IPK-BIT/blast2galaxy. It can be installed from the Python Package Index using "pip install blast2galaxy" or from the Bioconda channel using "conda install -c bioconda blast2galaxy". Docker and Apptainer images are available and referenced in the documentation, which is available at https://blast2galaxy.readthedocs.io.
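
To illustrate what a command-line interface that mirrors BLAST+ could involve, the sketch below parses a few BLAST+-style options (`-query`, `-db`, `-out`, `-outfmt`, `-evalue`) and turns them into a parameter dictionary of the kind that might be passed to a Galaxy-wrapped tool. It does not use or reproduce the blast2galaxy API: the function names `build_parser` and `to_tool_inputs` and the dictionary keys are hypothetical, and only the Python standard library is used.

```python
import argparse


def build_parser():
    """Parser accepting a small subset of BLAST+-style single-dash options."""
    parser = argparse.ArgumentParser(prog="blastn-sketch", allow_abbrev=False)
    parser.add_argument("-query", required=True, help="FASTA file with query sequences")
    parser.add_argument("-db", required=True, help="name of the target database")
    parser.add_argument("-out", default="-", help="output file ('-' for stdout)")
    parser.add_argument("-outfmt", default="0", help="alignment view option")
    parser.add_argument("-evalue", type=float, default=10.0, help="expectation value threshold")
    return parser


def to_tool_inputs(argv):
    """Map parsed options to a hypothetical tool-input dictionary."""
    args = build_parser().parse_args(argv)
    return {
        "query": args.query,      # hypothetical key names, for illustration only
        "database": args.db,
        "evalue": args.evalue,
        "outfmt": args.outfmt,
        "output": args.out,
    }


if __name__ == "__main__":
    print(to_tool_inputs(["-query", "q.fasta", "-db", "nt", "-evalue", "1e-5"]))
```

In a real client, a dictionary like this would be translated into the input schema of the specific Galaxy tool and submitted through the Galaxy API; those steps are omitted here because they depend on the server and tool version.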



Citations: 0

LncLSTA: a versatile predictor unveiling subcellular localization of lncRNAs through long-short term attention.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2024-11-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbae173

Kai Wang, Yueming Hu, Sida Li, Ming Chen, Zhong Li

Motivation: Much evidence suggests that the subcellular localization of long noncoding RNAs (lncRNAs) provides key insights into their biological function.

Results: This study proposes a novel deep learning framework, LncLSTA, designed for predicting the subcellular localization of lncRNAs. It first takes the lncRNA sequence, electron-ion interaction pseudopotentials (EIIP), and nucleotide chemical properties as feature inputs. Departing from conventional k-mer approaches, the model uses a set of 1D convolution and max-pooling operations for dynamic feature aggregation. Furthermore, LncLSTA integrates a long-short term attention module with a bidirectional long short-term memory (BiLSTM) network to comprehensively extract sequence information. In addition, it incorporates a TextCNN module to enhance accuracy and robustness in subcellular localization tasks. Experimental results demonstrate the efficacy of LncLSTA, which outperforms other state-of-the-art methods. Notably, LncLSTA exhibits transfer learning capability, extending its utility to predicting the subcellular localization of mRNAs while maintaining consistently satisfactory results. This research contributes valuable insights into understanding the biological functions of lncRNAs through subcellular localization, emphasizing the potential of deep learning approaches in advancing RNA-related studies.

Availability and implementation: The source code is publicly available at https://bis.zju.edu.cn/LncLSTA.
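
The three feature inputs named in the abstract (sequence, electron-ion interaction pseudopotential, and nucleotide chemical property) can be sketched as a per-nucleotide encoding followed by a max-pooling step. This is an illustrative sketch, not the LncLSTA code: the EIIP numbers are the commonly tabulated values, the three-bit chemical-property code follows a widely used convention, and concatenating the features into one 8-value row per base is an assumption. The function names `encode` and `max_pool_1d` are hypothetical.

```python
# Commonly tabulated EIIP values (U is given the value usually listed for T).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}

# Nucleotide chemical property code: (purine ring, amino group, weak hydrogen bond).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}

BASES = "ACGU"


def encode(seq):
    """Encode a sequence as rows of [one-hot (4), NCP (3), EIIP (1)].

    DNA-style T is treated as U; unknown symbols become all-zero rows.
    """
    rows = []
    for base in seq.upper().replace("T", "U"):
        one_hot = [1.0 if base == b else 0.0 for b in BASES]
        ncp = [float(x) for x in NCP.get(base, (0, 0, 0))]
        rows.append(one_hot + ncp + [EIIP.get(base, 0.0)])
    return rows


def max_pool_1d(rows, window=2):
    """Non-overlapping max pooling along the sequence axis, per feature column."""
    pooled = []
    for start in range(0, len(rows) - window + 1, window):
        block = rows[start:start + window]
        pooled.append([max(column) for column in zip(*block)])
    return pooled


if __name__ == "__main__":
    features = encode("ACGU")
    print(len(features), len(features[0]))
    print(max_pool_1d(features, window=2))
```

In the model described above, learned 1D convolutions would precede the pooling step; they are left out here so that the example needs only the standard library.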



Citations: 0