Pub Date: 2024-11-28 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae191
Matko Glunčić, Domjan Barić, Vladimir Paar
Motivation: Tandem monomeric units, integral components of eukaryotic genomes, form higher-order repeat (HOR) structures that play crucial roles in maintaining chromosome integrity and regulating gene expression and protein abundance. Given their significant influence on processes such as evolution, chromosome segregation, and disease, developing a sensitive and automated tool for identifying HORs across diverse genomic sequences is essential.
Results: In this study, we applied the GRMhor (Global Repeat Map hor) algorithm to analyse the centromeric region of chromosome 20 in three individual human genomes, as well as in the centromeric regions of three higher primates. In all three human genomes, we identified six distinct HOR arrays, which revealed significantly greater differences in the number of canonical and variant copies, as well as in their overall structure, than would be expected given the 99.9% genetic similarity among humans. Furthermore, our analysis of higher primate genomes, which revealed entirely different HOR sequences, indicates a much larger genomic divergence between humans and higher primates than previously recognized. These results underscore the suitability of the GRMhor algorithm for studying specificities in individual genomes, particularly those involving repetitive monomers in centromere structure, which is essential for proper chromosome segregation during cell division, while also highlighting its utility in exploring centromere evolution and other repetitive genomic regions.
Availability and implementation: Source code and example binaries are freely available for download at github.com/gluncic/GRM2023.
Title: Efficient genome monomer higher-order structure annotation and identification using the GRMhor algorithm. Bioinformatics Advances 4(1), vbae191. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11630843/pdf/
Pub Date: 2024-11-27 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbae192
Mohammed Zniber, Youssef Fatihi, Tan-Phat Huynh
Motivation: NMR-based metabolomics is a field driven by technological advancements, necessitating the use of advanced preprocessing tools. Despite this need, comprehensive and user-friendly preprocessing tools in Python remain scarce. To bridge this gap, we have developed Protomix, a Python package designed for metabolomics research. Protomix offers a set of automated, efficient, and user-friendly signal-preprocessing steps, tailored to streamline the preprocessing phase in metabolomics studies.
Results: This package presents a comprehensive preprocessing pipeline compatible with various data analysis tools. It encompasses a suite of functionalities for data extraction, preprocessing, and interactive visualization. Additionally, it includes a tutorial in the form of a Python Jupyter notebook, specifically designed for the analysis of 1D ¹H-NMR metabolomics data related to prostate cancer and benign prostatic hyperplasia.
Availability and implementation: Protomix can be accessed at https://github.com/mzniber/protomix and https://protomix.readthedocs.io/en/latest/index.html.
Title: Protomix: a Python package for ¹H-NMR metabolomics data preprocessing. Bioinformatics Advances 5(1), vbae192. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671038/pdf/
Motivation: Gene trees often differ from the species trees that contain them due to various factors, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Several highly accurate species tree estimation methods have been introduced to explicitly address ILS, including ASTRAL, a widely used statistically consistent method, and wQFM, a quartet amalgamation approach experimentally shown to be more accurate than ASTRAL. Two recent advancements, ASTRAL-Pro and DISCO, have emerged in phylogenomics to consider GDL. ASTRAL-Pro introduces a refined quartet similarity measure, accounting for orthology and paralogy. On the other hand, DISCO offers a general strategy to decompose multi-copy gene trees into a collection of single-copy trees, allowing the utilization of methods previously designed for species tree inference in the context of single-copy gene trees.
Results: In this study, we first introduce some variants of DISCO to examine its underlying hypotheses and present analytical results on the statistical guarantees of DISCO. In particular, we introduce DISCO-R, a variant of DISCO with a refined and improved pruning strategy that provides more accurate and robust results. We then demonstrate with extensive evaluation studies on a collection of simulated and real data sets that wQFM paired with DISCO variants consistently matches or outperforms ASTRAL-Pro and other competing methods.
Availability and implementation: DISCO-R and other variants are freely available at https://github.com/skhakim/DISCO-variants.
Title: wQFM-DISCO: DISCO-enabled wQFM improves phylogenomic analyses despite the presence of paralogs. Bioinformatics Advances 4(1), vbae189. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11634537/pdf/
Authors: Sheikh Azizul Hakim, Md Rownok Zahan Ratul, Md Shamsuzzoha Bayzid
Pub Date: 2024-11-27 | DOI: 10.1093/bioadv/vbae189
Pub Date: 2024-11-25 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae188
Francesco Costa, Matthias Blum, Alex Bateman
Motivation: High confidence structure prediction models have become available for nearly all protein sequences. More than 200 million AlphaFold2 models are now publicly available. We observe that there can be significant variability in the prediction confidence as judged by plDDT scores across a protein family. We have explored whether the predictions with lower plDDT in a family can be improved by the use of higher plDDT templates from the family as template structures in AlphaFold2.
Results: Our work shows that about one-third of the time, structures with a low plDDT can be "rescued", that is, moved from low to reasonable confidence. Surprisingly, in many cases we obtain a higher plDDT model when we switch off the multiple sequence alignment (MSA) option in AlphaFold2 and rely solely on a high-quality template. However, we find that the best overall strategy is to make predictions both with and without the MSA information and select the model with the highest average plDDT. We also find that using high plDDT models as templates can increase the speed of AlphaFold2 as implemented in ColabFold. Additionally, we try to demonstrate that, as well as having increased overall plDDT, the models are likely to have higher-quality structures as judged by two metrics.
Availability and implementation: We have implemented our pipeline in Nextflow and it is available on GitHub: https://github.com/FranceCosta/AF2Fix.
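The selection step described in the Results (predict with and without the MSA, then keep the model with the highest average plDDT) can be sketched as follows. This is a minimal illustration and not the AF2Fix pipeline itself: it assumes the models are PDB-format text in which the per-residue plDDT is stored in the B-factor column, as AlphaFold2 output does, and the function names `mean_plddt` and `pick_best` are hypothetical.

```python
from statistics import mean


def mean_plddt(pdb_text: str) -> float:
    """Average plDDT over residues, read from the B-factor column of CA atoms."""
    scores = [
        float(line[60:66])  # PDB columns 61-66 hold the B-factor (plDDT here)
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not scores:
        raise ValueError("no CA atoms found")
    return mean(scores)


def pick_best(models: dict[str, str]) -> str:
    """Return the name of the candidate model with the highest mean plDDT.

    `models` maps a hypothetical label (e.g. "with_msa", "template_only")
    to the PDB text of that prediction.
    """
    return max(models, key=lambda name: mean_plddt(models[name]))
```

Reading one score per residue from the CA atom avoids weighting large residues more heavily than small ones when averaging.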
Title: Keeping it in the family: using protein family templates to rescue low confidence AlphaFold2 models. Bioinformatics Advances 4(1), vbae188. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11630841/pdf/
Pub Date: 2024-11-25 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbae187
Jannik Adrian Gut, Thomas Lemmin
Summary: Protein structure prediction aims to infer a protein's three-dimensional (3D) structure from its amino acid sequence. Protein structure is pivotal for elucidating protein functions and interactions, and for driving biotechnological innovation. The deep learning model AlphaFold2 has revolutionized this field by leveraging phylogenetic information from multiple sequence alignments (MSAs) to achieve remarkable accuracy in protein structure prediction. However, a key question remains: how well does AlphaFold2 understand protein structures? This study investigates AlphaFold2's capabilities when relying primarily on high-quality template structures, without the additional information provided by MSAs. By designing experiments that probe local and global structural understanding, we aimed to dissect its dependence on specific features and its ability to handle missing information. Our findings revealed AlphaFold2's reliance on sterically valid Cβ for correctly interpreting structural templates. Additionally, we observed its remarkable ability to recover 3D structures from certain perturbations and the negligible impact of the previous structure in recycling. Collectively, these results support the hypothesis that AlphaFold2 has learned an accurate biophysical energy function. However, this function seems most effective for local interactions. Our work advances understanding of how deep learning models predict protein structures and provides guidance for researchers aiming to overcome limitations in these models.
Availability and implementation: Data and implementation are available at https://github.com/ibmm-unibe-ch/template-analysis.
Title: Dissecting AlphaFold2's capabilities with limited sequence information. Bioinformatics Advances 5(1), vbae187. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11751578/pdf/
Pub Date: 2024-11-23 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae180
Gutama Ibrahim Mohammad, Tom Michoel
Motivation: Gene expression prediction plays a vital role in transcriptome-wide association studies. Traditional models rely on genetic variants in close genomic proximity to the gene of interest to predict the genetic component of gene expression. Here, we propose a novel approach incorporating distal genetic variants acting through gene regulatory networks, in line with the omnigenic model of complex traits.
Results: Using causal and coexpression Bayesian networks reconstructed from genomic and transcriptomic data, inference of gene expression from genotypic data is achieved through a two-step process. Initially, the expression level of each gene is predicted using its local genetic variants. The residual differences between the observed and predicted expression levels are then modeled using the genotype information of parent and/or grandparent nodes in the network. The final predicted expression level is obtained by summing the predictions from both models, effectively incorporating both local and distal genetic influences. Using regularized regression techniques for parameter estimation, we found that gene regulatory network-based gene expression prediction outperformed the traditional approach on simulated data and real data from yeast and humans. This study provides important insights into the challenge of gene expression prediction for transcriptome-wide association studies.
Availability and implementation: The code is available on GitHub at github.com/guutama/GRN-TI.
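The two-step process in the Results can be illustrated with a small sketch: fit a regularized model on local variants, fit a second model on the residuals using the genotypes associated with parent and/or grandparent nodes, and sum the two predictions. This is a hedged illustration rather than the GRN-TI code: ridge regression stands in for the unspecified "regularized regression techniques", the data are synthetic, and every name (`ridge_fit`, `fit_two_step`, `G_local`, `G_distal`) is hypothetical.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge weights (no intercept; inputs assumed centered)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def fit_two_step(G_local, G_distal, expr, alpha=1.0):
    """Step 1: local variants -> expression. Step 2: distal variants -> residual."""
    w_local = ridge_fit(G_local, expr, alpha)
    residual = expr - G_local @ w_local
    w_distal = ridge_fit(G_distal, residual, alpha)
    return w_local, w_distal


def predict_two_step(G_local, G_distal, w_local, w_distal):
    """Final prediction is the sum of the local and distal model outputs."""
    return G_local @ w_local + G_distal @ w_distal


# Synthetic example (assumed data): 200 samples, 5 local and 8 distal variants.
rng = np.random.default_rng(0)
G_local = rng.standard_normal((200, 5))
G_distal = rng.standard_normal((200, 8))
expr = (G_local @ rng.standard_normal(5)
        + G_distal @ rng.standard_normal(8)
        + 0.1 * rng.standard_normal(200))

w_local, w_distal = fit_two_step(G_local, G_distal, expr)
mse_local = float(np.mean((expr - G_local @ w_local) ** 2))
mse_two_step = float(np.mean(
    (expr - predict_two_step(G_local, G_distal, w_local, w_distal)) ** 2))
```

On training data the second step cannot increase the squared error, because zero weights are always a feasible ridge solution for the residual model; whether the gain carries over to held-out samples has to be checked by cross-validation.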
Title: Predicting the genetic component of gene expression using gene regulatory networks. Bioinformatics Advances 4(1), vbae180. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11665636/pdf/
Pub Date: 2024-11-22 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbae182
Nure Tasnina, T M Murali
Motivation: Molecular interaction networks are powerful tools for studying cellular functions. Integrating diverse types of networks enhances performance in downstream tasks such as gene module detection and protein function prediction. The challenge lies in extracting meaningful protein feature representations due to varying levels of sparsity and noise across these heterogeneous networks.
Results: We propose ICoN, a novel unsupervised graph neural network model that takes multiple protein-protein association networks as inputs and generates a feature representation for each protein that integrates the topological information from all the networks. A key contribution of ICoN is exploiting a mechanism called "co-attention" that enables cross-network communication during training. The model also incorporates a denoising training technique, introducing perturbations to each input network and training the model to reconstruct the original network from its corrupted version. Our experimental results demonstrate that ICoN surpasses individual networks across three downstream tasks: gene module detection, gene coannotation prediction, and protein function prediction. Compared to existing unsupervised network integration models, ICoN exhibits superior performance across the majority of downstream tasks and shows enhanced robustness against noise. This work introduces a promising approach for effectively integrating diverse protein-protein association networks, aiming to achieve a biologically meaningful representation of proteins.
Availability and implementation: The ICoN software is available under the GNU Public License v3 at https://github.com/Murali-group/ICoN.
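The denoising idea in the Results (perturb each input network, then train the model to reconstruct the original) depends on a corruption step. The abstract does not say which perturbation ICoN uses, so the sketch below assumes a generic scheme for an unweighted, undirected network: randomly drop existing edges and add a few spurious ones. The function name `corrupt_network` and the probabilities are hypothetical.

```python
import numpy as np


def corrupt_network(adj, drop_prob=0.2, add_prob=0.01, seed=0):
    """Return a noisy copy of a symmetric 0/1 adjacency matrix.

    Existing edges are removed with probability `drop_prob`; absent edges
    are inserted with probability `add_prob`. Only the upper triangle is
    sampled so the result stays symmetric with an empty diagonal.
    """
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)
    drop = (rng.random((n, n)) < drop_prob) & upper & (adj > 0)
    add = (rng.random((n, n)) < add_prob) & upper & (adj == 0)
    noisy = np.triu(adj, k=1).astype(float)
    noisy[drop] = 0.0
    noisy[add] = 1.0
    return noisy + noisy.T


# Toy input: a ring of six proteins.
ring = np.zeros((6, 6))
for i in range(6):
    ring[i, (i + 1) % 6] = ring[(i + 1) % 6, i] = 1.0
noisy_ring = corrupt_network(ring, drop_prob=0.3, add_prob=0.1, seed=1)
```

A reconstruction loss would then compare the model's output for `noisy_ring` against the clean `ring`, which is the usual denoising setup.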
Title: ICoN: integration using co-attention across biological networks. Bioinformatics Advances 5(1), vbae182. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11723530/pdf/
Summary: Although multiple neural networks have been proposed for detecting secondary structures from medium-resolution (5-10 Å) cryo-electron microscopy (cryo-EM) maps, the loss functions used in the existing deep learning networks are primarily based on cross-entropy loss, which is known to be sensitive to class imbalances. We investigated five loss functions: cross-entropy, Focal loss, Dice loss, and two combined loss functions. Using a U-Net architecture in our DeepSSETracer method and a dataset composed of 1355 box-cropped atomic-structure/density-map pairs, we found that a newly designed loss function that combines Focal loss and Dice loss provides the best overall detection accuracy for secondary structures. For β-sheet voxels, which are generally much harder to detect than helix voxels, the combined loss function achieved a significant improvement (an 8.8% increase in the F1 score) compared to the cross-entropy loss function and a noticeable improvement from the Dice loss function. This study demonstrates the potential for designing more effective loss functions for hard cases in the segmentation of secondary structures. The newly trained model was incorporated into DeepSSETracer 1.1 for the segmentation of protein secondary structures in medium-resolution cryo-EM map components. DeepSSETracer can be integrated into ChimeraX, a popular molecular visualization software.
Availability and implementation: https://www.cs.odu.edu/~bioinfo/B2I_Tools/.
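A combined Focal and Dice loss of the kind described in the Summary can be sketched in a few lines. This is an illustration under stated assumptions, not the DeepSSETracer implementation: the abstract does not give the weighting between the two terms or the focusing parameter, so `weight=0.5` and `gamma=2.0` are placeholders, and the functions operate on NumPy arrays of per-voxel class probabilities rather than on a training framework's tensors.

```python
import numpy as np


def focal_loss(probs, onehot, gamma=2.0, eps=1e-7):
    """Multi-class focal loss; down-weights voxels that are already well classified."""
    p_true = np.clip((probs * onehot).sum(axis=-1), eps, 1.0)
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))


def dice_loss(probs, onehot, eps=1e-7):
    """One minus the soft Dice coefficient, averaged over classes."""
    axes = tuple(range(probs.ndim - 1))  # sum over every axis except the class axis
    intersection = (probs * onehot).sum(axis=axes)
    denom = probs.sum(axis=axes) + onehot.sum(axis=axes)
    dice = (2.0 * intersection + eps) / (denom + eps)
    return float(1.0 - dice.mean())


def combined_loss(probs, onehot, weight=0.5, gamma=2.0):
    """Weighted sum of focal and Dice loss (the weighting is an assumption)."""
    return (weight * focal_loss(probs, onehot, gamma)
            + (1.0 - weight) * dice_loss(probs, onehot))


# Toy example: four voxels, three classes (background, helix, sheet).
labels = np.array([0, 1, 2, 2])
onehot = np.eye(3)[labels]
uniform = np.full((4, 3), 1.0 / 3.0)
loss_perfect = combined_loss(onehot, onehot)
loss_uniform = combined_loss(uniform, onehot)
```

The Dice term is computed per class before averaging, so a rare class such as β-sheet voxels contributes as much as an abundant one, which is the property that makes it attractive under class imbalance.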
Title: The combined focal loss and dice loss function improves the segmentation of beta-sheets in medium-resolution cryo-electron-microscopy density maps. Bioinformatics Advances 4(1), vbae169. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11590252/pdf/
Authors: Yongcheng Mu, Thu Nguyen, Bryan Hawickhorst, Willy Wriggers, Jiangwen Sun, Jing He
Pub Date: 2024-11-22 | DOI: 10.1093/bioadv/vbae169
Pub Date: 2024-11-22; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae185
Patrick König, Anne Fiebig, Thomas Münch, Björn Grüning, Uwe Scholz
Motivation: The Galaxy workflow system is an open-source platform that supports data-intensive research in the life sciences and provides a user-friendly web interface for running complex analyses without extensive programming. It also offers a REST (representational state transfer) API that enables remote execution of specific tools. Galaxy supports similarity searches for nucleotide and amino acid sequences through integrated tools such as NCBI BLAST+ and DIAMOND. However, no specialized software currently exists for convenient use of NCBI BLAST+ and DIAMOND via the Galaxy API.
Results: blast2galaxy is a Python package that uses the Galaxy API to run sequence alignments with NCBI BLAST+ and DIAMOND as Galaxy-wrapped tools on compatible servers. It includes a command-line interface that mirrors the CLI of BLAST+ and DIAMOND and a high-level Python API for direct alignments from Python applications. The package relies on bioblend for communication with the Galaxy API.
Availability and implementation: blast2galaxy is available as open-source software under the MIT license. The source code is available on GitHub: https://github.com/IPK-BIT/blast2galaxy. It can be installed from the Python Package Index using "pip install blast2galaxy" or from the Bioconda channel using "conda install -c bioconda blast2galaxy". Docker and Apptainer images are available and referenced in the documentation, which is available at https://blast2galaxy.readthedocs.io.
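To illustrate what remote tool execution over the Galaxy REST API involves, here is a minimal sketch that only builds (and does not send) the HTTP request for running a tool. It is not code from blast2galaxy or bioblend. The server URL, API key, history ID, tool ID, and the input parameter name `query` are hypothetical placeholders; the real parameter names depend on the tool wrapper installed on the server.

```python
import json
import urllib.request


def build_tool_request(server_url: str, api_key: str, history_id: str,
                       tool_id: str, inputs: dict) -> urllib.request.Request:
    """Build a POST request to Galaxy's tool-execution endpoint (/api/tools)."""
    payload = {
        "history_id": history_id,  # history that will receive the outputs
        "tool_id": tool_id,        # ID of the Galaxy-wrapped tool to run
        "inputs": inputs,          # tool parameters (names are tool-specific)
    }
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/api/tools",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# All values below are placeholders for illustration only.
req = build_tool_request(
    server_url="https://galaxy.example.org",
    api_key="YOUR_API_KEY",
    history_id="HISTORY_ID",
    tool_id="BLASTN_WRAPPER_TOOL_ID",
    inputs={"query": {"src": "hda", "id": "DATASET_ID"}},  # hypothetical name
)
print(req.get_method(), req.full_url)
```

In practice, a client library such as bioblend, which the package relies on according to the abstract, takes care of authentication, uploads, and polling for job completion, so raw requests like this are rarely written by hand.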
Title: blast2galaxy: a CLI and Python API for BLAST+ and DIAMOND searches on Galaxy servers. DOI: 10.1093/bioadv/vbae185. Journal: Bioinformatics Advances, volume 4, issue 1, vbae185. Published: 2024-11-22. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11629687/pdf/
Pub Date: 2024-11-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae173
Kai Wang, Yueming Hu, Sida Li, Ming Chen, Zhong Li
Motivation: Much evidence suggests that the subcellular localization of long noncoding RNAs (LncRNAs) provides key insights into their biological function.
Results: This study proposes a novel deep learning framework, LncLSTA, designed for predicting the subcellular localization of LncRNAs. It first takes the LncRNA sequence, electron-ion interaction pseudopotentials, and nucleotide chemical properties as feature inputs. Departing from conventional k-mer approaches, the model uses a set of 1D convolution and max-pooling operations for dynamic feature aggregation. Furthermore, LncLSTA integrates a long-short term attention module with a bidirectional long short-term memory network to comprehensively extract sequence information. In addition, it incorporates a TextCNN module to enhance accuracy and robustness in subcellular localization tasks. Experimental results demonstrate the efficacy of LncLSTA, which outperforms other state-of-the-art methods. Notably, LncLSTA exhibits transfer learning capability, extending its utility to predicting the subcellular localization of mRNAs while maintaining consistently satisfactory results. This research contributes valuable insights into the biological functions of LncRNAs through subcellular localization and emphasizes the potential of deep learning approaches in advancing RNA-related studies.
Availability and implementation: The source code is publicly available at https://bis.zju.edu.cn/LncLSTA.
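The three feature inputs named in the Results can be sketched as a per-nucleotide encoding. The example below is an assumption-laden illustration, not the LncLSTA implementation: it concatenates a one-hot vector, the commonly used nucleotide chemical property (NCP) triplet (ring structure, functional group, hydrogen bonding), and the standard electron-ion interaction pseudopotential (EIIP) value, then applies a plain max-pooling step. The learned 1D convolutions are omitted, and the function names, feature order, and window size are hypothetical.

```python
from typing import List

# Standard EIIP values per nucleotide (U takes the value usually listed for T).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}

# NCP triplet: (purine ring, amino group, weak hydrogen bonding).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}

BASES = "ACGU"


def encode(seq: str) -> List[List[float]]:
    """Encode each nucleotide as 4 one-hot + 3 NCP + 1 EIIP = 8 features."""
    rows = []
    for base in seq.upper().replace("T", "U"):
        if base not in EIIP:                 # unknown symbol such as N
            rows.append([0.0] * 8)
            continue
        one_hot = [1.0 if base == b else 0.0 for b in BASES]
        ncp = [float(v) for v in NCP[base]]
        rows.append(one_hot + ncp + [EIIP[base]])
    return rows


def max_pool_1d(rows: List[List[float]], window: int) -> List[List[float]]:
    """Non-overlapping max pooling along the sequence axis, per feature."""
    pooled = []
    for start in range(0, len(rows) - window + 1, window):
        chunk = rows[start:start + window]
        pooled.append([max(column) for column in zip(*chunk)])
    return pooled


if __name__ == "__main__":
    features = encode("ACGU")
    print(len(features), len(features[0]))   # 4 positions, 8 features each
    print(max_pool_1d(features, 2))
```

An encoding of this kind yields a fixed number of features per position regardless of sequence content, which is what allows convolution and pooling layers to aggregate local patterns instead of relying on precomputed k-mer counts.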
Title: LncLSTA: a versatile predictor unveiling subcellular localization of lncRNAs through long-short term attention. DOI: 10.1093/bioadv/vbae173. Journal: Bioinformatics Advances, volume 5, issue 1, vbae173. Published: 2024-11-22. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11700581/pdf/