Pub Date: 2024-12-26
DOI: 10.1093/bioinformatics/btae749
Jacek Karolczak, Anna Przybyłowska, Konrad Szewczyk, Witold Taisner, John M Heumann, Michael H B Stowell, Michał Nowicki, Dariusz Brzezinski
Motivation: Accurately identifying ligands plays a crucial role in the process of structure-guided drug design. Based on density maps from X-ray diffraction or cryogenic-sample electron microscopy (cryoEM), scientists verify whether small-molecule ligands bind to active sites of interest. However, the interpretation of density maps is challenging, and cognitive bias can sometimes mislead investigators into modeling fictitious compounds. Ligand identification can be aided by automatic methods, but existing approaches are available only for X-ray diffraction and are based on iterative fitting or feature-engineered machine learning rather than end-to-end deep learning.
Results: Here, we propose to identify ligands using a deep-learning approach that treats density maps as 3D point clouds. We show that the proposed model is on par with existing machine learning methods for X-ray crystallography while also being applicable to cryoEM density maps. Our study demonstrates that electron density map fragments can aid the training of models that can later be applied to cryoEM structures but also highlights challenges associated with the standardization of electron microscopy maps and the quality assessment of cryoEM ligands.
Availability and implementation: Code and model weights are available on GitHub at https://github.com/jkarolczak/ligands-classification. An accompanying ChimeraX bundle is available at https://github.com/wtaisner/chimerax-ligand-recognizer.
{"title":"Ligand identification in CryoEM and X-ray maps using deep learning.","authors":"Jacek Karolczak, Anna Przybyłowska, Konrad Szewczyk, Witold Taisner, John M Heumann, Michael H B Stowell, Michał Nowicki, Dariusz Brzezinski","doi":"10.1093/bioinformatics/btae749","journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"publicationDate":"2024-12-26","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11709248/pdf/","platform":"Semanticscholar","paperid":"142866571"}
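To make the idea of treating a density map as a 3D point cloud concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the function name `density_to_point_cloud`, the fixed density threshold, and the uniform voxel spacing are all assumptions made for illustration.

```python
def density_to_point_cloud(grid, threshold, spacing=1.0):
    """Return (x, y, z, density) tuples for every voxel above `threshold`.

    `grid` is a nested list indexed as grid[i][j][k]; `spacing` is an
    assumed uniform voxel edge length (hypothetical, e.g. in angstroms).
    """
    points = []
    for i, plane in enumerate(grid):
        for j, row in enumerate(plane):
            for k, value in enumerate(row):
                if value > threshold:
                    points.append((i * spacing, j * spacing, k * spacing, value))
    return points


# Toy 2x2x2 map fragment with made-up density values.
grid = [
    [[0.0, 0.2], [0.9, 0.1]],
    [[0.0, 1.5], [0.3, 0.0]],
]
cloud = density_to_point_cloud(grid, threshold=0.5, spacing=0.5)
```

Each surviving voxel becomes one point carrying its coordinates and density value, which is the kind of input a point-cloud network could consume instead of a dense voxel grid.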
Pub Date: 2024-12-26
DOI: 10.1093/bioinformatics/btae723
Wannes Mores, Satyajeet S Bhonsale, Filip Logist, Jan F M Van Impe
Motivation: Analysis of metabolic networks through extreme rays such as extreme pathways and elementary flux modes has been shown to be effective for many applications. However, due to the combinatorial explosion of candidate vectors, their enumeration is currently limited to small- and medium-scale networks (typically <200 reactions). Partial enumeration of the extreme rays has been shown to be possible, but it relies either on generating them one by one or on implementing a sampling step in the enumeration algorithms. Sampling-based enumeration can be achieved through the canonical basis approach (CBA) or the nullspace approach (NSA). Both algorithms are very efficient in medium-scale networks, but struggle with elementarity testing in sampling-based enumeration of larger networks.
Results: In this paper, a novel elementarity test is defined and exploited, resulting in a significant speedup of the enumeration. Even though NSA is currently considered more effective, the novel elementarity test allows CBA to significantly outpace NSA. This is shown through two case studies, ranging from a medium-scale network to a genome-scale metabolic network with over 600 reactions. In this study, extreme pathways are chosen as the extreme rays, but the novel elementarity test and CBA are equally applicable to the other types. With the increasing complexity of metabolic networks in recent years, CBA with the novel elementarity test shows even more promise, as its advantage grows with increased network complexity. Given this scaling aspect, CBA is now the faster method for enumerating extreme rays in genome-scale metabolic networks.
Availability and implementation: All case studies are implemented in Python. The codebase used to generate extreme pathways using the different approaches is available at https://gitlab.kuleuven.be/biotec-plus/pos-def-ep.
{"title":"Accelerated enumeration of extreme rays through a positive-definite elementarity test.","authors":"Wannes Mores, Satyajeet S Bhonsale, Filip Logist, Jan F M Van Impe","doi":"10.1093/bioinformatics/btae723","journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"publicationDate":"2024-12-26","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11724715/pdf/","platform":"Semanticscholar","paperid":"142869798"}
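For readers unfamiliar with elementarity testing, the sketch below shows the classical rank-based test, not the positive-definite test proposed in the paper (whose details are not given in the abstract). It assumes a network of irreversible reactions and a candidate vector that already satisfies the steady-state and non-negativity constraints; under those assumptions the candidate is an extreme ray exactly when the stoichiometric submatrix restricted to its support has nullity 1. The names `rank` and `is_elementary` and the toy network are illustrative.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a small matrix via exact Gauss-Jordan elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for c in range(n_cols):
        if r == len(rows):
            break
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def is_elementary(S, v):
    """Classical rank test: nullity of S restricted to supp(v) must equal 1."""
    support = [j for j, x in enumerate(v) if x != 0]
    sub = [[row[j] for j in support] for row in S]
    return len(support) - rank(sub) == 1


# Toy network with one metabolite A: R1 produces A, R2 and R3 consume A.
S = [[1, -1, -1]]
single_route = [1, 1, 0]   # uses R1 and R2 only
mixed_route = [2, 1, 1]    # sum of two routes, so not extreme
```

Here `is_elementary(S, single_route)` holds while `is_elementary(S, mixed_route)` does not, because the mixed vector decomposes into two simpler pathways. A rank computation of this kind has to be repeated for every candidate, which is why the cost of the elementarity test matters for large networks.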
Pub Date: 2024-12-26
DOI: 10.1093/bioinformatics/btae735
Stijn Wittouck, Tom Eilers, Vera van Noort, Sarah Lebeer
Motivation: Much of prokaryotic comparative genomics currently relies on two critical computational tasks: pangenome inference and core genome inference. Pangenome inference involves clustering genes from a set of genomes into gene families, enabling genome-wide association studies and evolutionary history analysis. The core genome represents gene families present in nearly all genomes and is required to infer a high-quality phylogeny. For species-level datasets, fast pangenome inference tools have been developed. However, tools applicable to more diverse datasets are currently slow and scale poorly.
Results: Here, we introduce SCARAP, a program containing three modules for comparative genomics analyses: a fast and scalable pangenome inference module, a direct core genome inference module, and a module for subsampling representative genomes. When benchmarked against existing tools, the SCARAP pan module proved up to an order of magnitude faster with comparable accuracy. The core module was validated by comparing its result against a core genome extracted from a full pangenome. The sample module demonstrated the rapid sampling of genomes with decreasing novelty. Applied to a dataset of over 31 000 Lactobacillales genomes, SCARAP showcased its ability to derive a representative pangenome. Finally, we applied the novel concept of gene fixation frequency to this pangenome, showing that Lactobacillales genes that are prevalent but rarely fixate in species often encode bacteriophage functions.
Availability and implementation: The SCARAP toolkit is publicly available at https://github.com/swittouck/scarap.
{"title":"SCARAP: scalable cross-species comparative genomics of prokaryotes.","authors":"Stijn Wittouck, Tom Eilers, Vera van Noort, Sarah Lebeer","doi":"10.1093/bioinformatics/btae735","journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"publicationDate":"2024-12-26","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11681940/pdf/","platform":"Semanticscholar","paperid":"142815257"}
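The relationship between a pangenome and a core genome can be illustrated with a short sketch. It shows the indirect route mentioned in the Results (extracting a core genome from a full pangenome), not SCARAP's direct core module. The function name `core_families`, the prevalence threshold, and the toy data are assumptions for illustration.

```python
def core_families(pangenome, n_genomes, min_prevalence=0.95):
    """Return gene families present in at least `min_prevalence` of genomes.

    `pangenome` maps a gene family name to the set of genomes containing it.
    The 0.95 default is an assumed stand-in for "nearly all genomes".
    """
    return sorted(
        family
        for family, genomes in pangenome.items()
        if len(genomes) / n_genomes >= min_prevalence
    )


# Toy pangenome over four hypothetical genomes g1..g4.
pangenome = {
    "famA": {"g1", "g2", "g3", "g4"},
    "famB": {"g1", "g2", "g3"},
    "famC": {"g4"},
}
strict_core = core_families(pangenome, n_genomes=4)
relaxed_core = core_families(pangenome, n_genomes=4, min_prevalence=0.75)
```

With the default threshold only `famA` qualifies; relaxing it to 0.75 also admits `famB`. This route needs the whole pangenome to be clustered first, which is the cost a direct core genome inference avoids.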
Pub Date: 2024-12-26
DOI: 10.1093/bioinformatics/btaf001
{"title":"Expression of Concern: Cleavage-Stage Embryo Segmentation Using SAM-Based Dual Branch Pipeline: Development and Evaluation with the CleavageEmbryo Dataset.","authors":"","doi":"10.1093/bioinformatics/btaf001","journal":{"name":"Bioinformatics (Oxford, England)","volume":"41 1","pages":""},"publicationDate":"2024-12-26","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11724708/pdf/","platform":"Semanticscholar","paperid":"142973811"}
Pub Date: 2024-12-26
DOI: 10.1093/bioinformatics/btae744
Denis Beslic, Martin Kucklick, Susanne Engelmann, Stephan Fuchs, Bernhard Y Renard, Nils Körber
Motivation: Nanopore sequencing represents a significant advancement in genomics, enabling direct long-read DNA sequencing at the single-molecule level. Accurate simulation of nanopore sequencing signals from nucleotide sequences is crucial for method development and for complementing experimental data. Most existing approaches rely on predefined statistical models, which may not adequately capture the properties of experimental signal data. Furthermore, these simulators were developed for earlier versions of nanopore chemistry, which limits their applicability and adaptability to the latest flow cell data.
Results: To enhance the quality of artificial signals, we introduce seq2squiggle, a novel transformer-based, non-autoregressive model designed to generate nanopore sequencing signals from nucleotide sequences. Unlike existing simulators that rely on static k-mer models, our approach learns sequential contextual information from segmented signal data. We benchmark seq2squiggle against state-of-the-art simulators on real experimental R9.4.1 and R10.4.1 data, evaluating signal similarity, basecalling accuracy, and variant detection rates. Seq2squiggle consistently outperforms existing tools across multiple datasets, demonstrating superior similarity to real data and offering a robust solution for simulating nanopore sequencing signals with the latest flow cell generation.
Availability and implementation: seq2squiggle is freely available on GitHub at: github.com/ZKI-PH-ImageAnalysis/seq2squiggle.
{"title":"End-to-end simulation of nanopore sequencing signals with feed-forward transformers.","authors":"Denis Beslic, Martin Kucklick, Susanne Engelmann, Stephan Fuchs, Bernhard Y Renard, Nils Körber","doi":"10.1093/bioinformatics/btae744","journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"publicationDate":"2024-12-26","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11729726/pdf/","platform":"Semanticscholar","paperid":"142878935"}
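The static k-mer models that the abstract contrasts with seq2squiggle can be sketched as a simple lookup table. This is a deliberately simplified illustration, not the code of any particular simulator: the function name `kmer_signal`, the table values, and the fixed number of samples per k-mer are hypothetical.

```python
def kmer_signal(sequence, kmer_means, k, samples_per_kmer=1):
    """Emit the tabulated mean current for each k-mer along the sequence.

    Every k-mer always maps to the same level, regardless of its wider
    sequence context; this is the limitation a learned model can address.
    """
    signal = []
    for i in range(len(sequence) - k + 1):
        level = kmer_means[sequence[i:i + k]]
        signal.extend([level] * samples_per_kmer)
    return signal


# Hypothetical 2-mer table with made-up current levels.
kmer_means = {"AC": 80.0, "CG": 95.5, "GT": 70.25}
squiggle = kmer_signal("ACGT", kmer_means, k=2, samples_per_kmer=2)
```

The output is a step function with one flat segment per k-mer. A model that learns from segmented experimental signals can instead let the level and duration of each segment depend on the surrounding sequence.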
Pub Date: 2024-12-26
DOI: 10.1093/bioinformatics/btaf003
Miles D Woodcock-Girard, Eric C Bretz, Holly M Robertson, Karolis Ramanauskas, Jarrad T Hampton-Marcell, Joseph F Walker
Motivation: Recent advancements in parallel sequencing methods have precipitated a surge in publicly available short-read sequence data. This has encouraged the development of novel computational tools for the de novo assembly of transcriptomes from RNA-seq data. Despite the availability of these tools, performing an end-to-end transcriptome assembly remains a programmatically involved task necessitating familiarity with best practices. Quality control steps, including error correction, adapter trimming, and chimera filtration, must be applied correctly, and moving data between programs often requires manual reformatting or restructuring, which can further impede throughput. Here, we introduce Semblans, a tool for streamlining the assembly process that efficiently and consistently produces high-quality transcriptome assemblies.
Results: Semblans abstracts the key quality control, reconstitution, and postprocessing steps of transcriptome assembly from raw short-read sequences to annotated coding sequences. Evaluating its performance against previously assembled transcriptomes on the basis of assembly quality, we find that Semblans produced higher quality assemblies for 98 of the 101 short-read runs tested.
Availability and implementation: Semblans is written in C++ and runs on Unix-compliant operating systems. Source code, documentation, and compiled binaries are hosted under the GNU General Public License at https://github.com/gladshire/Semblans.
{"title":"Semblans: automated assembly and processing of RNA-seq data.","authors":"Miles D Woodcock-Girard, Eric C Bretz, Holly M Robertson, Karolis Ramanauskas, Jarrad T Hampton-Marcell, Joseph F Walker","doi":"10.1093/bioinformatics/btaf003","journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"publicationDate":"2024-12-26","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11748423/pdf/","platform":"Semanticscholar","paperid":"142960041"}
Pub Date: 2024-12-26
DOI: 10.1093/bioinformatics/btae761
Andre C Faubert, Shang Wang
Summary: Time-lapse 3D imaging is fundamental for studying biological processes but requires software able to handle terabytes of voxel data. Although many multidimensional viewing applications exist, they mostly lack support for heterogeneous voxel counts, datatypes, and modalities in a single timeline. Open Chrono-Morph Viewer provides a straightforward graphical user interface to quickly investigate multi-timescale datasets represented as separate volume files in the common NRRD format for compatibility between toolchains. It features dynamic clipping surfaces for rapid investigation of 3D morphology and a scriptable animation API for quantitative, repeatable, publication-quality visualization. It is implemented in pure Python using common libraries to facilitate community-driven development.
Availability and implementation: OCMV is available at https://github.com/ShangWangLab/OpenChronoMorphViewer for Windows, Linux, and macOS. Supporting tutorials, documentation, and installation instructions can be found in the supplementary information. Our modified Fiji I/O plugin for up to 5D NRRD file conversion is available at https://github.com/afaubert/IO.
{"title":"Open Chrono-Morph Viewer: visualize big bioimage time series containing heterogeneous volumes.","authors":"Andre C Faubert, Shang Wang","doi":"10.1093/bioinformatics/btae761","journal":{"name":"Bioinformatics (Oxford, England)","volume":"41 1","pages":""},"publicationDate":"2024-12-26","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11751631/pdf/","platform":"Semanticscholar","paperid":"143018191"}
Pub Date: 2024-12-26
DOI: 10.1093/bioinformatics/btae741
Nicholas E Newell
Motivation: Beta turns are the most common type of secondary structure in proteins after alpha helices and beta sheets and play many key structural and functional roles. Turn backbone (BB) geometry has been classified at multiple levels of precision, but the current picture of side chain (SC) structure and interaction in turns is incomplete: the distribution of SC conformations associated with each sequence motif has commonly been represented only by a static image of a single, typical structure for each turn BB geometry, and only motifs that specify a single amino acid (e.g. aspartic acid at turn position 1) have been systematically investigated. Furthermore, no general evaluation has been made of the SC interactions between turns and their BB neighborhoods. Finally, the visualization and comparison of the wide range of turn conformations have been hampered by the almost exclusive characterization of turn structure in BB dihedral-angle (Ramachandran) space.
Results: This work introduces MapTurns, a web server for motif maps, which employ a turn-local Euclidean-space coordinate system and a global turn alignment to comprehensively map the distributions of BB and SC structure and H-bonding associated with sequence motifs in beta turns and their local BB contexts. Maps characterize many new SC motifs, provide detailed rationalizations of sequence preferences, and support mutational analysis and the general study of SC interactions, and they should prove useful in applications such as protein design.
Availability and implementation: MapTurns is available at www.betaturn.com. Sample code is available at: https://github.com/nenewell/MapTurns/tree/main.
{"title":"MapTurns: mapping the structure, H-bonding, and contexts of beta turns in proteins.","authors":"Nicholas E Newell","doi":"10.1093/bioinformatics/btae741","journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"publicationDate":"2024-12-26","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671037/pdf/","platform":"Semanticscholar","paperid":"142840607"}
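The abstract does not define the turn-local coordinate system, so the following is only a generic sketch of how a local Euclidean frame can be built from three reference atoms and used to express other atoms in turn-local coordinates. The choice of reference points and all function names are assumptions, not the MapTurns definition.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)


def local_frame(origin, p_x, p_plane):
    """Right-handed orthonormal frame: x points toward p_x, z is normal
    to the plane through the three points, and y completes the frame."""
    x = normalize(sub(p_x, origin))
    z = normalize(cross(x, sub(p_plane, origin)))
    y = cross(z, x)
    return x, y, z


def to_local(point, origin, frame):
    """Coordinates of `point` in the local frame anchored at `origin`."""
    d = sub(point, origin)
    return tuple(dot(d, axis) for axis in frame)


# Hypothetical reference atoms and a side-chain atom to transform.
origin, p_x, p_plane = (1.0, 1.0, 1.0), (2.0, 1.0, 1.0), (1.0, 3.0, 1.0)
frame = local_frame(origin, p_x, p_plane)
local = to_local((2.0, 3.0, 4.0), origin, frame)
```

Once every turn is expressed in a shared local frame, atoms from different structures become directly comparable in ordinary Cartesian space, which is what makes distribution maps possible without working in dihedral-angle space.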
Pub Date: 2024-12-26
DOI: 10.1093/bioinformatics/btae708
Xiaohong Jin, Zimeng Chen, Dan Yu, Qianhui Jiang, Zhuobin Chen, Bin Yan, Jing Qin, Yong Liu, Junwen Wang
Motivation: Peptides and their derivatives hold potential as therapeutic agents. The rising interest in developing peptide drugs is evidenced by increasing approval rates by the US FDA. To identify the most promising peptides, the study of peptide-protein interactions (PepPIs) is an important approach, but it poses considerable technical challenges. Experimentally, the transient nature of PepPIs and the high flexibility of peptides contribute to elevated costs and inefficiency. Traditional docking and molecular dynamics simulation methods require substantial computational resources, and the predictive accuracy of their results remains unsatisfactory.
Results: To address this gap, we propose TPepPro, a Transformer-based model for PepPI prediction. We trained TPepPro on a dataset of 19,187 pairs of peptide-protein complexes with both sequential and structural features. TPepPro utilizes a strategy that combines local protein sequence feature extraction with global protein structure feature extraction. Moreover, TPepPro optimizes the architecture of the structural feature neural network using a BN-ReLU arrangement, which notably reduces the computing resources required for PepPI prediction. In a comparative analysis, TPepPro reached an accuracy of 0.855, an 8.1% improvement over the second-best model, TAGPPI. TPepPro achieved an AUC of 0.922, surpassing TAGPPI's 0.844. Moreover, TPepPro identifies certain PepPIs that can be validated against previous experimental evidence, indicating that it can efficiently detect high-potential PepPIs that would be helpful for amino acid drug applications.
Availability and implementation: The source code of TPepPro is available at https://github.com/wanglabhku/TPepPro.
Supplementary information: Supplementary data are available at Bioinformatics online.
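The TPepPro abstract reports two evaluation metrics, accuracy and AUC. As a minimal sketch of what these numbers measure, the following computes both for a binary interaction classifier. The function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions; they are not taken from the TPepPro code or data.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of pairs whose thresholded score matches the 0/1 label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is O(n^2) and meant for illustration only.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = interacting pair, 0 = non-interacting pair.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(toy_labels, toy_scores):.3f}")
    print(f"AUC      = {roc_auc(toy_labels, toy_scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.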
{"title":"TPepPro: a deep learning model for predicting peptide-protein interactions.","authors":"Xiaohong Jin, Zimeng Chen, Dan Yu, Qianhui Jiang, Zhuobin Chen, Bin Yan, Jing Qin, Yong Liu, Junwen Wang","doi":"10.1093/bioinformatics/btae708","DOIUrl":"10.1093/bioinformatics/btae708","url":null,"abstract":"<p><strong>Motivation: </strong>Peptides and their derivatives hold potential as therapeutic agents. The rising interest in developing peptide drugs is evidenced by increasing approval rates by the FDA of USA. To identify the most potential peptides, study on peptide-protein interactions (PepPIs) presents a very important approach but poses considerable technical challenges. In experimental aspects, the transient nature of PepPIs and the high flexibility of peptides contribute to elevated costs and inefficiency. Traditional docking and molecular dynamics simulation methods require substantial computational resources, and the predictive accuracy of their results remain unsatisfactory.</p><p><strong>Results: </strong>To address this gap, we proposed TPepPro, a Transformer-based model for PepPI prediction. We trained TPepPro on a dataset of 19,187 pairs of peptide-protein complexes with both sequential and structural features. TPepPro utilizes a strategy that combines local protein sequence feature extraction with global protein structure feature extraction. Moreover, TPepPro optimizes the architecture of structural featuring neural network in BN-ReLU arrangement, which notably reduced the amount of computing resources required for PepPIs prediction. According to comparison analysis, the accuracy reached 0.855 in TPepPro, achieving an 8.1% improvement compared to the second-best model TAGPPI. TPepPro achieved an AUC of 0.922, surpassing the second-best model TAGPPI with 0.844. 
Moreover, the newly developed TPepPro identify certain PepPIs that can be validated according to previous experimental evidence, thus indicating the efficiency of TPepPro to detect high potential PepPIs that would be helpful for amino acid drug applications.</p><p><strong>Availability and implementation: </strong>The source code of TPepPro is available at https://github.com/wanglabhku/TPepPro.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11681936/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142711615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-26 DOI: 10.1093/bioinformatics/btae763
Fabienne Thelen, Jannis Hochmuth, Sven Griep, Benedikt Schwab, Alexander Goesmann, Frank Förster
Motivation and results: Crypt4GH-JS is a browser-ready JavaScript implementation of the Crypt4GH file encryption standard. With minimal to no impact on data upload and download throughput, this library enables on-the-fly encryption of arbitrary data in web applications, on either the client or the server side. As development moves increasingly toward cloud-native applications, this library represents a significant step forward for flexible data security in the context of opaque cloud storage systems.
Availability and implementation: Crypt4GH-JS can be installed via Node Package Manager (https://www.npmjs.com/package/crypt4gh_js) or through its public GitHub Repository (https://github.com/fathelen/crypt4ghJS), where the source code is available. Crypt4GH-JS can be tested in the browser using our demonstration website, which can be found at: https://fathelen.github.io/crypt4ghJS/.
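On-the-fly encryption is practical for Crypt4GH because the format encrypts data in fixed-size segments, so a stream can be processed one segment at a time. The sketch below only illustrates the size arithmetic of the encrypted data portion. The constants (64 KiB plaintext segments, a 12-byte nonce, and a 16-byte authentication tag per segment) come from the GA4GH Crypt4GH specification rather than from this abstract and should be checked against it. The function names are hypothetical and are not part of the Crypt4GH-JS API, and the variable-length file header is deliberately ignored.

```python
from typing import Tuple

# Assumed constants from the GA4GH Crypt4GH specification (not from the abstract).
SEGMENT_SIZE = 65536  # plaintext bytes per segment (64 KiB)
NONCE_SIZE = 12       # per-segment nonce stored before the ciphertext
MAC_SIZE = 16         # per-segment authentication tag stored after the ciphertext


def encrypted_body_size(plaintext_size: int) -> int:
    """Size in bytes of the encrypted data portion, excluding the file header."""
    if plaintext_size < 0:
        raise ValueError("plaintext_size must be non-negative")
    full_segments, remainder = divmod(plaintext_size, SEGMENT_SIZE)
    segments = full_segments + (1 if remainder else 0)
    return plaintext_size + segments * (NONCE_SIZE + MAC_SIZE)


def segment_for_offset(plaintext_offset: int) -> Tuple[int, int]:
    """Map a plaintext byte offset to (segment index, offset inside that segment)."""
    if plaintext_offset < 0:
        raise ValueError("plaintext_offset must be non-negative")
    return divmod(plaintext_offset, SEGMENT_SIZE)


if __name__ == "__main__":
    one_gib = 1024 ** 3
    overhead = encrypted_body_size(one_gib) - one_gib
    print(f"per-segment overhead for 1 GiB: {overhead} bytes")
```

Under these assumptions each full segment grows by 28 bytes, which is a small fraction of the segment itself and fits the abstract's description of minimal impact on throughput.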
{"title":"Crypt4GH-JS: securely storing sensitive data online with client-side encryption.","authors":"Fabienne Thelen, Jannis Hochmuth, Sven Griep, Benedikt Schwab, Alexander Goesmann, Frank Förster","doi":"10.1093/bioinformatics/btae763","DOIUrl":"10.1093/bioinformatics/btae763","url":null,"abstract":"<p><strong>Motivation and results: </strong>Crypt4GH-JS is a browser-ready implementation of the Crypt4GH file encryption standard written in JavaScript. While having minimal to no impact on data upload and download throughput this library enables on-the-fly encryption of arbitrary data in web applications, regardless of whether on the client or server side. As development moves more and more toward cloud-native applications, this library represents a significant step forward for flexible data security in the context of opaque cloud storage systems.</p><p><strong>Availability and implementation: </strong>Crypt4GH-JS can be installed via Node Package Manager (https://www.npmjs.com/package/crypt4gh_js) or through its public GitHub Repository (https://github.com/fathelen/crypt4ghJS), where the source code is available. Crypt4GH-JS can be tested in the browser using our demonstration website, which can be found at: https://fathelen.github.io/crypt4ghJS/.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11771768/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142933805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}