Pub Date: 2025-02-06; DOI: 10.1093/bioinformatics/btaf064
Javier Delgado, Raul Reche, Damiano Cianferoni, Gabriele Orlando, Rob van der Kant, Frederic Rousseau, Joost Schymkowitz, Luis Serrano
Motivation: The FoldX force field was originally validated with a database of 1000 mutants at a time when there were few high-resolution structures. Here we have manually curated a database of 5556 mutants affecting protein stability, resulting in 2484 highly confident mutations denominated FoldX Stability Dataset (FSD), represented in non-redundant X-ray structures with less than 2.5 Å resolution, not involving duplicates, metals or prosthetic groups. Using this database, we have created a new version of the FoldX force field by introducing Pi stacking, pH dependency for all charged residues, improving aromatic-aromatic interactions, modifying the Ncap contribution and α-helix dipole, recalibrating the side chain entropy of Methionine, adjusting the H-bond parameters, and modifying the solvation contribution of Tryptophan and others.
Results: These changes have led to significant improvements in the prediction of specific mutants involving the above residues/interactions, and to a statistically significant improvement in FoldX predictions overall as well as for the majority of the 20 amino acids. After removing all training-set data from FSD (yielding the VFSD dataset), predictions improved from R = 0.693 (RMSE = 1.277 kcal/mol) with the previously released version to R = 0.706 (RMSE = 1.252 kcal/mol). On the VFSD, when predicting the sign of the energy change upon mutation, FoldX achieves 95% accuracy considering a prediction error of ±0.85 kcal/mol, and an AUC of 0.78.
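The two headline metrics above, Pearson R and RMSE between predicted and experimental energy changes, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the four example values are hypothetical, and the tolerance-based accuracy and AUC are left out because the abstract does not define them precisely.

```python
import math

def pearson_r(experimental, predicted):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(experimental)
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_e * var_p)

def rmse(experimental, predicted):
    """Root-mean-square error, in the same units as the inputs (e.g. kcal/mol)."""
    n = len(experimental)
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(experimental, predicted)) / n)

def sign_agreement(experimental, predicted):
    """Fraction of mutants whose predicted and experimental changes share a sign."""
    hits = sum(1 for e, p in zip(experimental, predicted) if (e > 0) == (p > 0))
    return hits / len(experimental)

# Hypothetical energy changes in kcal/mol, for illustration only.
experimental = [1.0, 2.0, 3.0, -1.0]
predicted = [1.5, 1.5, 3.5, -0.5]
print(pearson_r(experimental, predicted))
print(rmse(experimental, predicted))
print(sign_agreement(experimental, predicted))
```

Reporting both metrics is useful because R measures how well predictions track the ranking and spread of the experimental values, while RMSE measures the absolute size of the error.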
Availability: FoldX versions 4.1 & 5.1 are freely available for academics at https://foldxsuite.crg.eu/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: FoldX Force Field revisited, an improved version.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2025-02-05; DOI: 10.1093/bioinformatics/btaf048
Victor Paton, Denes Türei, Olga Ivanova, Sophia Müller-Dott, Pablo Rodriguez-Mier, Veronica Venafra, Livia Perfetto, Martin Garrido-Rodriguez, Julio Saez-Rodriguez
Summary: We present NetworkCommons, a platform for integrating prior knowledge, omics data, and network inference methods, facilitating their usage and evaluation. NetworkCommons aims to be an infrastructure for the network biology community that supports the development of better methods and benchmarks, by enhancing interoperability and integration.
Availability and implementation: NetworkCommons is implemented in Python and offers programmatic access to multiple omics datasets, network inference methods, and benchmarking setups. It is free software, available at https://github.com/saezlab/networkcommons, and deposited in Zenodo at https://doi.org/10.5281/zenodo.14719118.
Supplementary data: Contribution guidelines, additional figures, and descriptions for data, knowledge, methods, evaluation strategies and their implementation are available in the Supplementary Data and in the NetworkCommons documentation at https://networkcommons.readthedocs.io/.
Title: NetworkCommons: bridging data, knowledge and methods to build and evaluate context-specific biological networks.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2025-02-05; DOI: 10.1093/bioinformatics/btaf058
Zhenmiao Zhang, Ishaan Gupta, Pavel A Pevzner
Motivation: The emergence of "telomere-to-telomere" genomics brought the challenge of identifying segmental duplications (SDs) in complete genomes. It further opened the possibility of identifying differences in SDs across individual human genomes and studying SD evolution. These newly emerged challenges require algorithms for reconstructing SDs in the most complex genomic regions, such as rapidly evolving immunoglobulin loci, that evaded all previous attempts to analyze their architecture.
Results: We describe the GenomeDecoder algorithm for inferring SDs and apply it to analyzing genomic architectures of various loci in primate genomes. Our analysis revealed that multiple duplications/deletions led to a rapid birth/death of immunoglobulin genes within the human population and large changes in genomic architecture of immunoglobulin loci across primate genomes. Comparison of immunoglobulin loci across primate genomes suggests that they are subjected to diversifying selection.
Availability and implementation: GenomeDecoder is available at https://github.com/ZhangZhenmiao/GenomeDecoder. The software version and test data used in this paper are available at https://doi.org/10.5281/zenodo.14753844.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: GenomeDecoder: Inferring Segmental Duplications in Highly-Repetitive Genomic Regions.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2025-02-04; DOI: 10.1093/bioinformatics/btaf023
Chuanze Kang, Zonghuan Liu, Han Zhang
Motivation: The drug-disease, gene-disease, and drug-gene relationships, as high-frequency edge types, describe complex biological processes within the biomedical knowledge graph. The structural patterns formed by these three edges are the graph motifs of (disease, drug, gene) triplets. Among them, the triangle is a steady and important motif structure in the network, and other various motifs different from the triangle also indicate rich semantic relationships. However, existing methods only focus on the triangle representation learning for classification, and fail to further discriminate various motifs of triplets. A comprehensive method is needed to predict the various motifs within triplets, which will uncover new pharmacological mechanisms and improve our understanding of disease-gene-drug interactions. Identifying complex motif structures within triplets can also help us to study the structural properties of triangles.
Results: We consider the seven typical motifs within the triplets and propose a novel graph contrastive learning-based method for triplet motif prediction (TriMoGCL). TriMoGCL utilizes a graph convolutional encoder to extract node features from the global network topology. Next, node pooling and edge pooling extract context information as the triplet features from global and local views. To avoid the redundant context information and motif imbalance problem caused by dense edges, we use node and class-prototype contrastive learning to denoise triplet features and enhance discrimination between motifs. The experiments on two different-scale knowledge graphs demonstrate the effectiveness and reliability of TriMoGCL in identifying various motif types. In addition, our model reveals new pharmacological mechanisms, providing a comprehensive analysis of triplet motifs.
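To make the notion of a triplet motif concrete, the sketch below labels a (disease, drug, gene) triplet by which of the three edge types are present. It rests on an assumption the abstract does not confirm: that the seven motifs correspond to the seven non-empty subsets of the three edges, with the triangle being the case where all three exist. The function names and the toy graph are hypothetical, and nothing here reproduces TriMoGCL's learned model.

```python
from itertools import combinations

# The three edge types named in the abstract.
EDGE_TYPES = ("drug-disease", "gene-disease", "drug-gene")

def all_motifs():
    """Enumerate every non-empty subset of the three edge types (assumed motif set)."""
    return [frozenset(c) for k in range(1, 4) for c in combinations(EDGE_TYPES, k)]

def classify_triplet(disease, drug, gene, edges):
    """Return the set of edge types present among the triplet, or None if no edge exists.

    `edges` is a set of undirected node pairs stored as 2-tuples in either order.
    """
    pairs = {
        "drug-disease": (drug, disease),
        "gene-disease": (gene, disease),
        "drug-gene": (drug, gene),
    }
    present = frozenset(
        name for name, (u, v) in pairs.items() if (u, v) in edges or (v, u) in edges
    )
    return present or None

# Toy knowledge graph with made-up node names.
edges = {("drugA", "diseaseX"), ("geneG", "diseaseX"), ("drugA", "geneG"), ("drugB", "geneG")}
print(len(all_motifs()))                                        # number of assumed motif types
print(classify_triplet("diseaseX", "drugA", "geneG", edges))    # all three edges: the triangle
print(classify_triplet("diseaseX", "drugB", "geneG", edges))    # two edges, no drug-disease link
```

In this reading, motif prediction is a seven-class classification over triplets, whereas the triangle-only methods mentioned in the motivation address just one of those classes.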
Availability and implementation: Code and datasets are available at https://github.com/zhanglabNKU/TriMoGCL and https://doi.org/10.5281/zenodo.14633572.
Title: A comprehensive graph neural network method for predicting triplet motifs in disease-drug-gene interactions.
Journal: Bioinformatics (Oxford, England)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11796092/pdf/
Pub Date: 2025-02-04; DOI: 10.1093/bioinformatics/btaf036
Miron B Kursa
Motivation: It is a challenging task to decipher the mechanisms of a complex system from observational data, especially in biology, where systems are sophisticated, measurements coarse, and multi-modality common. The typical approaches of inferring a network of relationships between a system's components struggle with the quality and feasibility of estimation, as well as with the interpretability of the results they yield. Said issues can be avoided, however, when dealing with a simpler problem of tracking only the influence paths, defined as circuits relying on the information of an experimental perturbation as it spreads through the system. Such an approach can be formalized with information theory and leads to a relatively streamlined, interpretable output, in contrast to the incomprehensibly dense 'haystack' networks produced by typical tools.
Results: Following this idea, the paper introduces Vistla, a novel method built around tri-variate mutual information and data processing inequality, combined with a higher-order generalization of the widest path problem. Vistla can be used standalone, in a machine learning pipeline to aid interpretability, or as a tool for mediation analysis; the paper demonstrates its efficiency both in synthetic and real-world problems.
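For readers unfamiliar with the widest path problem mentioned above, here is a minimal sketch of its classical form: find the path whose weakest edge is as strong as possible. This is not Vistla's algorithm, which the abstract describes as a higher-order generalization built on tri-variate mutual information. The graph, the edge weights, and the function name are hypothetical, and weights are assumed to be strictly positive.

```python
import heapq

def widest_path(graph, source, target):
    """Classical widest (maximum-bottleneck) path via a Dijkstra-style search.

    `graph` maps each node to a dict of {neighbor: positive edge weight}.
    Returns (bottleneck_width, path), or (0.0, []) if the target is unreachable.
    """
    best = {source: float("inf")}   # best known bottleneck width per node
    parent = {source: None}
    heap = [(-float("inf"), source)]  # max-heap emulated with negated widths
    done = set()
    while heap:
        neg_width, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            # A path is only as wide as its narrowest edge.
            width = min(-neg_width, weight)
            if width > best.get(neighbor, 0.0):
                best[neighbor] = width
                parent[neighbor] = node
                heapq.heappush(heap, (-width, neighbor))
    if target not in best:
        return 0.0, []
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]

# Hypothetical weights standing in for information-based edge scores.
graph = {
    "X": {"A": 0.9, "B": 0.4},
    "A": {"Y": 0.3},
    "B": {"Y": 0.35},
    "Y": {},
}
print(widest_path(graph, "X", "Y"))
```

The example shows why the criterion differs from shortest-path reasoning: the route through A starts with the strongest edge but is limited by its 0.3 link, so the route through B, with bottleneck 0.35, is preferred.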
Availability and implementation: The R package implementing the method is available at https://gitlab.com/mbq/vistla, as well as on CRAN.
Title: Vistla: identifying influence paths with information theory.
Journal: Bioinformatics (Oxford, England)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11806950/pdf/
Pub Date: 2025-02-04; DOI: 10.1093/bioinformatics/btaf043
Fabricio Almeida-Silva, Yves Van de Peer
Summary: Gene and genome duplications are major evolutionary forces that shape the diversity and complexity of life. However, different duplication modes have distinct impacts on gene function, expression, and regulation. Existing tools for identifying and classifying duplicated genes are either outdated or not user-friendly. Here, we present doubletrouble, an R/Bioconductor package that provides a comprehensive and robust framework for analyzing duplicated genes from genomic data. doubletrouble can detect and classify gene pairs as derived from six duplication modes (segmental, tandem, proximal, retrotransposon-derived, DNA transposon-derived, and dispersed duplications), calculate substitution rates, detect signatures of putative whole-genome duplication events, and visualize results as publication-ready figures. We applied doubletrouble to classify the duplicated gene repertoire in 822 eukaryotic genomes, and results were made available through a user-friendly web interface.
Availability and implementation: doubletrouble is available on Bioconductor (https://bioconductor.org/packages/doubletrouble), and the source code is available in a GitHub repository (https://github.com/almeidasilvaf/doubletrouble). doubletroubledb is available online at https://almeidasilvaf.github.io/doubletroubledb/.
Title: doubletrouble: an R/Bioconductor package for the identification, classification, and analysis of gene and genome duplications.
Journal: Bioinformatics (Oxford, England)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11810640/pdf/
Pub Date: 2025-02-04; DOI: 10.1093/bioinformatics/btaf012
Bernat Bramon Mora, Helen Lindsay, Antonin Thiébaut, Kenneth D Stuart, Raphael Gottardo
Summary: In this article, we present tagtango, an innovative R package and web application designed for robust and intuitive comparison of single-cell clusters and annotations. It offers an interactive platform that simplifies the exploration of differences and similarities among different clustering and annotation methods. Leveraging single-cell data analysis and different visualizations, it allows researchers to dissect the underlying biological differences across groups. tagtango is a user-friendly application that is portable and works seamlessly across multiple operating systems.
Availability and implementation: tagtango is freely available at https://github.com/bernibra/tagtango as an R package as well as an online web service at https://tagtango.unil.ch.
Title: tagtango: an application to compare single-cell annotations.
Journal: Bioinformatics (Oxford, England)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11814489/pdf/
Pub Date: 2025-02-04; DOI: 10.1093/bioinformatics/btaf050
Lin Du, Hammad Farooq, Pourya Delafrouz, Jie Liang
Motivation: Techniques such as high-throughput chromosome conformation capture (Hi-C) have provided a wealth of information on nuclear and genome organization that is important for understanding gene expression regulation. Genome-Wide Association Studies have identified numerous loci associated with complex traits. Expression quantitative trait loci (eQTL) studies have further linked genetic variants to altered expression levels of associated target genes across individuals. However, the functional roles of many eQTLs in noncoding regions remain unclear. Current joint analyses of Hi-C and eQTL data lack advanced computational tools, limiting what can be learned from these data.
Results: We developed a computational method for simultaneous analysis of Hi-C and eQTL data, capable of identifying a small set of nonrandom interactions from all Hi-C interactions. Using these nonrandom interactions, we reconstructed large ensembles (×10^5) of high-resolution single-cell 3D chromatin conformations with thorough sampling, accurately replicating Hi-C measurements. Our results revealed many-body interactions in chromatin conformation at the single-cell level within eQTL loci, providing a detailed view of how 3D chromatin structures form the physical foundation for gene regulation, including how genetic variants of eQTLs affect the expression of associated eGenes. Furthermore, our method can deconvolve chromatin heterogeneity and investigate the spatial associations of eQTLs and eGenes at subpopulation level, revealing their regulatory impacts on gene expression. Together, ensemble modeling of thoroughly sampled single-cell chromatin conformations, combined with eQTL data, helps decipher how 3D chromatin structures provide the physical basis for gene regulation and expression control, and aids in understanding the overall structure-function relationships of genome organization.
Availability and implementation: The code is available at https://github.com/uic-liang-lab/3DChromFolding-eQTL-Loci.
Title: Structural basis of differential gene expression at eQTLs loci from high-resolution ensemble models of 3D single-cell chromatin conformations.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2025-02-04; DOI: 10.1093/bioinformatics/btaf052
Yingfei Wang, Jinsen Li, Tsu-Pei Chiu, Nicolas Gompel, Remo Rohs
Motivation: DNA sequence and shape readout represent different modes of protein-DNA recognition. Current tools lack the functionality to simultaneously consider alterations in different readout modes caused by sequence mutations. DNAdesign is a web-based tool to compare and design mutations based on both DNA sequence and shape characteristics. Users input a wild-type sequence, select sites to introduce mutations and choose a set of DNA shape parameters for mutation design.
Results: DNAdesign utilizes Deep DNAshape to provide ultra-fast predictions of DNA shape based on extended k-mers and offers multiple encoding methods for nucleotide sequences, including the physicochemical encoding of DNA through their functional groups in the major and minor groove. DNAdesign provides all mutation candidates along the sequence and shape dimensions, with interactive visualization comparing each candidate with the wild-type DNA molecule. DNAdesign provides an approach to studying gene regulation and applications in synthetic biology, such as the design of synthetic enhancers and transcription factor binding sites.
Availability and implementation: The DNAdesign webserver and documentation are freely accessible at https://dnadesign.usc.edu.
Title: DNAdesign: feature-aware in silico design of synthetic DNA through mutation. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11825384/pdf/
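The mutation-design workflow described in the DNAdesign abstract (input a wild-type sequence, select sites, list every candidate) can be illustrated with a minimal sketch. This is not DNAdesign's implementation: the function names `mutation_candidates`, `hamming`, and `one_hot` are hypothetical, sites are assumed to be 0-based indices, and only the sequence dimension is covered, because the shape dimension would require a shape predictor such as Deep DNAshape, which is not called here.

```python
from itertools import product

BASES = "ACGT"


def mutation_candidates(wild_type, sites):
    """List every sequence that differs from wild_type only at the chosen sites.

    `sites` holds 0-based positions (an assumption of this sketch).
    The wild-type sequence itself is excluded from the result.
    """
    wild_type = wild_type.upper()
    if any(base not in BASES for base in wild_type):
        raise ValueError("sequence must contain only A, C, G, T")
    sites = sorted(set(sites))
    if any(s < 0 or s >= len(wild_type) for s in sites):
        raise ValueError("site index out of range")

    candidates = []
    # Try every combination of bases at the selected positions.
    for combo in product(BASES, repeat=len(sites)):
        seq = list(wild_type)
        for pos, base in zip(sites, combo):
            seq[pos] = base
        mutant = "".join(seq)
        if mutant != wild_type:
            candidates.append(mutant)
    return candidates


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def one_hot(seq):
    """One simple sequence encoding: one 4-element indicator row per base."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


if __name__ == "__main__":
    wt = "ACGT"
    for mutant in mutation_candidates(wt, [1, 2]):
        print(mutant, hamming(wt, mutant))
```

With k selected sites the sketch yields 4^k - 1 candidates, so exhaustive enumeration is only practical for a small number of sites; a real tool would also attach predicted shape features to each candidate before comparing it with the wild type.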
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf035
R Prabakaran, Yana Bromberg
Motivation: In silico functional annotation of proteins is crucial to narrowing the sequencing-accelerated gap in our understanding of protein activities. Numerous function annotation methods exist, and their ranks have been growing, particularly with recent deep learning-based developments. However, it is unclear whether these tools are truly predictive. As we are not aware of any methods that can identify new terms in functional ontologies, we ask whether existing tools can, at least, identify molecular functions of proteins that are non-homologous to, or far removed from, known protein families.
Results: Here, we explore the potential and limitations of the existing methods in predicting the molecular functions of thousands of such proteins. Lacking the "ground truth" functional annotations, we transformed the assessment of function prediction into evaluation of functional similarity of protein pairs that likely share function but are unlike any of the currently functionally annotated sequences. Notably, our approach transcends the limitations of functional annotation vocabularies, providing a means to assess different-ontology annotation methods. We find that most existing methods are limited to identifying functional similarity of homologous sequences and fail to predict the function of proteins lacking reference. Curiously, despite their seemingly unlimited by-homology scope, deep learning methods also have trouble capturing the functional signal encoded in protein sequence. We believe that our work will inspire the development of a new generation of methods that push boundaries and promote exploration and discovery in the molecular function domain.
Availability and implementation: The data underlying this article are available at https://doi.org/10.6084/m9.figshare.c.6737127.v3. The code used to compute siblings is available openly at https://bitbucket.org/bromberglab/siblings-detector/.
Title: Functional profiling of the sequence stockpile: a protein pair-based assessment of in silico prediction tools. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11821270/pdf/
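The pair-based assessment described in the abstract above can be illustrated with a small sketch. The abstract does not state which similarity measure the authors use, so Jaccard similarity over predicted term sets is an assumption made here, as are the function names `jaccard` and `pair_agreement`, the 0.5 threshold, and all protein and term labels. The point it shows is that each method is scored within its own vocabulary, so methods using different ontologies can still be compared on the same protein pairs.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets; 0.0 when both are empty.

    Treating two empty predictions as dissimilar is a choice of this
    sketch: no prediction gives no evidence of shared function.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_agreement(predictions, pairs, threshold=0.5):
    """Fraction of protein pairs whose predicted annotations are similar.

    `predictions` maps a protein ID to the terms one method assigned.
    `pairs` lists protein pairs assumed to share function.
    """
    if not pairs:
        raise ValueError("at least one protein pair is required")
    hits = 0
    for p, q in pairs:
        sim = jaccard(predictions.get(p, ()), predictions.get(q, ()))
        if sim >= threshold:
            hits += 1
    return hits / len(pairs)


if __name__ == "__main__":
    # Hypothetical pairs and predictions from two methods that use
    # different annotation vocabularies.
    pairs = [("P1", "P2"), ("P3", "P4")]
    method_a = {
        "P1": {"termA1", "termA2"},
        "P2": {"termA1", "termA2"},
        "P3": {"termA3"},
        "P4": {"termA4"},
    }
    method_b = {
        "P1": {"classB1"},
        "P2": {"classB1"},
        "P3": {"classB2"},
        "P4": {"classB2"},
    }
    print("method A:", pair_agreement(method_a, pairs))
    print("method B:", pair_agreement(method_b, pairs))
```

Because the score depends only on whether a method assigns similar annotations to both members of a pair, no ground-truth label and no mapping between vocabularies is needed, which is the property the abstract highlights.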