Pub Date: 2024-12-18; DOI: 10.1186/s12859-024-06000-4
Jon Bohlin, Siri E Håberg, Per Magnus, Håkon K Gjessing
Generating prediction models from high-dimensional data often results in large models with many predictors. Causal inference for such models can therefore be difficult or even impossible in practice. The stand-alone software package MinLinMo emphasizes small linear prediction models over the highest possible predictability, with a particular focus on including variables correlated with the outcome, minimal memory usage, and speed. MinLinMo is demonstrated on large epigenetic datasets with prediction models for chronological age, gestational age, and birth weight comprising, respectively, 15, 14 and 10 predictors. The parsimonious MinLinMo models perform comparably to established prediction models requiring hundreds of predictors.
MinLinMo: a minimalist approach to variable selection and linear model prediction. BMC Bioinformatics 25(1):380, 2024-12-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11654326/pdf/
Pub Date: 2024-12-18; DOI: 10.1186/s12859-024-05996-z
Chu Pan, Yanlin Chen
Background: Using information measures to infer biological regulatory networks can capture nonlinear relationships between variables. However, it is computationally challenging, and there is a lack of convenient tools.
Results: We introduce Informeasure, an R package designed to quantify nonlinear dependencies in biological regulatory networks from an information theory perspective. This package compiles a comprehensive set of information measurements, including mutual information, conditional mutual information, interaction information, partial information decomposition, and part mutual information. Mutual information is used for bivariate network inference, while the other four estimators are dedicated to trivariate network analysis.
Conclusions: Informeasure is a turnkey solution, allowing users to utilize these information measures immediately upon installation. Informeasure is available as an R/Bioconductor package at https://bioconductor.org/packages/Informeasure .
Informeasure: an R/bioconductor package for quantifying nonlinear dependence between variables in biological networks from an information theory perspective. BMC Bioinformatics 25(1):382, 2024-12-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11658358/pdf/
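The information measures named in the Informeasure abstract can be illustrated with a minimal Python sketch. It assumes the variables are already discretized and uses plain plug-in (frequency-based) estimates; the function names are hypothetical and are not the package's R API, and the sign convention for interaction information is only one of several in use.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log2(c / n) for c in Counter(joint).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def interaction_information(x, y, z):
    """I(X;Y|Z) - I(X;Y); other texts use the opposite sign."""
    return conditional_mutual_information(x, y, z) - mutual_information(x, y)


# Toy example: z is the XOR of x and y, so x and y are independent
# on their own but fully dependent once z is known.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]

print(mutual_information(x, y))                 # 0.0
print(conditional_mutual_information(x, y, z))  # 1.0
```

The XOR case shows why trivariate measures matter for network inference: a bivariate measure alone would report no relationship between x and y.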
Pub Date: 2024-12-18; DOI: 10.1186/s12859-024-05997-y
Jeet Sukumaran, Marina Meila
Background: Existing software for comparison of species delimitation models does not provide a (true) metric or distance function between species delimitation models, nor a way to compare these models in terms of relative clustering differences along a lattice of partitions.
Results: Piikun is a Python package for analyzing and visualizing species delimitation models in an information theoretic framework that, in addition to classic measures of information such as the entropy and mutual information [1], provides for the calculation of the Variation of Information (VI) criterion [2], a true metric or distance function for species delimitation models that is aligned with the lattice of partitions.
Conclusions: Piikun is available under the MIT license from its public repository ( https://github.com/jeetsukumaran/piikun ), and can be installed locally using the Python package manager 'pip'.
Piikun: an information theoretic toolkit for analysis and visualization of species delimitation metric space. BMC Bioinformatics 25(1):385, 2024-12-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11657818/pdf/
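The Variation of Information mentioned in the Piikun abstract has a compact definition, VI(A, B) = H(A) + H(B) - 2 I(A; B), which the following Python sketch illustrates. It assumes each species delimitation model is given as a list of species labels, one per individual, in the same order; the function names are hypothetical and do not reflect Piikun's own interface.

```python
from collections import Counter
from math import log


def _entropy(labels):
    """Shannon entropy (in nats) of a labelling of the same n elements."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 I(A; B) for two partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same elements")
    h_a = _entropy(labels_a)
    h_b = _entropy(labels_b)
    h_ab = _entropy(list(zip(labels_a, labels_b)))
    mutual_info = h_a + h_b - h_ab
    return h_a + h_b - 2 * mutual_info


# Four individuals under three toy delimitation models.
split = ["sp1", "sp1", "sp2", "sp2"]      # two species
renamed = ["x", "x", "y", "y"]            # same grouping, other names
lumped = ["sp1", "sp1", "sp1", "sp1"]     # everything in one species

print(variation_of_information(split, renamed))  # 0.0
print(variation_of_information(split, lumped))   # ln 2, about 0.693
```

Because VI depends only on the grouping and not on the label names, two models that delimit the same species under different names are at distance zero.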
Pub Date: 2024-12-18; DOI: 10.1186/s12859-024-06004-0
Andrea Conte, Nicola Gulmini, Francesco Costa, Matteo Cartura, Felix Bröhl, Francesco Patanè, Francesco Filippini
Background: Vaccine development in this millennium began with the milestone work on Neisseria meningitidis B, which reported the invention of Reverse Vaccinology (RV), an approach that identifies vaccine candidates (VCs) by screening the genome or proteome of bacterial pathogens through computational analyses. When NERVE (New Enhanced RV Environment), the first RV software integrating tools to perform the selection of VCs, was released, it prompted further development in the field. However, the problem-solving potential of most, if not all, RV programs is still largely unexploited by experimental vaccinologists, who are hindered by rather difficult interfaces that require bioinformatic skills.
Results: We report here on the development and release of NERVE 2.0 (available at: https://nerve-bio.org ), which keeps the original integrative and modular approach of NERVE while showing higher predictive performance than its previous version and other web-based RV programs (Vaxign and Vaxijen). We renewed some of its modules and added innovative ones, such as Loop-Razor, to recover fragments of promising vaccine candidates, and Epitope Prediction, to predict epitope binding affinities and population coverage, along with two newly built AI (artificial intelligence)-based models: ESPAAN and Virulent. To improve user-friendliness, NERVE was moved to a tutored, web-based interface with a NoSQL database that allows the user to submit, obtain and retrieve analysis results at any moment.
Conclusions: With its redesigned and updated environment, NERVE 2.0 offers customisable and refinable bacterial protein vaccine analyses to all kinds of users.
NERVE 2.0: boosting the new enhanced reverse vaccinology environment via artificial intelligence and a user-friendly web interface. BMC Bioinformatics 25(1):378, 2024-12-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11654298/pdf/
Pub Date: 2024-12-18; DOI: 10.1186/s12859-024-05994-1
Christopher Patsalis, Gayatri Iyer, Marci Brandenburg, Alla Karnovsky, George Michailidis
Background: Metabolomics is a high-throughput technology that measures small molecule metabolites in cells, tissues or biofluids. Analysis of metabolomics data is a multi-step process that involves data processing, quality control and normalization, followed by statistical and bioinformatics analysis. The latter step often involves pathway analysis to aid biological interpretation of the data. This approach is limited to endogenous metabolites that can be readily mapped to metabolic pathways. An alternative to pathway analysis that can be used for any classes of metabolites, including unknown compounds that are ubiquitous in untargeted metabolomics data, involves defining metabolite-metabolite interactions using experimental data. Our group has developed several network-based methods that use partial correlations of experimentally determined metabolite measurements. These were implemented in CorrelationCalculator and Filigree, two software tools for the analysis of metabolomics data we developed previously. The latter tool implements the Differential Network Enrichment Analysis (DNEA) algorithm. This analysis is useful for building differential networks from metabolomics data containing two experimental groups and identifying differentially enriched metabolic modules. While Filigree is a user-friendly tool, it has certain limitations when used for the analysis of large-scale metabolomics datasets.
Results: We developed the DNEA R package for the data-driven network analysis of metabolomics data. We present the DNEA workflow and functionality, describe the algorithm enhancements implemented with respect to the package's predecessor, Filigree, and discuss best practices for analyses. We tested the performance of the DNEA R package and illustrated its features using publicly available metabolomics data from the environmental determinants of diabetes in the young. To our knowledge, this package is the only publicly available tool designed for the construction of biological networks and subsequent enrichment testing for datasets containing exogenous, secondary, and unknown compounds. This greatly expands the scope of traditional enrichment analysis tools that can be used to analyze a relatively small set of well-annotated metabolites.
Conclusions: The DNEA R package is a more flexible and powerful implementation of our previously published software tool, Filigree. The modular structure of the package, along with the parallel processing framework built into the most computationally intensive steps of the algorithm, makes it a powerful tool for the analysis of large and complex metabolomics datasets.
DNEA: an R package for fast and versatile data-driven network analysis of metabolomics data. BMC Bioinformatics 25(1):383, 2024-12-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11657348/pdf/
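The DNEA abstract builds its networks on partial correlations between metabolites. As a hedged illustration of that quantity only, the Python sketch below applies the textbook first-order formula for three variables; it is not the DNEA algorithm, which handles many metabolites at once, and the function names and toy data are assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data: x and y are both driven by z plus independent noise,
# so they correlate strongly without interacting directly.
z = [1, 1, -1, -1]
x = [3, 1, -1, -3]   # 2*z + [1, -1, 1, -1]
y = [3, 1, -3, -1]   # 2*z + [1, -1, -1, 1]

print(round(pearson(x, y), 6))                 # 0.8
print(round(partial_correlation(x, y, z), 6))  # approximately 0
```

An edge drawn from the marginal correlation would link x and y, whereas the partial correlation attributes their association to the shared driver z, which is the reason partial correlations are preferred for inferring direct metabolite-metabolite interactions.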
Pub Date: 2024-12-18; DOI: 10.1186/s12859-024-06009-9
Y-H Taguchi, Turki Turki
The evaluation of drug-gene-disease interactions is key for the identification of drugs effective against disease. However, at present, drugs that are effective against genes that are critical for disease are difficult to identify. Following a disease-centric approach, there is a need to identify genes critical to disease function and find drugs that are effective against them. By contrast, a drug-centric approach comprises identifying the genes targeted by drugs, and then the diseases in which the identified genes are critical. Both of these processes are complex. Using a gene-centric approach, whereby we identify genes that are effective against the disease and can be targeted by drugs, is much easier. However, how such sets of genes can be identified without specifying either the target diseases or drugs is not known. In this study, a novel artificial intelligence-based approach that employs unsupervised methods and identifies genes without specifying either diseases or drugs is presented. To evaluate its feasibility, we applied tensor decomposition (TD)-based unsupervised feature extraction (FE) to perform drug repositioning from protein-protein interactions (PPI) without any other information. Proteins selected by TD-based unsupervised FE include many genes related to cancers, as well as drugs that target the selected proteins. Thus, we were able to identify cancer drugs using only PPI. Because the selected proteins had more interactions, we replaced the selected proteins with hub proteins and found that hub proteins themselves could be used for drug repositioning. In contrast to hub proteins, which can only identify cancer drugs, TD-based unsupervised FE enables the identification of drugs for other diseases. In addition, TD-based unsupervised FE can be used to identify drugs that are effective in in vivo experiments, which is difficult when hub proteins are used. In conclusion, TD-based unsupervised FE is a useful tool for drug repositioning using only PPI without other information.
Novel artificial intelligence-based identification of drug-gene-disease interaction using protein-protein interaction. BMC Bioinformatics 25(1):377, 2024-12-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11653834/pdf/
Pub Date: 2024-12-17; DOI: 10.1186/s12859-024-06010-2
Jeremias Krause, Carlos Classen, Daniela Dey, Eva Lausberg, Luise Kessler, Thomas Eggermann, Ingo Kurth, Matthias Begemann, Florian Kraft
Background: Methods to call, analyze and visualize copy number variations (CNVs) from massive parallel sequencing data have been widely adopted in clinical practice and genetic research. To enable a streamlined analysis of CNV data, comprehensive annotations and good visualizations are indispensable. The ability to detect single exon CNVs is another important feature for genetic testing. Nonetheless, most available open-source tools come with limitations in at least one of these areas. One additional drawback is that available tools deliver data in an unstructured and static format which requires subsequent visualization and formatting efforts.
Results: Here we present CNVizard, an interactive Streamlit app allowing a comprehensive visualization of CNVkit data. Furthermore, combining CNVizard with the CNVand pipeline allows the annotation and visualization of CNV or SV VCF files from any CNV caller.
Conclusion: CNVizard, in combination with CNVand, enables the comprehensive and streamlined analysis of short- and long-read sequencing data and provides an intuitive webapp-like experience enabling an interactive visualization of CNV data.
CNVizard - a lightweight streamlit application for an interactive analysis of copy number variants. BMC Bioinformatics 25(1):376, 2024-12-17. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11650836/pdf/
Pub Date: 2024-12-05; DOI: 10.1186/s12859-024-05984-3
Jinghong Sun, Han Wang, Jia Mi, Jing Wan, Jingyang Gao
Background: The development of drug-target binding affinity (DTA) prediction tasks significantly drives the drug discovery process forward. Leveraging the rapid advancement of artificial intelligence, DTA prediction tasks have undergone a transformative shift from wet lab experimentation to machine learning-based prediction. This transition enables a more expedient exploration of potential interactions between drugs and targets, leading to substantial savings in time and funding resources. However, existing methods still face several challenges, such as drug information loss, lack of calculation of the contribution of each modality, and lack of simulation regarding the drug-target binding mechanisms.
Results: We propose MTAF-DTA, a method for drug-target binding affinity prediction to solve the above problems. The drug representation module extracts three modalities of features from drugs and uses an attention mechanism to update their respective contribution weights. Additionally, we design a Spiral-Attention Block (SAB) as a drug-target feature fusion module based on multi-type attention mechanisms, facilitating a triple fusion process between them. The SAB, to some extent, simulates the interactions between drugs and targets, thereby enabling outstanding performance in the DTA task. Our regression task on the Davis and KIBA datasets demonstrates the predictive capability of MTAF-DTA, with CI and MSE metrics showing respective improvements of 1.1% and 9.2% over the state-of-the-art (SOTA) method in the novel target settings. Furthermore, downstream tasks further validate MTAF-DTA's superiority in DTA prediction.
Conclusions: Experimental results and a case study demonstrate the superior performance of our approach in DTA prediction tasks, showing its potential in practical applications such as drug discovery and disease treatment.
MTAF-DTA: multi-type attention fusion network for drug-target affinity prediction. BMC Bioinformatics 25(1):375, 2024-12-05. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11622562/pdf/
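The MTAF-DTA results are reported as CI and MSE. Assuming CI denotes the concordance index as it is commonly defined in drug-target affinity work (the abstract does not spell it out), the two metrics can be sketched in Python as follows. The function names and the toy affinity values are illustrative assumptions, not data from the paper.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the truth.

    A pair is comparable when the true values differ; a tie in the
    predictions counts as half a concordant pair.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:
                comparable += 1
                diff = y_pred[i] - y_pred[j]
                if diff > 0:
                    concordant += 1.0
                elif diff == 0:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no pairs with different true values")
    return concordant / comparable


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy affinities: one of the six comparable pairs is ranked wrongly.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.0, 7.5, 8.5]

print(concordance_index(y_true, y_pred))   # 5/6, about 0.833
print(mean_squared_error(y_true, y_pred))  # 0.4375
```

The two metrics are complementary: CI rewards correct ranking of affinities regardless of scale, while MSE penalizes the size of each error, so a higher CI and a lower MSE are both improvements.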
Pub Date : 2024-12-04 DOI: 10.1186/s12859-024-05993-2
Ivo C Leist, María Rivas-Torrubia, Marta E Alarcón-Riquelme, Guillermo Barturen, Precisesads Clinical Consortium, Ivo G Gut, Manuel Rueda
Background: Phenotypic data comparison is essential for disease association studies, patient stratification, and genotype-phenotype correlation analysis. To support these efforts, the Global Alliance for Genomics and Health (GA4GH) established Phenopackets v2 and Beacon v2 standards for storing, sharing, and discovering genomic and phenotypic data. These standards provide a consistent framework for organizing biological data, simplifying their transformation into computer-friendly formats. However, matching participants using GA4GH-based formats remains challenging, as current methods are not fully compatible, limiting their effectiveness.
Results: Here, we introduce Pheno-Ranker, an open-source software toolkit for individual-level comparison of phenotypic data. As input, it accepts JSON/YAML data exchange formats from Beacon v2 and Phenopackets v2 data models, as well as any data structure encoded in JSON, YAML, or CSV formats. Internally, the hierarchical data structure is flattened to one dimension and then transformed through one-hot encoding. This allows for efficient pairwise (all-to-all) comparisons within cohorts or for matching a single patient's profile against cohorts. Users have the flexibility to refine their comparisons by including or excluding terms, applying weights to variables, and obtaining statistical significance through Z-scores and p-values. The output consists of text files, which can be further analyzed using unsupervised learning techniques, such as clustering or multidimensional scaling (MDS), and with graph analytics. Pheno-Ranker's performance has been validated with simulated and synthetic data, showing its accuracy, robustness, and efficiency across various health data scenarios. A real data use case from the PRECISESADS study highlights its practical utility in clinical research.
Conclusions: Pheno-Ranker is user-friendly, lightweight software for semantic similarity analysis of phenotypic data in Beacon v2 and Phenopackets v2 formats, extendable to other data types. It enables the comparison of a wide range of variables beyond HPO or OMIM terms while preserving full context. The software is designed as a command-line tool with additional utilities for CSV import, data simulation, summary statistics plotting, and QR code generation. For interactive analysis, it also includes a web-based user interface built with R Shiny. Links to the online documentation, including a Google Colab tutorial, and the tool's source code are available on the project home page (https://github.com/CNAG-Biomedical-Informatics/pheno-ranker).
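The flatten-then-one-hot-encode idea described in the Results can be illustrated with a minimal Python sketch. This is not Pheno-Ranker's code or API: the function names, the "path:value" token format, the toy records with made-up term IDs, and the choice of Hamming distance as the comparison metric are all assumptions made here for illustration, since the abstract does not specify them.

```python
from itertools import combinations
from statistics import mean, pstdev


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into a set of 'path:value' tokens."""
    tokens = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            tokens |= flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for item in obj:
            # List positions are dropped, so element order does not matter.
            tokens |= flatten(item, prefix)
    else:
        tokens.add(f"{prefix}:{obj}")
    return tokens


def one_hot(token_sets):
    """Encode each individual's tokens as a 0/1 vector over a shared vocabulary."""
    vocab = sorted(set().union(*token_sets.values()))
    vectors = {
        pid: [1 if tok in toks else 0 for tok in vocab]
        for pid, toks in token_sets.items()
    }
    return vectors, vocab


def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def z_scores(values):
    """Standardize values against their own mean and population std. dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Hypothetical toy cohort; the term IDs are placeholders, not real ontology terms.
cohort = {
    "P1": {"sex": "female", "phenotypes": [{"id": "X:001"}, {"id": "X:002"}]},
    "P2": {"sex": "female", "phenotypes": [{"id": "X:001"}]},
    "P3": {"sex": "male", "phenotypes": [{"id": "X:003"}]},
}

token_sets = {pid: flatten(record) for pid, record in cohort.items()}
vectors, vocab = one_hot(token_sets)

# All-to-all comparison within the cohort.
pairs = list(combinations(sorted(vectors), 2))
distances = {pair: hamming(vectors[pair[0]], vectors[pair[1]]) for pair in pairs}
scores = dict(zip(pairs, z_scores(list(distances.values()))))

for pair in pairs:
    print(pair, distances[pair], round(scores[pair], 2))
```

In this toy cohort P1 and P2 differ in a single token, so they receive the smallest distance and a negative Z-score relative to the other pairs. Term inclusion/exclusion and per-variable weights, which the abstract mentions, could be layered on by filtering or weighting the vocabulary before computing distances.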
{"title":"Pheno-Ranker: a toolkit for comparison of phenotypic data stored in GA4GH standards and beyond.","authors":"Ivo C Leist, María Rivas-Torrubia, Marta E Alarcón-Riquelme, Guillermo Barturen, Precisesads Clinical Consortium, Ivo G Gut, Manuel Rueda","doi":"10.1186/s12859-024-05993-2","DOIUrl":"10.1186/s12859-024-05993-2","url":null,"abstract":"<p><strong>Background: </strong>Phenotypic data comparison is essential for disease association studies, patient stratification, and genotype-phenotype correlation analysis. To support these efforts, the Global Alliance for Genomics and Health (GA4GH) established Phenopackets v2 and Beacon v2 standards for storing, sharing, and discovering genomic and phenotypic data. These standards provide a consistent framework for organizing biological data, simplifying their transformation into computer-friendly formats. However, matching participants using GA4GH-based formats remains challenging, as current methods are not fully compatible, limiting their effectiveness.</p><p><strong>Results: </strong>Here, we introduce Pheno-Ranker, an open-source software toolkit for individual-level comparison of phenotypic data. As input, it accepts JSON/YAML data exchange formats from Beacon v2 and Phenopackets v2 data models, as well as any data structure encoded in JSON, YAML, or CSV formats. Internally, the hierarchical data structure is flattened to one dimension and then transformed through one-hot encoding. This allows for efficient pairwise (all-to-all) comparisons within cohorts or for matching of a patient's profile in cohorts. Users have the flexibility to refine their comparisons by including or excluding terms, applying weights to variables, and obtaining statistical significance through Z-scores and p-values. The output consists of text files, which can be further analyzed using unsupervised learning techniques, such as clustering or multidimensional scaling (MDS), and with graph analytics. 
Pheno-Ranker's performance has been validated with simulated and synthetic data, showing its accuracy, robustness, and efficiency across various health data scenarios. A real data use case from the PRECISESADS study highlights its practical utility in clinical research.</p><p><strong>Conclusions: </strong>Pheno-Ranker is a user-friendly, lightweight software for semantic similarity analysis of phenotypic data in Beacon v2 and Phenopackets v2 formats, extendable to other data types. It enables the comparison of a wide range of variables beyond HPO or OMIM terms while preserving full context. The software is designed as a command-line tool with additional utilities for CSV import, data simulation, summary statistics plotting, and QR code generation. For interactive analysis, it also includes a web-based user interface built with R Shiny. Links to the online documentation, including a Google Colab tutorial, and the tool's source code are available on the project home page: https://github.com/CNAG-Biomedical-Informatics/pheno-ranker .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"373"},"PeriodicalIF":2.9,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11616229/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142779296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}