HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad535
Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O'Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna
Motivation: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.
Results: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. Compared to alternative methods, HAPNEST is computationally faster and shows a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through a comparison of seven methods for generating polygenic risk scores across multiple ancestry groups and different genetic architectures.
Availability and implementation: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
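As a rough illustration of how a phenotype with a target heritability and polygenicity can be simulated from genotypes, the sketch below uses a standard additive genetic model. It is not the HAPNEST algorithm (which is implemented in Julia and C); all names, parameters, and the random genotypes are illustrative placeholders.

```python
# Minimal sketch (not the HAPNEST algorithm): simulate one phenotype with a
# target heritability h2 and polygenicity (fraction of causal variants) from
# a genotype matrix under a standard additive model.
import numpy as np

rng = np.random.default_rng(0)

def simulate_phenotype(genotypes, h2=0.5, polygenicity=0.01, rng=rng):
    """genotypes: (n_individuals, n_variants) array of 0/1/2 allele counts."""
    n, m = genotypes.shape
    n_causal = max(1, int(round(polygenicity * m)))
    causal = rng.choice(m, size=n_causal, replace=False)
    betas = rng.normal(0.0, 1.0, size=n_causal)          # causal effect sizes
    # standardize causal genotypes so each variant contributes comparably
    g = genotypes[:, causal].astype(float)
    g = (g - g.mean(axis=0)) / (g.std(axis=0) + 1e-12)
    genetic = g @ betas
    # scale environmental noise so that var(genetic) / var(total) equals h2
    env_sd = np.sqrt(genetic.var() * (1.0 - h2) / h2)
    return genetic + rng.normal(0.0, env_sd, size=n)

# toy usage with random genotypes
G = rng.integers(0, 3, size=(500, 2000))
y = simulate_phenotype(G, h2=0.3, polygenicity=0.05)
```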
MSDRP: a deep learning model based on multisource data for predicting drug response
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad514
Haochen Zhao, Xiaoyu Zhang, Qichang Zhao, Yaohang Li, Jianxin Wang
Motivation: Cancer heterogeneity drastically affects cancer therapeutic outcomes. Predicting drug response in vitro is expected to help formulate personalized therapy regimens. In recent years, several computational models based on machine learning and deep learning have been proposed to predict drug response in vitro. However, most of these methods capture drug features based on a single drug description (e.g. drug structure), without considering the relationships between drugs and biological entities (e.g. targets, diseases, and side effects). Moreover, most of these methods collect features separately for drugs and cell lines but fail to consider the pairwise interactions between drugs and cell lines.
Results: In this paper, we propose MSDRP, a deep learning framework for drug response prediction. MSDRP uses an interaction module to capture interactions between drugs and cell lines and integrates multiple associations/interactions between drugs and biological entities through similarity network fusion algorithms, outperforming some state-of-the-art models on all performance measures in all experiments. Results from de novo and independent tests demonstrate the excellent performance of our model for new drugs. Furthermore, several case studies illustrate the rationale for representing drugs with feature vectors derived from multisource drug similarity matrices, as well as the interpretability of our model.
Availability and implementation: The code for MSDRP is available at https://github.com/xyzhang-10/MSDRP.
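To illustrate the general idea of deriving drug features from multiple drug-entity association sources, the sketch below builds similarity matrices from binary association tables and combines them. MSDRP itself uses similarity network fusion (an iterative cross-diffusion procedure); here the fusion is replaced by a simple average for brevity, and all matrices are random placeholders.

```python
# Simplified sketch of multisource drug features (not MSDRP's implementation):
# Jaccard similarities from drug-entity association matrices, fused by a
# simple average instead of full similarity network fusion.
import numpy as np

def jaccard_similarity(assoc):
    """assoc: (n_drugs, n_entities) binary drug-entity association matrix."""
    inter = assoc @ assoc.T
    counts = assoc.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter, dtype=float),
                     where=union > 0)

def fuse(similarities):
    """Average several (n_drugs, n_drugs) similarity matrices; a stand-in
    for the iterative cross-diffusion used by similarity network fusion."""
    return np.mean(np.stack(similarities), axis=0)

rng = np.random.default_rng(1)
drug_target = rng.integers(0, 2, size=(50, 300))     # placeholder associations
drug_disease = rng.integers(0, 2, size=(50, 120))
fused = fuse([jaccard_similarity(drug_target), jaccard_similarity(drug_disease)])
drug_features = fused    # row i serves as the feature vector of drug i
```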
metGWAS 1.0: an R workflow for network-driven over-representation analysis between independent metabolomic and meta-genome-wide association studies
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad523
Saifur R Khan, Andreea Obersterescu, Erica P Gunderson, Babak Razani, Michael B Wheeler, Brian J Cox
Motivation: Genome-wide association studies (GWAS) combined with metabolomics provide a quantitative approach to pinpoint metabolic pathways and genes linked to specific diseases; however, such analyses require both genomics and metabolomics datasets from the same individuals/samples. In most cases, this approach is not feasible due to high costs, lack of technical infrastructure, unavailability of samples, and other factors. Therefore, an unmet need exists for a bioinformatics tool that can identify gene loci and associated polymorphic variants underlying the metabolite alterations seen in disease states using standalone metabolomics data.
Results: Here, we developed a bioinformatics tool, metGWAS 1.0, that integrates independent GWAS data from the GWAS database and standalone metabolomics data using a network-based systems biology approach to identify novel disease/trait-specific metabolite-gene associations. The tool was evaluated using standalone metabolomics datasets extracted from two metabolomics-GWAS case studies. Compared to the original studies, it recovered the gene loci observed there and identified novel loci with known single-nucleotide polymorphisms.
Availability and implementation: The metGWAS 1.0 framework is implemented as an R pipeline and is available at https://github.com/saifurbd28/metGWAS-1.0.
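The statistical core of an over-representation analysis of this kind can be illustrated with a hypergeometric test of the overlap between genes mapped from altered metabolites and genes reported near GWAS hits. The sketch below is a generic illustration, not the exact metGWAS 1.0 procedure (which is network-driven and implemented in R); the gene symbols are hypothetical.

```python
# Minimal sketch: hypergeometric over-representation test for the overlap
# between metabolite-linked genes and GWAS-associated genes.
from scipy.stats import hypergeom

def overrepresentation_p(background_genes, metabolite_genes, gwas_genes):
    background = set(background_genes)
    metab = set(metabolite_genes) & background
    gwas = set(gwas_genes) & background
    overlap = metab & gwas
    M, n, N, k = len(background), len(gwas), len(metab), len(overlap)
    # P(X >= k) when drawing N genes from a background of M containing n "successes"
    return hypergeom.sf(k - 1, M, n, N)

# toy usage with hypothetical gene symbols
p = overrepresentation_p(
    background_genes=[f"GENE{i}" for i in range(20000)],
    metabolite_genes=["GENE1", "GENE2", "GENE3", "GENE10"],
    gwas_genes=["GENE2", "GENE3", "GENE500"],
)
```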
CellAnn: a comprehensive, super-fast, and user-friendly single-cell annotation web server
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad521
Pin Lyu, Yijie Zhai, Taibo Li, Jiang Qian
Motivation: Single-cell sequencing technology has become routine in studying many biological problems. A core step in analyzing single-cell data is the assignment of cell clusters to specific cell types. Reference-based methods have been proposed for predicting the cell types of single-cell clusters. However, limited scalability and the lack of preprocessed reference datasets prevent them from being practical and easy to use.
Results: Here, we introduce a reference-based cell annotation web server, CellAnn, which is super-fast and easy to use. CellAnn contains a comprehensive reference database with 204 human and 191 mouse single-cell datasets. These reference datasets cover 32 organs. Furthermore, we developed a cluster-to-cluster alignment method to transfer cell labels from the reference to the query datasets, which outperforms existing methods in both accuracy and scalability. Finally, CellAnn is an online tool that integrates all the procedures in cell annotation, including reference searching, transferring cell labels, visualizing results, and harmonizing cell annotation labels. Through the user-friendly interface, users can identify the best annotation by cross-validating with multiple reference datasets. We believe that CellAnn can greatly facilitate single-cell sequencing data analysis.
Availability and implementation: The web server is available at www.cellann.io, and the source code is available at https://github.com/Pinlyu3/CellAnn_shinyapp.
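A generic way to picture cluster-to-cluster label transfer is to correlate the average expression profiles of query clusters with those of reference clusters over shared genes and assign each query cluster the label of its best-matching reference cluster. The sketch below illustrates that idea only; it is not CellAnn's alignment algorithm, and the cluster names and random data are placeholders.

```python
# Generic sketch of cluster-to-cluster label transfer (not CellAnn's method):
# Spearman correlation of cluster-averaged expression profiles over shared genes.
import numpy as np
import pandas as pd

def transfer_labels(query_profiles, reference_profiles):
    """Both inputs: DataFrames of cluster-averaged expression, genes x clusters.
    Returns a dict mapping each query cluster to its best reference cluster."""
    shared = query_profiles.index.intersection(reference_profiles.index)
    q = query_profiles.loc[shared]
    r = reference_profiles.loc[shared]
    assignments = {}
    for qc in q.columns:
        corrs = r.corrwith(q[qc], method="spearman")
        assignments[qc] = corrs.idxmax()
    return assignments

# toy usage with random data and hypothetical labels
genes = [f"g{i}" for i in range(100)]
rng = np.random.default_rng(2)
ref = pd.DataFrame(rng.random((100, 3)), index=genes, columns=["Tcell", "Bcell", "NK"])
qry = pd.DataFrame(rng.random((100, 2)), index=genes, columns=["cluster0", "cluster1"])
print(transfer_labels(qry, ref))
```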
Ionmob: a Python package for prediction of peptide collisional cross-section values
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad486
David Teschner, David Gomez-Zepeda, Arthur Declercq, Mateusz K Łącki, Seymen Avci, Konstantin Bob, Ute Distler, Thomas Michna, Lennart Martens, Stefan Tenzer, Andreas Hildebrandt
Motivation: Including ion mobility separation (IMS) in mass spectrometry proteomics experiments improves coverage and throughput. Many IMS devices enable linking the experimentally derived mobility of an ion to its collisional cross-section (CCS), a highly reproducible physicochemical property that depends on the ion's mass, charge, and conformation in the gas phase. Thus, known peptide ion mobilities can be used to tailor acquisition methods or to refine database search results. The large space of potential peptide sequences, further expanded by posttranslational modifications of amino acids, motivates an in silico predictor for peptide CCS. Recent studies explored the general performance of various machine-learning techniques; however, workflow engineering was of secondary importance. For the sake of applicability, such a tool should be generic, data-driven, and easily adaptable to individual workflows for experimental design and data processing.
Results: We created ionmob, a Python-based framework for data preparation, model training, and prediction of peptide collisional cross-section values. It is easily customizable and includes a set of pretrained, ready-to-use models and preprocessing routines for training and inference. Using a set of ≈21 000 unique phosphorylated peptides and ≈17 000 MHC ligand sequence and charge-state pairs, we expand the space of peptides that can be integrated into CCS prediction. Lastly, we investigate the applicability of in silico predicted CCS for increasing confidence in identified peptides through re-scoring and demonstrate that predicted CCS values complement existing predictors for that task.
Availability and implementation: The Python package is available on GitHub: https://github.com/theGreatHerrLebert/ionmob.
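The kind of feature-based CCS regression such a framework wraps can be sketched with a handful of simple peptide features and a standard regressor. The sketch below is illustrative only; ionmob's own models, features, and API may differ, and the assumed column names (sequence, mass, charge, ccs) are placeholders.

```python
# Minimal sketch of feature-based peptide CCS regression (not ionmob's API).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def featurize(df):
    """Build simple numeric features from a peptide table."""
    return pd.DataFrame({
        "mz": df["mass"] / df["charge"],      # mass-to-charge ratio
        "charge": df["charge"],
        "length": df["sequence"].str.len(),   # peptide length
    })

def train_ccs_model(df):
    """df is assumed to hold columns: sequence, mass, charge, ccs."""
    X, y = featurize(df), df["ccs"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
    print("held-out R^2:", model.score(X_test, y_test))
    return model
```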
cloneRate: fast estimation of single-cell clonal dynamics using coalescent theory
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad561
Brian Johnson, Yubo Shuai, Jason Schweinsberg, Kit Curtius
Motivation: While evolutionary approaches to medicine show promise, measuring evolution itself is difficult due to experimental constraints and the dynamic nature of body systems. In cancer evolution, continuous observation of clonal architecture is impossible, and longitudinal samples from multiple timepoints are rare. Increasingly available DNA sequencing datasets at single-cell resolution enable the reconstruction of past evolution using mutational history, allowing for a better understanding of dynamics prior to detectable disease. There is an unmet need for an accurate, fast, and easy-to-use method to quantify clone growth dynamics from these datasets.
Results: We derived methods based on coalescent theory for estimating the net growth rate of clones using either reconstructed phylogenies or the number of shared mutations. We applied and validated our analytical methods for estimating the net growth rate of clones, eliminating the need for complex simulations used in previous methods. When applied to hematopoietic data, we show that our estimates may have broad applications to improve mechanistic understanding and prognostic ability. Compared to clones with a single or unknown driver mutation, clones with multiple drivers have significantly increased growth rates (median 0.94 versus 0.25 per year; P = 1.6 × 10⁻⁶). Further, stratifying patients with a myeloproliferative neoplasm (MPN) by the growth rate of their fittest clone shows that higher growth rates are associated with shorter time to MPN diagnosis (median 13.9 versus 26.4 months; P = 0.0026).
Availability and implementation: We developed a publicly available R package, cloneRate, to implement our methods (Package website: https://bdj34.github.io/cloneRate/). Source code: https://github.com/bdj34/cloneRate/.
P-DOR, an easy-to-use pipeline to reconstruct bacterial outbreaks using genomics
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad571
Gherard Batisti Biffignandi, Greta Bellinzona, Greta Petazzoni, Davide Sassera, Gian Vincenzo Zuccotti, Claudio Bandi, Fausto Baldanti, Francesco Comandatore, Stefano Gaiarsa
Summary: Bacterial healthcare-associated infections (HAIs) are a major threat worldwide that can be counteracted by establishing effective infection control measures, guided by constant surveillance and timely epidemiological investigations. Genomics is crucial in modern epidemiology but lacks standard methods and user-friendly software accessible to users without strong bioinformatics proficiency. To overcome these issues, we developed P-DOR, a novel tool for rapid bacterial outbreak characterization. P-DOR accepts genome assemblies as input, automatically selects a background of publicly available genomes using k-mer distances, and adds it to the analysis dataset before inferring a single-nucleotide polymorphism (SNP)-based phylogeny. Epidemiological clusters are identified by considering the phylogenetic tree topology and SNP distances, and by analyzing the SNP-distance distribution the user can choose an appropriate threshold. Patient metadata can also be provided to obtain a spatio-temporal representation of the outbreak. The entire pipeline is fast and scalable and can also be run on low-end computers.
Availability and implementation: P-DOR is implemented in Python 3 and R and can be installed using conda environments. It is available on GitHub at https://github.com/SteMIDIfactory/P-DOR under the GPL-3.0 license.
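The thresholded SNP-distance clustering step can be pictured as computing pairwise SNP distances between isolates and grouping those within a chosen threshold into connected components (single linkage). The sketch below shows that idea only; it is not the full P-DOR pipeline, and the isolate names, sequences, and threshold are placeholders.

```python
# Sketch of SNP-distance clustering of isolates (not the full P-DOR pipeline).
from itertools import combinations
import networkx as nx

def snp_distance(a, b):
    """Number of differing positions, ignoring gaps and ambiguous bases."""
    return sum(1 for x, y in zip(a, b)
               if x != y and x in "ACGT" and y in "ACGT")

def epi_clusters(alignment, threshold=20):
    """alignment: dict {isolate_name: aligned core-SNP sequence}.
    Returns connected components of isolates within the SNP threshold."""
    g = nx.Graph()
    g.add_nodes_from(alignment)
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        if snp_distance(s1, s2) <= threshold:
            g.add_edge(n1, n2)
    return list(nx.connected_components(g))

# toy usage
aln = {"isoA": "ACGTACGT", "isoB": "ACGTACGA", "isoC": "TTGTACGA"}
print(epi_clusters(aln, threshold=1))
```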
Cardinality optimization in constraint-based modelling: application to human metabolism
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad450
Ronan M T Fleming, Hulda S Haraldsdottir, Le Hoai Minh, Phan Tu Vuong, Thomas Hankemeier, Ines Thiele
Motivation: Several applications in constraint-based modelling can be mathematically formulated as cardinality optimization problems involving the minimization or maximization of the number of nonzeros in a vector. These problems include testing for stoichiometric consistency, testing for flux consistency, testing for thermodynamic flux consistency, computing sparse solutions to flux balance analysis problems and computing the minimum number of constraints to relax to render an infeasible flux balance analysis problem feasible. Such cardinality optimization problems are computationally complex, with no known polynomial time algorithms capable of returning an exact and globally optimal solution.
Results: By approximating the zero-norm with nonconvex continuous functions, we reformulate a set of cardinality optimization problems in constraint-based modelling as differences of convex functions. We implemented and numerically tested novel algorithms that approximately solve the reformulated problems using a sequence of convex programs. Applying these algorithms to various biochemical networks, we demonstrate that they match or outperform existing related approaches. In particular, we illustrate the efficiency and practical utility of our algorithms for cardinality optimization problems that arise when extracting a model ready for thermodynamic flux balance analysis from a human metabolic reconstruction.
Availability and implementation: Open-source scripts to reproduce the results are available at https://github.com/opencobra/COBRA.papers/2023_cardOpt, with general-purpose functions integrated into the COnstraint-Based Reconstruction and Analysis (COBRA) toolbox: https://github.com/opencobra/cobratoolbox.
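One standard way to see how a zero-norm objective becomes a difference of convex functions is the capped-ℓ1 approximation, sketched below. This is an illustration of the general pattern (approximate the zero-norm, split it into two convex terms, then linearize the concave part at each iterate in a DCA-type sequence of convex programs); the specific approximation functions used in the paper may differ.

```latex
% Capped-l1 approximation of the zero-norm and its difference-of-convex form
\|v\|_0 \;\approx\; \Phi_\varepsilon(v)
  \;=\; \sum_i \min\!\Big(\tfrac{|v_i|}{\varepsilon},\, 1\Big)
  \;=\; \underbrace{\tfrac{1}{\varepsilon}\,\|v\|_1}_{g(v)\ \text{convex}}
  \;-\; \underbrace{\sum_i \max\!\Big(\tfrac{|v_i|}{\varepsilon} - 1,\, 0\Big)}_{h(v)\ \text{convex}}

% DCA-type iteration over the feasible set C (e.g. steady-state flux constraints):
% linearize the concave part -h(v) at the current iterate and solve a convex program
v^{k+1} \in \operatorname*{arg\,min}_{v \in \mathcal{C}} \; g(v) - \langle s^k,\, v\rangle,
\qquad s^k \in \partial h(v^k)
```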
DCAlign v1.0: aligning biological sequences using co-evolution models and informed priors
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad537
Anna Paola Muntoni, Andrea Pagnani
Summary: DCAlign is a recently introduced alignment method able to cope with the conservation and co-evolution signals that characterize the columns of multiple sequence alignments of homologous sequences. However, the pre-processing steps required to align a candidate sequence are computationally demanding. In v1.0, we show how to dramatically reduce the overall computing time by including an empirical prior over an informative set of variables mirroring the presence of insertions and deletions.
Availability and implementation: DCAlign v1.0 is implemented in Julia and is available at https://github.com/infernet-h2020/DCAlign.
diseaseGPS: auxiliary diagnostic system for genetic disorders based on genotype and phenotype
Daoyi Huang, Jianping Jiang, Tingting Zhao, Shengnan Wu, Pin Li, Yongfen Lyu, Jincai Feng, Mingyue Wei, Zhixing Zhu, Jianlei Gu, Yongyong Ren, Guangjun Yu, Hui Lu
Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad517
Summary: Next-generation sequencing has brought new opportunities for the diagnosis of genetic disorders thanks to its high-throughput capabilities. However, most existing methods are limited to sequencing candidate variants, and linking these variants to a diagnosis of a genetic disorder still requires medical professionals to consult databases. We therefore introduce diseaseGPS, an integrated platform for the diagnosis of genetic disorders that combines phenotype and genotype data for analysis. It offers not only a user-friendly GUI web application for users without a programming background but also scripts that can be executed in batch mode by bioinformatics professionals. Genetic and phenotypic data are integrated using the ACMG-Bayes method and a novel phenotypic similarity method to prioritize candidate genetic disorders. diseaseGPS was evaluated on 6085 cases from the Deciphering Developmental Disorders project and 187 cases from Shanghai Children's Hospital, and the results demonstrate that it performs better than other commonly used methods.
Availability and implementation: diseaseGPS is freely accessible at https://diseasegps.sjtu.edu.cn, with source code available at https://github.com/BioHuangDY/diseaseGPS.
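A generic way to picture phenotype-based prioritization is to score the overlap between a patient's HPO terms and a disease's annotated terms, weighting each term by its information content so that rare, specific phenotypes count more. The sketch below is a simple illustration of that idea, not the paper's novel phenotypic similarity method or its ACMG-Bayes integration; the HPO term frequencies are hypothetical placeholders.

```python
# Generic sketch of information-content-weighted phenotype-set similarity
# (not diseaseGPS's method): score overlap of patient and disease HPO terms.
import math

def information_content(term, term_frequency):
    """IC = -log of how often a term annotates diseases; frequencies are
    assumed precomputed from an annotation corpus (placeholder default)."""
    return -math.log(term_frequency.get(term, 1e-6))

def phenotype_similarity(patient_terms, disease_terms, term_frequency):
    shared = set(patient_terms) & set(disease_terms)
    num = sum(information_content(t, term_frequency) for t in shared)
    den = sum(information_content(t, term_frequency)
              for t in set(patient_terms) | set(disease_terms))
    return num / den if den else 0.0

# toy usage with hypothetical HPO term frequencies
freq = {"HP:0001250": 0.05, "HP:0001263": 0.08, "HP:0000252": 0.02}
score = phenotype_similarity(
    patient_terms=["HP:0001250", "HP:0000252"],
    disease_terms=["HP:0001250", "HP:0001263"],
    term_frequency=freq,
)
```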