Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad536
Weiwen Wang, James Barbetti, Thomas Wong, Bryan Thornlow, Russ Corbett-Detig, Yatish Turakhia, Robert Lanfear, Bui Quang Minh
Motivation: Neighbour-Joining is one of the most widely used distance-based phylogenetic inference methods. However, current implementations do not scale well for datasets with more than 10 000 sequences. Given the increasing pace of generating new sequence data, particularly in outbreaks of emerging diseases, and the already enormous existing databases of sequence data for which Neighbour-Joining is a useful approach, new implementations of existing methods are warranted.
Results: Here, we present DecentTree, which provides highly optimized and parallel implementations of Neighbour-Joining and several of its variants. DecentTree is designed as a stand-alone application and as a header-only library that is easily integrated with other phylogenetic software (e.g. it is integral in the popular IQ-TREE software). We show that DecentTree delivers similar or improved performance compared with existing software (BIONJ, Quicktree, FastME, and RapidNJ), especially when handling very large alignments. For example, DecentTree is up to 6-fold faster than the fastest existing Neighbour-Joining software (e.g. RapidNJ) when generating a tree of 64 000 SARS-CoV-2 genomes.
Availability and implementation: DecentTree is open source and freely available at https://github.com/iqtree/decenttree. All code and data used in this analysis are available on GitHub (https://github.com/asdcid/Comparison-of-neighbour-joining-software).
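To make the scaling issue concrete, the following is a minimal Python sketch of the pair-selection step of textbook Neighbour-Joining. It is not DecentTree's implementation; the function name `nj_pick_pair` and the list-of-lists matrix layout are assumptions made for illustration. Each join scans all pairs, which costs O(n^2) per join and O(n^3) overall in this naive form.

```python
from itertools import combinations


def nj_pick_pair(dist):
    """Return ((i, j), q) for the pair minimizing the Neighbour-Joining Q criterion.

    dist is a symmetric n x n list of lists with zeros on the diagonal.
    Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    """
    n = len(dist)
    # Row sums are computed once and reused for every pair.
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i, j in combinations(range(n), 2):
        q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
        if q < best_q:
            best_pair, best_q = (i, j), q
    return best_pair, best_q


# Hypothetical five-taxon distance matrix used only as example input.
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q = nj_pick_pair(example)
```

A full Neighbour-Joining run repeats this step, replaces the chosen pair with a new internal node, and updates the matrix until three nodes remain.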
Title: DecentTree: scalable Neighbour-Joining for the genomic era. Bioinformatics, vol. 39(9), published 2023-09-02. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10491953/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad520
Hui Yu, Limin Jiang, Chung-I Li, Scott Ness, Sara G M Piccirillo, Yan Guo
Motivation: As an important player in transcriptome regulation, microRNAs may effectively diffuse somatic mutation impacts to broad cellular processes and ultimately manifest disease and dictate prognosis. Previous studies that tried to correlate mutation with gene expression dysregulation neglected to adjust for the disparate multitudes of false positives associated with unequal sample sizes and uneven class balancing scenarios.
Results: To properly address this issue, we developed a statistical framework to rigorously assess the extent of mutation impact on microRNAs in relation to a permutation-based null distribution of a matching sample structure. Carrying out the framework in a pan-cancer study, we ascertained 9008 protein-coding genes with statistically significant mutation impacts on miRNAs. Of these, the collective miRNA expression for 83 genes showed significant prognostic power in nine cancer types. For example, in lower-grade glioma, 10 genes' mutations broadly impacted miRNAs, all of which showed prognostic value with the corresponding miRNA expression. Our framework was further validated with functional analysis and augmented with rich features including the ability to analyze miRNA isoforms; aggregative prognostic analysis; advanced annotations such as mutation type, regulator alteration, somatic motif, and disease association; and instructive visualization such as mutation OncoPrint, Ideogram, and interactive mRNA-miRNA network.
Availability and implementation: The data underlying this article are available in MutMix, at http://innovebioinfo.com/Database/TmiEx/MutMix.php.
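The idea of a permutation-based null distribution that preserves the sample structure can be sketched as follows. This is a generic label-permutation test, not the authors' framework; the function name `permutation_p_value`, the mean-difference statistic, and the default of 1000 permutations are assumptions chosen for illustration. Shuffling the labels keeps the mutated and wild-type group sizes fixed, so the null distribution reflects the same class imbalance as the observed data.

```python
import random


def permutation_p_value(values, is_mutated, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    values: expression values, one per sample.
    is_mutated: booleans of the same length marking the mutated samples.
    """
    rng = random.Random(seed)

    def mean_diff(labels):
        mut = [v for v, m in zip(values, labels) if m]
        wt = [v for v, m in zip(values, labels) if not m]
        return abs(sum(mut) / len(mut) - sum(wt) / len(wt))

    observed = mean_diff(is_mutated)
    labels = list(is_mutated)
    hits = 0
    for _ in range(n_perm):
        # Shuffling preserves the number of mutated and wild-type samples.
        rng.shuffle(labels)
        if mean_diff(labels) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: 4 mutated samples with high values, 8 wild-type samples.
values = [10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8]
labels = [True] * 4 + [False] * 8
p = permutation_p_value(values, labels)
```

Because every gene is compared against a null built from its own group sizes, genes with few mutated samples are not judged against the same threshold as genes with many.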
Title: Somatic mutation effects diffused over microRNA dysregulation. Bioinformatics, vol. 39(9), published 2023-09-02. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10474951/pdf/
Pub Date: 2023-09-01 DOI: 10.1093/bioinformatics/btad539
Bhanwar Lal Puniya, Andreas Dräger
Summary: The Computational Modelling of Systems Biology (SysMod) Community of Special Interest (COSI) convenes annually at the Intelligent Systems for Molecular Biology (ISMB) conference to facilitate knowledge dissemination and the exchange of research findings on systems modelling from interdisciplinary domains. The SysMod meeting 2022 was held in hybrid mode in Madison, Wisconsin, as a 1-day event centred on modelling techniques, applications, and single-cell technology implementations. The meeting showcased innovative approaches to modelling biological systems using cell-specific and multiscale modelling, multiomics data integration, and novel tools to develop systems models using single-cell and multiomics technology. The meeting also recognized outstanding research by awarding the three best posters. This report summarizes the key highlights and outcomes of the meeting.
Availability and implementation: All resources and further information are freely accessible at https://sysmod.info.
Title: Advancements in computational modelling of biological systems: seventh annual SysMod meeting. Bioinformatics, vol. 27(1), published 2023-09-01.
Pub Date: 2023-08-01 DOI: 10.1093/bioinformatics/btad522
Kim Philipp Jablonski, Niko Beerenwinkel
Motivation: Gene set enrichment methods are a common tool to improve the interpretability of gene lists as obtained, for example, from differential gene expression analyses. They are based on computing whether dysregulated genes are located in certain biological pathways more often than expected by chance. Gene set enrichment tools rely on pre-existing pathway databases such as KEGG, Reactome, or the Gene Ontology. These databases are increasing in size and in the number of redundancies between pathways, which complicates the statistical enrichment computation.
Results: We address this problem and develop a novel gene set enrichment method, called pareg, which is based on a regularized generalized linear model and directly incorporates dependencies between gene sets related to certain biological functions, for example, due to shared genes, in the enrichment computation. We show that pareg is more robust to noise than competing methods. Additionally, we demonstrate the ability of our method to recover known pathways as well as to suggest novel treatment targets in an exploratory analysis using breast cancer samples from TCGA.
Availability and implementation: pareg is freely available as an R package on Bioconductor (https://bioconductor.org/packages/release/bioc/html/pareg.html) as well as on https://github.com/cbg-ethz/pareg. The GitHub repository also contains the Snakemake workflows needed to reproduce all results presented here.
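The classical question of whether dysregulated genes fall into a pathway "more often than expected by chance" is commonly answered with a one-sided hypergeometric test. The sketch below shows that baseline test only; it is not the pareg model, and the function name `enrichment_p_value` and its parameter names are assumptions made for illustration. It treats each pathway independently, which is exactly the limitation that dependency-aware methods aim to address.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_hits, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_universe, n_pathway, n_hits).

    n_universe: number of genes considered in total.
    n_pathway: number of those genes annotated to the pathway.
    n_hits: number of dysregulated genes.
    n_overlap: number of dysregulated genes that are in the pathway.
    """
    upper = min(n_pathway, n_hits)
    total = comb(n_universe, n_hits)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 10 genes, 4 in the pathway, 3 dysregulated, all 3 in the pathway.
p_enriched = enrichment_p_value(10, 4, 3, 3)
```

When two pathways share many genes, this test reports both as enriched or neither, without recognizing that the two results carry overlapping evidence.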
Title: Coherent pathway enrichment estimation by modeling inter-pathway dependencies using regularized regression. Bioinformatics, vol. 39(8), published 2023-08-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10471899/pdf/
Pub Date: 2023-08-01 DOI: 10.1093/bioinformatics/btad487
Daniel Liu, Martin Steinegger
Motivation: Efficiently aligning sequences is a fundamental problem in bioinformatics. Many recent algorithms for computing alignments through Smith-Waterman-Gotoh dynamic programming (DP) exploit Single Instruction Multiple Data (SIMD) operations on modern CPUs for speed. However, these advances have largely ignored difficulties associated with efficiently handling complex scoring matrices or large gaps (insertions or deletions).
Results: We propose a new SIMD-accelerated algorithm called Block Aligner for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. We introduce a new paradigm that uses blocks in the DP matrix that greedily shift, grow, and shrink. This approach allows regions of the DP matrix to be adaptively computed. Our algorithm runs 5-10 times faster than some previous methods while incurring an error rate of less than 3% on protein and long-read datasets, despite large gaps and low sequence identities.
Availability and implementation: Our algorithm is implemented for global, local, and X-drop alignments. It is available as a Rust library (with C bindings) at https://github.com/Daniel-Liu-c0deb0t/block-aligner.
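For reference, the Smith-Waterman-Gotoh recurrence that such aligners accelerate can be written as a plain scalar loop. The sketch below computes the full DP matrix row by row and is not Block Aligner's block-based method; the function name `gotoh_local_score` and the scoring defaults (match 2, mismatch -1, gap open -3, gap extend -1) are assumptions chosen for illustration. A gap of length L costs `gap_open + (L - 1) * gap_extend`.

```python
def gotoh_local_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of strings a and b with affine gap penalties."""
    neg = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)    # H: best score ending at (i - 1, j)
    f_prev = [neg] * (n + 1)  # F: best score ending at (i - 1, j) with a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h = [0] * (n + 1)
        f = [neg] * (n + 1)
        e = neg               # E: best score ending at (i, j) with a horizontal gap
        for j in range(1, n + 1):
            # Either open a new gap from H or extend the existing gap.
            e = max(h[j - 1] + gap_open, e + gap_extend)
            f[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores are clamped at zero.
            h[j] = max(0, h_prev[j - 1] + s, e, f[j])
            best = max(best, h[j])
        h_prev, f_prev = h, f
    return best


score_identical = gotoh_local_score("ACGT", "ACGT")
score_gapped = gotoh_local_score("ACGTACGT", "ACGTTACGT")
```

This loop visits every cell of the matrix, so its cost grows with the product of the two sequence lengths; adaptive approaches reduce the work by computing only part of the matrix.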
Title: Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices. Bioinformatics, vol. 39(8), published 2023-08-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457662/pdf/
Summary: Mass spectrometry (MS)-based proteomics has become the most powerful approach to study the proteome of given biological and clinical samples. Advancements in sample preparation and MS detection have extended the application of proteomics but have also brought new demands on data analysis. An appropriate proteomics data analysis workflow mainly requires quality control, hypothesis testing, functional mining, and visualization. Although there are numerous tools for each process, an efficient and universal tandem analysis toolkit to obtain a quick overall view of various proteomics data is still urgently needed. Here, we present DEP2, an updated version of our previously established DEP, for proteomics data analysis. We amended the analysis workflow by incorporating alternative approaches to accommodate diverse proteomics data, introducing peptide-protein summarization, and coupling biological function exploration. In summary, DEP2 is a well-rounded toolkit designed for protein- and peptide-level quantitative proteomics data. It features a more flexible differential analysis workflow and includes a user-friendly Shiny application to facilitate data analysis.
Availability and implementation: DEP2 is available at https://github.com/mildpiggy/DEP2, released under the MIT license. For further information and usage details, please refer to the package website at https://mildpiggy.github.io/DEP2/.
Title: DEP2: an upgraded comprehensive analysis toolkit for quantitative proteomics data. Authors: Zhenhuan Feng, Peiyang Fang, Hui Zheng, Xiaofei Zhang. DOI: 10.1093/bioinformatics/btad526. Bioinformatics, vol. 39(8), published 2023-08-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10466079/pdf/
Motivation: Graphical analysis of the molecular structure of proteins can be very complex. Full-atom representations retain most geometric information but are generally crowded, and key structural patterns can be challenging to identify. Non-full-atom representations can be more instructive on physicochemical aspects but may be insufficiently detailed regarding shapes (e.g. entity beans-like models in coarse grain approaches) or simple properties of amino acids (e.g. representation of superficial electrostatic properties). In this work, we present TALAIA, a visual dictionary that aims to provide another layer of structural representations. TALAIA offers a visual grammar that combines simple representations of amino acids while retaining their general geometry and physicochemical properties. It uses unique objects with differentiated shapes and colors to represent amino acids. It makes it easier to spot crucial molecular information, including patches of amino acids or key interactions between side chains. Most conventions used in TALAIA are standard in chemistry and biochemistry, so experimentalists and modelers can rapidly grasp the meaning of any TALAIA depiction.
Results: We propose TALAIA as a tool that renders protein structures and encodes structure and physicochemical aspects as a simple visual grammar. The approach is fast, highly informative, and intuitive, allowing the identification of possible interactions, hydrophobic patches, and other characteristic structural features at first glance. The first implementation of TALAIA can be found at https://github.com/insilichem/talaia.
Title: TALAIA: a 3D visual dictionary for protein structures. Authors: Mercè Alemany-Chavarria, Jaime Rodríguez-Guerra, Jean-Didier Maréchal. DOI: 10.1093/bioinformatics/btad476. Bioinformatics, vol. 39(8), published 2023-08-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10423020/pdf/
Pub Date: 2023-08-01 DOI: 10.1093/bioinformatics/btad480
Richard Wilton, Alexander S Szalay
Motivation: Read alignment is an essential first step in the characterization of DNA sequence variation. The accuracy of variant-calling results depends not only on the quality of read alignment and variant-calling software but also on the interaction between these complex software tools.
Results: In this review, we evaluate short-read aligner performance with the goal of optimizing germline variant-calling accuracy. We examine the performance of three general-purpose short-read aligners (BWA-MEM, Bowtie 2, and Arioc) in conjunction with three germline variant callers: DeepVariant, FreeBayes, and GATK HaplotypeCaller. We discuss the behavior of the read aligners with regard to the data elements on which the variant callers rely, and illustrate how the runtime configurations of these software tools combine to affect variant-calling performance.
Title: Short-read aligner performance in germline variant identification. Bioinformatics, vol. 39(8), published 2023-08-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10421969/pdf/
Pub Date: 2023-08-01 DOI: 10.1093/bioinformatics/btad482
Andrea Mastropietro, Gianluca De Carlo, Aris Anagnostopoulos
Motivation: Disease gene prioritization consists of identifying genes that are likely to be involved in the mechanisms of a given disease and providing a ranking of such genes. Recently, the research community has used computational methods to uncover unknown gene–disease associations; these methods range from combinatorial to machine learning-based approaches. In particular, in recent years, approaches based on deep learning have provided superior results compared to more traditional ones. Yet the problem with these approaches is their inherent black-box structure, which prevents interpretability.
Results: We propose a new methodology for disease gene discovery, which leverages graph-structured data using graph neural networks (GNNs) along with an explainability phase for determining the ranking of candidate genes and understanding the model’s output. Our approach is based on a positive–unlabeled learning strategy, which outperforms existing gene discovery methods by exploiting GNNs in a non-black-box fashion. Our methodology is effective even in scenarios where a large number of associated genes need to be retrieved, in which gene prioritization methods often tend to lose their reliability.
Availability and implementation: The source code of XGDAG is available on GitHub at: https://github.com/GiDeCarlo/XGDAG. The data underlying this article are available at: https://www.disgenet.org/, https://thebiogrid.org/, https://doi.org/10.1371/journal.pcbi.1004120.s003, and https://doi.org/10.1371/journal.pcbi.1004120.s004.
Title: XGDAG: explainable gene-disease associations via graph neural networks. Bioinformatics, vol. 39(8), published 2023-08-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10421968/pdf/
Pub Date: 2023-08-01 DOI: 10.1093/bioinformatics/btad491
Yichen Cheng, Yusen Xia, Xinlei Wang
Motivation: We propose a drug recommendation model that integrates information from both structured data (patient demographic information) and unstructured texts (patient reviews). It is based on multitask learning to predict review ratings of several satisfaction-related measures for a given medicine, where related tasks can learn from each other for prediction. The learned models can then be applied to new patients for drug recommendation. This is fundamentally different from most recommender systems in e-commerce, which do not work well for new customers (referred to as the cold-start problem). To extract information from review texts, we employ both topic modeling and sentiment analysis. We further incorporate variable selection into the model via Bayesian LASSO, which aims to filter out irrelevant features. To the best of our knowledge, this is the first Bayesian multitask learning method for ordinal responses. We are also the first to apply multitask learning to medicine recommendation. The sample code and data are made available at GitHub: https://github.com/thrushcyc-github/BMull.
Results: We evaluate the proposed method on two sets of drug reviews involving 17 depression/high blood pressure-related drugs. Overall, our method performs better than existing benchmark methods in terms of accuracy and AUC (area under the receiver operating characteristic curve). It is effective even with a small sample size and only a few available features, and is more robust to possible noninformative covariates. Because the model is explainable, the insights it generates may serve as a useful reference for doctors. In practice, however, any final decision should be made carefully by combining the information from the proposed recommender with doctors' domain knowledge and past experience.
Availability and implementation: The sample code and data are publicly available at GitHub: https://github.com/thrushcyc-github/BMull.
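The two evaluation metrics named in the Results above, accuracy and AUC, can be computed as in the following sketch. It is illustrative only and is not taken from the BMull code: the functions `accuracy` and `roc_auc`, the 0.5 threshold, and the labels and scores are all hypothetical, and the task is simplified to binary labels, whereas the paper models ordinal ratings. AUC is computed here as the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(pos) * len(neg))


# Hypothetical binary labels (1 = satisfied, 0 = not satisfied) and model scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

acc = accuracy(labels, predictions)
auc = roc_auc(labels, scores)
print(f"accuracy={acc:.3f} auc={auc:.3f}")
```

The pairwise form is quadratic in the number of examples, which is fine for a small illustration; a rank-based computation would be preferable on large review sets.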
Bayesian multitask learning for medicine recommendation based on online patient reviews. Yichen Cheng, Yusen Xia, Xinlei Wang. Bioinformatics 39(8), 2023-08-01. DOI: https://doi.org/10.1093/bioinformatics/btad491. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10425196/pdf/