Motivation: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory.
Results: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT and for answering set maximal match (SMEM) queries on haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets from the 1000 Genomes Project and the UK Biobank. Our experiments demonstrate that μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT adapts techniques for the run-length compressed BWT to the PBWT (RLPBWT): it keeps in memory only a succinct representation of the RLPBWT that still allows the efficient computation of SMEMs over the original panel.
Availability and implementation: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.
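To make the run-length encoding idea concrete, here is a minimal Python sketch. It is not the μ-PBWT implementation: it only builds the PBWT columns with the standard prefix-array update and run-length encodes each one. The function names `pbwt_columns` and `run_length_encode`, the toy panel, and the restriction to biallelic (0/1) sites are assumptions for illustration; the succinct structures needed to answer SMEM queries are not reproduced.

```python
from itertools import groupby


def pbwt_columns(haplotypes):
    """Yield one PBWT column per site for a panel of 0/1 haplotypes."""
    order = list(range(len(haplotypes)))  # prefix array before the first site
    n_sites = len(haplotypes[0]) if haplotypes else 0
    for k in range(n_sites):
        # Column k lists the alleles at site k in the current sorted order.
        yield [haplotypes[i][k] for i in order]
        # Stable partition: haplotypes carrying 0 at site k come first, then 1.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones


def run_length_encode(column):
    """Collapse a column into (allele, run_length) pairs."""
    return [(allele, len(list(group))) for allele, group in groupby(column)]


# Hypothetical toy panel: 4 haplotypes, 4 biallelic sites.
panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
]
encoded = [run_length_encode(col) for col in pbwt_columns(panel)]
total_runs = sum(len(runs) for runs in encoded)
```

When haplotypes that are adjacent in the sorted order tend to share alleles, each column collapses into few runs, which is the property a run-length representation exploits.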
μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik Köppl, Christina Boucher, Paola Bonizzoni. Bioinformatics 39(9), 2023-09-02. DOI: 10.1093/bioinformatics/btad552. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10502237/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad568
Anna Nadtochiy, Peter Luu, Scott E Fraser, Thai V Truong
Summary: In functional imaging studies, accurately synchronizing the time course of experimental manipulations and stimulus presentations with resulting imaging data is crucial for analysis. Current software tools lack such functionality, requiring manual processing of the experimental and imaging data, which is error-prone and potentially non-reproducible. We present VoDEx, an open-source Python library that streamlines the data management and analysis of functional imaging data. VoDEx synchronizes the experimental timeline and events (e.g. presented stimuli, recorded behavior) with imaging data. VoDEx provides tools for logging and storing the timeline annotation, and enables retrieval of imaging data based on specific time-based and manipulation-based experimental conditions.
Availability and implementation: VoDEx is an open-source Python library and can be installed via the "pip install" command. It is released under a BSD license, and its source code is publicly accessible on GitHub (https://github.com/LemonJust/vodex). A graphical interface is available as a napari-vodex plugin, which can be installed through the napari plugins menu or using "pip install." The source code for the napari plugin is available on GitHub (https://github.com/LemonJust/napari-vodex). The software version at the time of submission is archived at Zenodo (version v1.0.18, https://zenodo.org/record/8061531).
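The synchronization described above can be pictured with a small Python sketch. It does not use the VoDEx API, which is not shown in this listing; `label_frames`, `volumes_for_condition`, and the dark/light cycle are hypothetical names and values chosen for illustration. The sketch labels every recorded frame by repeating a stimulus cycle, then retrieves the volumes whose frames all fall under one condition.

```python
from itertools import cycle


def label_frames(n_frames, stimulus_cycle):
    """Label each frame by repeating a cycle of (label, duration_in_frames)."""
    if sum(duration for _, duration in stimulus_cycle) <= 0:
        raise ValueError("the stimulus cycle must span at least one frame")
    labels = []
    for label, duration in cycle(stimulus_cycle):
        if len(labels) >= n_frames:
            break
        # Truncate the last block so exactly n_frames labels are produced.
        labels.extend([label] * min(duration, n_frames - len(labels)))
    return labels


def volumes_for_condition(frame_labels, frames_per_volume, condition):
    """Return indices of complete volumes recorded entirely under `condition`."""
    n_volumes = len(frame_labels) // frames_per_volume
    selected = []
    for v in range(n_volumes):
        chunk = frame_labels[v * frames_per_volume:(v + 1) * frames_per_volume]
        if all(label == condition for label in chunk):
            selected.append(v)
    return selected


# Hypothetical recording: 24 frames, 3 frames per volume,
# alternating 4 frames of "dark" with 6 frames of "light".
labels = label_frames(24, [("dark", 4), ("light", 6)])
light_volumes = volumes_for_condition(labels, 3, "light")
dark_volumes = volumes_for_condition(labels, 3, "dark")
```

Volumes that straddle a change of condition are excluded in this sketch; whether such volumes should be kept, dropped, or flagged is a design choice that depends on the experiment.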
VoDEx: a Python library for time annotation and management of volumetric functional imaging data. Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10562951/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad560
Yan Zhu, Lingling Zhao, Naifeng Wen, Junjie Wang, Chunyu Wang
Motivation: Accurate prediction of drug-target binding affinity (DTA) is crucial for drug discovery. The growing number of published large-scale DTA datasets enables the development of various computational methods for DTA prediction. Numerous deep learning-based methods have been proposed to predict affinities, some of which use only original sequence information or complex structures, but the effective combination of different kinds of information with protein-binding pockets has not been fully explored. Therefore, a new method that integrates the available key information is urgently needed to predict DTA and accelerate the drug discovery process.
Results: In this study, we propose a novel deep learning-based predictor termed DataDTA to estimate the affinities of drug-target pairs. DataDTA takes as inputs descriptors of predicted pockets and sequences of proteins, as well as low-dimensional molecular features and SMILES strings of compounds. Specifically, the pockets were predicted from the three-dimensional structures of proteins, and their descriptors were extracted as part of the input features for DTA prediction. The molecular representation of compounds based on algebraic graph features was collected to supplement the input information of targets. Furthermore, to ensure effective learning of multiscale interaction features, a dual-interaction aggregation neural network strategy was developed. DataDTA was compared with state-of-the-art methods on different datasets, and the results showed that DataDTA is a reliable tool for affinity estimation. Specifically, on the test dataset the concordance index (CI) of DataDTA is 0.806 and the Pearson correlation coefficient (R) is 0.814, both higher than those of the other methods.
Availability and implementation: The codes and datasets of DataDTA are available at https://github.com/YanZhu06/DataDTA.
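For readers unfamiliar with the two reported metrics, the following Python sketch shows how a concordance index and a Pearson correlation coefficient are commonly computed for affinity predictions. It is an illustration under stated assumptions, not the DataDTA evaluation code: the pairwise CI definition (ties in the prediction count one half) is the usual one, and the function names and toy affinity values are made up.

```python
import math


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant = 0.0
    comparable = 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:  # only pairs with distinct true affinities
                comparable += 1
                if y_pred[i] > y_pred[j]:
                    concordant += 1.0
                elif y_pred[i] == y_pred[j]:
                    concordant += 0.5  # tied predictions count one half
    if comparable == 0:
        raise ValueError("no pair with distinct true values")
    return concordant / comparable


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical measured and predicted affinities for four drug-target pairs.
measured = [5.0, 6.2, 7.1, 8.4]
predicted = [5.3, 6.0, 7.8, 7.5]
ci = concordance_index(measured, predicted)  # 5 of 6 comparable pairs agree
r = pearson_r(measured, predicted)
```

A CI of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so CI measures ordering quality while R measures linear agreement of the values themselves.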
DataDTA: a multi-feature and dual-interaction aggregation framework for drug-target binding affinity prediction. Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516524/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad554
Correction to: Robust joint clustering of multi-omics single-cell data via multi-modal high-order neighborhood Laplacian matrix optimization. Bioinformatics 39(9). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10497449/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad529
Lechuan Li, Ruth Dannenfelser, Yu Zhu, Nathaniel Hejduk, Santiago Segarra, Vicky Yao
Motivation: Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this cross-species transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein-protein interactions to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem.
Results: We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structure and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within-species and between-species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA's embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for drug repurposing and translational studies.
Availability and implementation: https://github.com/ylaboratory/ETNA.
Joint embedding of biological networks for cross-species functional alignment. Bioinformatics 39(9). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10477935/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad530
Yiwei Liu, Changhuo Yang, Hong-Dong Li, Jianxin Wang
Motivation: A single gene may yield several isoforms with different functions through alternative splicing. Continuous efforts are devoted to developing machine-learning methods to predict isoform functions. However, existing methods do not consider the relevance of each feature to specific functions and ignore the noise caused by irrelevant features. We therefore hypothesize that a feature selection framework that extracts the function-relevant features might improve model accuracy in isoform function prediction.
Results: In this article, we present a feature selection-based approach named IsoFrog to predict isoform functions. First, IsoFrog adopts a reversible jump Markov Chain Monte Carlo (RJMCMC)-based feature selection framework to assess the importance of each feature to gene functions. Second, a sequential feature selection procedure is applied to select a subset of function-relevant features. This strategy retains the features relevant to the specific function while eliminating irrelevant ones, improving the effectiveness of the input features. Then, the selected features are passed to our proposed method, a modified domain-invariant partial least squares (diPLS), which prioritizes the most likely positive isoform for each positive MIG and then applies diPLS for isoform function prediction. Tested on three datasets, our method achieves superior performance over six state-of-the-art methods, and the RJMCMC-based feature selection framework outperforms three classic feature selection methods. We expect the proposed methodology to promote the identification of isoform functions and to inspire the development of new methods.
Availability and implementation: IsoFrog is freely available at https://github.com/genemine/IsoFrog.
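The abstract does not say which variant of sequential feature selection IsoFrog uses, so the Python sketch below shows one common form, greedy forward selection, purely as an illustration. The function name, the `score` callback, and the integer "usefulness" values are hypothetical; in practice the score would come from a cross-validated model rather than a lookup table, and the RJMCMC importance step is not reproduced here.

```python
def sequential_forward_selection(n_features, score, max_features):
    """Greedily add the feature that most improves score(subset); stop when none helps."""
    selected = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    while remaining and len(selected) < max_features:
        candidate, candidate_score = None, best_score
        for feature in sorted(remaining):  # sorted for deterministic tie-breaking
            trial_score = score(selected + [feature])
            if trial_score > candidate_score:
                candidate, candidate_score = feature, trial_score
        if candidate is None:  # no remaining feature strictly improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Hypothetical per-feature contributions: features 0 and 2 help,
# feature 1 adds noise, feature 3 is neutral.
usefulness = [5, -2, 3, 0]


def toy_score(subset):
    """Toy additive score standing in for a model's validation performance."""
    return sum(usefulness[f] for f in subset)


chosen, chosen_score = sequential_forward_selection(4, toy_score, max_features=4)
```

Requiring a strict improvement before accepting a feature is what keeps the neutral and noisy features out of the selected subset in this toy example.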
IsoFrog: a reversible jump Markov Chain Monte Carlo feature selection-based method for predicting isoform functions. Bioinformatics 39(9). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10491952/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad532
Lin Wang, Chenhao Sun, Xianyu Xu, Jia Li, Wenjuan Zhang
Motivation: A critical issue in drug benefit-risk assessment is determining the frequency of side effects, which is done through randomized controlled trials. Computationally predicted frequencies of drug side effects can be used to guide these trials effectively. However, predicting drug side effect frequencies is challenging, and thus only a few studies address this problem.
Results: In this work, we propose a neighborhood-regularization method (NRFSE) that leverages multiview data on drugs and side effects to predict the frequency of side effects. First, we adopt a class-weighted non-negative matrix factorization to decompose the drug-side effect frequency matrix, in which a Gaussian likelihood is used to model unknown drug-side effect pairs. Second, we design a multiview neighborhood regularization that integrates three drug attributes and two side effect attributes, so that the most similar drugs and the most similar side effects have similar latent signatures. The regularization can adaptively determine the weights of the different attributes. We conduct extensive experiments on one benchmark dataset, and NRFSE improves the prediction performance compared with five state-of-the-art approaches. An independent test set of post-marketing side effects further validates the effectiveness of NRFSE.
Availability and implementation: Source code and datasets are available at https://github.com/linwang1982/NRFSE or https://codeocean.com/capsule/4741497/tree/v1.
A neighborhood-regularization method leveraging multiview data for predicting the frequency of drug-side effects. Bioinformatics 39(9). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10491955/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad512
Bryce Kille, Erik Garrison, Todd J Treangen, Adam M Phillippy
Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.
Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
Availability and implementation: MashMap3 is available at https://github.com/marbl/MashMap.
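As background for the quantities discussed above, the Python sketch below computes the exact Jaccard similarity of two k-mer sets and a bottom-s MinHash estimate of it. It covers whole sequences only and is not the minmer scheme or any MashMap code: the windowed, rolling construction from the abstract is not reproduced, and the function names, the BLAKE2b-based hash, and the example sequences are assumptions for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def kmer_hash(kmer):
    """Deterministic 64-bit hash (the built-in hash() of str varies per process)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_s_sketch(kmer_set, s):
    """Keep the s smallest hash values of the set."""
    return set(sorted(kmer_hash(x) for x in kmer_set)[:s])


def minhash_jaccard(a, b, s):
    """Estimate Jaccard similarity from bottom-s sketches of a and b."""
    sketch_a, sketch_b = bottom_s_sketch(a, s), bottom_s_sketch(b, s)
    merged = set(sorted(sketch_a | sketch_b)[:s])  # bottom-s sketch of the union
    if not merged:
        return 1.0
    shared = merged & sketch_a & sketch_b  # sampled hashes present in both sets
    return len(shared) / len(merged)


# Hypothetical sequences differing by one substitution.
set_a = kmers("ACGTACGTGACCTGA", 4)
set_b = kmers("ACGTACGTGACGTGA", 4)
exact = jaccard(set_a, set_b)
estimate_full = minhash_jaccard(set_a, set_b, s=1000)  # sketch holds every k-mer
estimate_small = minhash_jaccard(set_a, set_b, s=4)    # genuinely subsampled
```

When s is at least the size of the union, the sketch contains every k-mer and the estimate coincides with the exact value; smaller s trades accuracy for space.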
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation. Bioinformatics 39(9). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10505501/pdf/
Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad546
Xinwei He, Kun Qian, Ziqian Wang, Shirou Zeng, Hongwei Li, Wei Vivian Li
Motivation: Since the development of single-cell RNA sequencing (scRNA-seq) technologies, clustering analysis of single-cell gene expression data has been an essential tool for distinguishing cell types and identifying novel cell types. Although many methods are available for scRNA-seq clustering analysis, the majority of them are constrained by the requirement for a predetermined number of clusters or by their dependence on the selected initial cluster assignment.
Results: In this article, we propose an adaptive embedding and clustering method named scAce, which constructs a variational autoencoder to simultaneously learn cell embeddings and cluster assignments. In the scAce method, we develop an adaptive cluster merging approach which achieves improved clustering results without the need to estimate the number of clusters in advance. In addition, scAce provides an option to perform clustering enhancement, which can update and enhance cluster assignments based on previous clustering results from other methods. Based on computational analysis of both simulated and real datasets, we demonstrate that scAce outperforms state-of-the-art clustering methods for scRNA-seq data, and achieves better clustering accuracy and robustness.
Availability and implementation: The scAce package is implemented in Python 3.8 and is freely available from https://github.com/sldyns/scAce.
scAce: an adaptive embedding and clustering method for single-cell gene expression data. Bioinformatics 39(9). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10500084/pdf/
Pub Date: 2023-09-02. DOI: 10.1093/bioinformatics/btad534
Jun Young Park, Jang Jae Lee, Younghwa Lee, Dongsoo Lee, Jungsoo Gim, Lindsay Farrer, Kun Ho Lee, Sungho Won
Motivation: Accommodating increasingly large samples is key to identifying associations of genetic variants with Alzheimer's disease (AD) in genome-wide association studies (GWAS). Accordingly, we aimed to develop a method that incorporates patients with mild cognitive impairment and unknown cognitive status into GWAS using a machine learning-based AD prediction model.
Results: Simulation analyses showed that the weighting imputed phenotypes method increased statistical power compared with ordinary logistic regression using only AD cases and controls. Applied to real-world data, the penalized logistic method had the highest AUC (0.96) for AD prediction, and the weighting imputed phenotypes method performed well in terms of power. We identified an association (P < 5.0×10⁻⁸) of AD with several variants in the APOE region and with rs143625563 in LMX1A. Our method, which allows the inclusion of individuals with mild cognitive impairment, improves the statistical power of GWAS for AD. We discovered a novel association with LMX1A.
Availability and implementation: Simulation code is available at https://github.com/Junkkkk/wGEE_GWAS.
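The abstract does not describe how imputed phenotypes are weighted, so the sketch below only illustrates one simple way an uncertain diagnosis could enter an association test. It swaps in a plain weighted logistic regression (fitted by Newton/IRLS), not the authors' estimator. Each individual with unknown status is entered twice, once as a case with weight p and once as a control with weight 1 - p, where p is a predicted AD probability. The function names, the status coding (-1 for unknown), and the `ridge` stabilizer are assumptions made for illustration.

```python
import numpy as np

def weighted_logistic_fit(x, y, w, n_iter=25, ridge=1e-8):
    """Weighted logistic regression of y on a single predictor x.
    Returns [intercept, slope]. Illustrative only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])  # add intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        grad = X.T @ (w * (y - mu))            # weighted score
        curv = w * mu * (1.0 - mu)
        hess = (X * curv[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    return beta

def expand_with_imputed(genotype, status, p_case):
    """status: 1 = case, 0 = control, -1 = unknown (assumed coding).
    Unknown individuals appear twice, weighted by p_case and 1 - p_case."""
    genotype = np.asarray(genotype, dtype=float)
    status = np.asarray(status)
    p_case = np.asarray(p_case, dtype=float)
    known = status >= 0
    unk = ~known
    n_unk = int(unk.sum())
    g = np.concatenate([genotype[known], genotype[unk], genotype[unk]])
    y = np.concatenate([status[known].astype(float), np.ones(n_unk), np.zeros(n_unk)])
    w = np.concatenate([np.ones(int(known.sum())), p_case[unk], 1.0 - p_case[unk]])
    return g, y, w

# Toy data: allele counts, diagnoses, and hypothetical predicted probabilities.
g, y, w = expand_with_imputed(
    genotype=[0, 0, 1, 1, 2, 2, 1, 2],
    status=[0, 0, 0, 1, 1, 1, -1, -1],
    p_case=[0, 0, 0, 0, 0, 0, 0.3, 0.9],
)
beta = weighted_logistic_fit(g, y, w)  # beta[1] is the per-allele log odds ratio
```

Individuals with a confident prediction (p near 0 or 1) behave almost like observed cases or controls, while those near 0.5 contribute little information either way.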
Citation: Jun Young Park, Jang Jae Lee, Younghwa Lee, Dongsoo Lee, Jungsoo Gim, Lindsay Farrer, Kun Ho Lee, Sungho Won. Machine learning-based quantification for disease uncertainty increases the statistical power of genetic association studies. Bioinformatics, 2023-09-02. DOI: 10.1093/bioinformatics/btad534. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10539075/pdf/