Bioinformatics: latest articles, page 5

μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad552

Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik Köppl, Christina Boucher, Paola Bonizzoni

Motivation: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory.

Results: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and for computing set maximal exact match (SMEM) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets from the 1000 Genomes Project and UK Biobank. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT adapts techniques for the run-length compressed BWT to the PBWT (RLPBWT), and it keeps in memory only a succinct representation of the RLPBWT that still allows the efficient computation of SMEMs over the original panel.
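
To make the idea of a run-length encoded PBWT concrete, here is a minimal Python sketch. It is not the μ-PBWT implementation: it only builds the PBWT columns of a small binary panel with the standard stable-partition update and run-length encodes each column. The function names (`pbwt_columns`, `run_length_encode`) and the toy panel are assumptions made for illustration.

```python
def pbwt_columns(haplotypes):
    """Return the PBWT columns of a binary haplotype panel.

    haplotypes: list of equal-length lists of 0/1 alleles (h rows, w sites).
    Column k lists the alleles at site k in the order obtained by sorting
    the haplotypes on their reversed prefixes ending just before site k.
    """
    h = len(haplotypes)
    w = len(haplotypes[0]) if h else 0
    order = list(range(h))  # prefix order before any site has been read
    columns = []
    for k in range(w):
        columns.append([haplotypes[i][k] for i in order])
        # Stable partition: haplotypes carrying 0 at site k come first.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
    return columns


def run_length_encode(column):
    """Encode one column as a list of (allele, run_length) pairs."""
    runs = []
    for allele in column:
        if runs and runs[-1][0] == allele:
            runs[-1][1] += 1
        else:
            runs.append([allele, 1])
    return [tuple(run) for run in runs]


# Toy panel (hypothetical data): 4 haplotypes, 3 sites.
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
columns = pbwt_columns(panel)
encoded = [run_length_encode(column) for column in columns]
```

On real panels, haplotypes that share long prefixes end up adjacent in each column, so columns tend to contain few runs; an r-index-style structure stores per-run information rather than per-haplotype information.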

Availability and implementation: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.

Bioinformatics, volume 39, issue 9. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10502237/pdf/

Citations: 0

VoDEx: a Python library for time annotation and management of volumetric functional imaging data.

IF 4.4, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad568

Anna Nadtochiy, Peter Luu, Scott E Fraser, Thai V Truong

Summary: In functional imaging studies, accurately synchronizing the time course of experimental manipulations and stimulus presentations with resulting imaging data is crucial for analysis. Current software tools lack such functionality, requiring manual processing of the experimental and imaging data, which is error-prone and potentially non-reproducible. We present VoDEx, an open-source Python library that streamlines the data management and analysis of functional imaging data. VoDEx synchronizes the experimental timeline and events (e.g. presented stimuli, recorded behavior) with imaging data. VoDEx provides tools for logging and storing the timeline annotation, and enables retrieval of imaging data based on specific time-based and manipulation-based experimental conditions.
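
The kind of bookkeeping described above can be illustrated with a short Python sketch. It does not use the VoDEx API; the function names (`label_frames`, `volumes_matching`), the stimulus labels, and the numbers are all hypothetical. It assumes a stimulus cycle that repeats over the recording, a fixed number of frames per volume, and that a volume matches a condition only when every one of its frames carries that label.

```python
def label_frames(n_frames, cycle):
    """Assign a label to every frame by repeating a stimulus cycle.

    cycle: list of (label, duration_in_frames) pairs.
    """
    if sum(duration for _, duration in cycle) <= 0:
        raise ValueError("cycle must cover at least one frame")
    labels = []
    while len(labels) < n_frames:
        for label, duration in cycle:
            labels.extend([label] * duration)
    return labels[:n_frames]


def volumes_matching(n_frames, frames_per_volume, cycle, wanted):
    """Return indices of full volumes whose frames all carry `wanted`."""
    frame_labels = label_frames(n_frames, cycle)
    n_volumes = n_frames // frames_per_volume  # drop a trailing partial volume
    hits = []
    for v in range(n_volumes):
        chunk = frame_labels[v * frames_per_volume:(v + 1) * frames_per_volume]
        if all(label == wanted for label in chunk):
            hits.append(v)
    return hits


# Hypothetical recording: 24 frames, 4 frames per volume,
# 6 frames of "light" followed by 6 frames of "dark", repeated.
light_volumes = volumes_matching(24, 4, [("light", 6), ("dark", 6)], "light")
```

Volumes that straddle a change of condition are excluded here; whether to keep or drop such volumes is a design choice that depends on the experiment.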

Availability and implementation: VoDEx is an open-source Python library and can be installed via the "pip install" command. It is released under a BSD license, and its source code is publicly accessible on GitHub (https://github.com/LemonJust/vodex). A graphical interface is available as a napari-vodex plugin, which can be installed through the napari plugins menu or using "pip install." The source code for the napari plugin is available on GitHub (https://github.com/LemonJust/napari-vodex). The software version at the time of submission is archived at Zenodo (version v1.0.18, https://zenodo.org/record/8061531).

Supplementary information: Supplementary data are available at Bioinformatics online.

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10562951/pdf/

Citations: 0

DataDTA: a multi-feature and dual-interaction aggregation framework for drug-target binding affinity prediction.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad560

Yan Zhu, Lingling Zhao, Naifeng Wen, Junjie Wang, Chunyu Wang

Motivation: Accurate prediction of drug-target binding affinity (DTA) is crucial for drug discovery. The increase in the publication of large-scale DTA datasets enables the development of various computational methods for DTA prediction. Numerous deep learning-based methods have been proposed to predict affinities, some of which utilize only original sequence information or complex structures, but the effective combination of various sources of information with protein-binding pockets has not been fully exploited. Therefore, a new method that integrates available key information is urgently needed to predict DTA and accelerate the drug discovery process.

Results: In this study, we propose a novel deep learning-based predictor termed DataDTA to estimate the affinities of drug-target pairs. DataDTA utilizes descriptors of predicted pockets and sequences of proteins, as well as low-dimensional molecular features and SMILES strings of compounds, as inputs. Specifically, the pockets were predicted from the three-dimensional structure of proteins, and their descriptors were extracted as partial input features for DTA prediction. The molecular representation of compounds based on algebraic graph features was collected to supplement the input information of targets. Furthermore, to ensure effective learning of multiscale interaction features, a dual-interaction aggregation neural network strategy was developed. DataDTA was compared with state-of-the-art methods on different datasets, and the results showed that DataDTA is a reliable prediction tool for affinity estimation. Specifically, the concordance index (CI) of DataDTA is 0.806 and the Pearson correlation coefficient (R) is 0.814 on the test dataset, both higher than those of the other methods.
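
For readers unfamiliar with the concordance index reported above, the following Python sketch shows one common way to compute it for affinity predictions. It is not taken from the DataDTA code base; the function name `concordance_index`, the half-credit convention for tied predictions, and the example values are assumptions made for illustration.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs that the predictions rank correctly.

    A pair is comparable when its two true affinities differ.
    A pair with tied predictions counts as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            diff_true = y_true[i] - y_true[j]
            if diff_true == 0:
                continue  # equal true affinities carry no ranking information
            comparable += 1
            diff_pred = y_pred[i] - y_pred[j]
            if diff_pred == 0:
                concordant += 0.5
            elif (diff_true > 0) == (diff_pred > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical affinities: one of the six pairs is ranked in the wrong order.
measured = [5.0, 6.2, 7.1, 8.3]
predicted = [5.1, 6.0, 7.5, 7.0]
ci = concordance_index(measured, predicted)
```

A CI of 1.0 means every comparable pair is ordered correctly, and 0.5 is what random ranking would give on average. This quadratic loop is fine for small test sets; larger benchmarks usually call for a vectorized or sorted-based implementation.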

Availability and implementation: The codes and datasets of DataDTA are available at https://github.com/YanZhu06/DataDTA.

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516524/pdf/

Citations: 0

Correction to: Robust joint clustering of multi-omics single-cell data via multi-modal high-order neighborhood Laplacian matrix optimization.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad554

Citations: 0

Joint embedding of biological networks for cross-species functional alignment.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad529

Lechuan Li, Ruth Dannenfelser, Yu Zhu, Nathaniel Hejduk, Santiago Segarra, Vicky Yao

Motivation: Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this cross-species transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein-protein interactions to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem.

Results: We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structure and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within-species and between-species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA's embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential opportunities for drug repurposing and translational studies.

Availability and implementation: https://github.com/ylaboratory/ETNA.

Bioinformatics, volume 39, issue 9. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10477935/pdf/

Citations: 0

IsoFrog: a reversible jump Markov Chain Monte Carlo feature selection-based method for predicting isoform functions.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad530

Yiwei Liu, Changhuo Yang, Hong-Dong Li, Jianxin Wang

Motivation: A single gene may yield several isoforms with different functions through alternative splicing. Continuous efforts are devoted to developing machine-learning methods to predict isoform functions. However, existing methods do not consider the relevance of each feature to specific functions and ignore the noise caused by the irrelevant features. In this case, we hypothesize that constructing a feature selection framework to extract the function-relevant features might help improve the model accuracy in isoform function prediction.

Results: In this article, we present a feature selection-based approach named IsoFrog to predict isoform functions. First, IsoFrog adopts a reversible jump Markov Chain Monte Carlo (RJMCMC)-based feature selection framework to assess the importance of each feature to gene functions. Second, a sequential feature selection procedure is applied to select a subset of function-relevant features. This strategy screens the relevant features for the specific function while eliminating irrelevant ones, improving the effectiveness of the input features. Then, the selected features are input into our proposed method, modified domain-invariant partial least squares, which prioritizes the most likely positive isoform for each positive MIG and utilizes diPLS for isoform function prediction. Tested on three datasets, our method achieves superior performance over six state-of-the-art methods, and the RJMCMC-based feature selection framework outperforms three classic feature selection methods. We expect this proposed methodology will promote the identification of isoform functions and further inspire the development of new methods.

Availability and implementation: IsoFrog is freely available at https://github.com/genemine/IsoFrog.

Bioinformatics, volume 39, issue 9. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10491952/pdf/

Citations: 0

A neighborhood-regularization method leveraging multiview data for predicting the frequency of drug-side effects.

IF 5.8, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad532

Lin Wang, Chenhao Sun, Xianyu Xu, Jia Li, Wenjuan Zhang

Motivation: A critical issue in drug benefit-risk assessment is to determine the frequency of side effects, which is done through randomized controlled trials. Computationally predicted frequencies of drug side effects can be used to effectively guide the randomized controlled trials. However, it is more challenging to predict drug side effect frequencies, and thus only a few studies address this problem.

Results: In this work, we propose a neighborhood-regularization method (NRFSE) that leverages multiview data on drugs and side effects to predict the frequency of side effects. First, we adopt a class-weighted non-negative matrix factorization to decompose the drug-side effect frequency matrix, in which a Gaussian likelihood is used to model unknown drug-side effect pairs. Second, we design a multiview neighborhood regularization to integrate three drug attributes and two side effect attributes, respectively, so that the most similar drugs and the most similar side effects have similar latent signatures. The regularization can adaptively determine the weights of different attributes. We conduct extensive experiments on one benchmark dataset, and NRFSE improves the prediction performance compared with five state-of-the-art approaches. An independent test set of post-marketing side effects further validates the effectiveness of NRFSE.

Availability and implementation: Source code and datasets are available at https://github.com/linwang1982/NRFSE or https://codeocean.com/capsule/4741497/tree/v1.

Bioinformatics, volume 39, issue 9. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10491955/pdf/

Citations: 1

Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.

IF 4.4, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad512

Bryce Kille, Erik Garrison, Todd J Treangen, Adam M Phillippy

Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.

Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
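
The two ingredients named above, Jaccard similarity on k-mer sets and window-based sampling of k-mers, can be sketched in a few lines of Python. This is a simplified illustration and not the MashMap3 implementation: the function names, the BLAKE2b-based hash, and the example sequences are assumptions, and the sketch ignores reverse complements and treats repeated k-mers as distinct positions. With `s = 1` the sampling reduces to an ordinary minimizer scheme; with larger `s` each window contributes its `s` smallest-hash k-mers.

```python
import hashlib


def kmers(seq, k):
    """Return all k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def jaccard(a, b, k):
    """Exact Jaccard similarity between the k-mer sets of two sequences."""
    set_a, set_b = set(kmers(a, k)), set(kmers(b, k))
    union = set_a | set_b
    if not union:
        raise ValueError("both sequences are shorter than k")
    return len(set_a & set_b) / len(union)


def minmers(seq, k, w, s):
    """Positions of k-mers that are among the s smallest hashes
    in at least one window of w consecutive k-mers."""
    hashed = [(kmer_hash(x), i) for i, x in enumerate(kmers(seq, k))]
    sampled = set()
    for start in range(len(hashed) - w + 1):
        window = hashed[start:start + w]
        # Ties between equal hashes are broken by position.
        for _, position in sorted(window)[:s]:
            sampled.add(position)
    return sorted(sampled)


similarity = jaccard("ACGTACGT", "ACGTACGA", 4)  # 4 shared k-mers out of 5
sampled_positions = minmers("ACGTTGCATGCCGATAGCTA", k=4, w=5, s=2)
```

Re-sorting every window keeps the sketch short but costs O(w log w) per window; a production implementation would maintain the window's smallest hashes incrementally as the window slides.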

Availability and implementation: MashMap3 is available at https://github.com/marbl/MashMap.


Citations: 0

scAce: an adaptive embedding and clustering method for single-cell gene expression data.

IF 5.8 Zone 3 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad546

Xinwei He, Kun Qian, Ziqian Wang, Shirou Zeng, Hongwei Li, Wei Vivian Li

Motivation: Since the development of single-cell RNA sequencing (scRNA-seq) technologies, clustering analysis of single-cell gene expression data has been an essential tool for distinguishing known cell types and identifying novel ones. Although many methods are available for scRNA-seq clustering analysis, most of them are constrained by the need for a predetermined number of clusters or by their dependence on the selected initial cluster assignment.

Results: In this article, we propose an adaptive embedding and clustering method named scAce, which constructs a variational autoencoder to simultaneously learn cell embeddings and cluster assignments. In the scAce method, we develop an adaptive cluster merging approach which achieves improved clustering results without the need to estimate the number of clusters in advance. In addition, scAce provides an option to perform clustering enhancement, which can update and enhance cluster assignments based on previous clustering results from other methods. Based on computational analysis of both simulated and real datasets, we demonstrate that scAce outperforms state-of-the-art clustering methods for scRNA-seq data, and achieves better clustering accuracy and robustness.

Availability and implementation: The scAce package is implemented in Python 3.8 and is freely available from https://github.com/sldyns/scAce.
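The abstract does not say how scAce decides which clusters to merge, so the following is only a generic illustration of the idea that merging can replace a fixed cluster count. It starts from a deliberately over-split labeling of cell embeddings and repeatedly merges the two clusters whose centroids are closest, stopping when no pair is closer than a threshold. The centroid-distance criterion, the `threshold` parameter, and the function names are assumptions for this sketch, not the scAce algorithm.

```python
import math


def centroids(points, labels):
    """Mean embedding of each cluster, keyed by label."""
    sums, counts = {}, {}
    for p, lab in zip(points, labels):
        acc = sums.setdefault(lab, [0.0] * len(p))
        for j, x in enumerate(p):
            acc[j] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}


def merge_clusters(points, labels, threshold):
    """Merge the closest pair of clusters until every pair of centroids
    is at least `threshold` apart (hypothetical stopping rule)."""
    labels = list(labels)
    while True:
        cents = centroids(points, labels)
        names = sorted(cents)
        best = None
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                d = math.dist(cents[a], cents[b])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            return labels
        _, keep, drop = best
        # Relabel the absorbed cluster; the cluster count drops by one.
        labels = [keep if lab == drop else lab for lab in labels]


# Toy 2-D "embeddings" with an over-split initial assignment (4 labels).
cells = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
initial = [0, 1, 1, 2, 3]
merged = merge_clusters(cells, initial, threshold=1.0)
print(merged)
```

In this toy example the four initial labels collapse to two groups, and the final number of clusters falls out of the data and the threshold rather than being supplied up front.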



Citations: 0

Machine learning-based quantification for disease uncertainty increases the statistical power of genetic association studies.

IF 5.8 Zone 3 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad534

Jun Young Park, Jang Jae Lee, Younghwa Lee, Dongsoo Lee, Jungsoo Gim, Lindsay Farrer, Kun Ho Lee, Sungho Won

Motivation: Accommodating increasingly large samples is key to identifying associations between genetic variants and Alzheimer's disease (AD) in genome-wide association studies (GWAS). Accordingly, we aimed to develop a method that incorporates individuals with mild cognitive impairment or unknown cognitive status into GWAS using a machine learning-based AD prediction model.

Results: Simulation analyses showed that the weighting imputed phenotypes method increased statistical power compared with ordinary logistic regression using only AD cases and controls. Applied to real-world data, the penalized logistic method had the highest AUC (0.96) for AD prediction, and the weighting imputed phenotypes method performed well in terms of power. We identified an association (P < 5.0×10⁻⁸) of AD with several variants in the APOE region and with rs143625563 in LMX1A. Our method, which allows the inclusion of individuals with mild cognitive impairment, improves the statistical power of GWAS for AD. We discovered a novel association with LMX1A.

Availability and implementation: Simulation code can be accessed at https://github.com/Junkkkk/wGEE_GWAS.
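To make the idea of using imputed phenotypes concrete, here is a minimal sketch of one generic way to do it: a logistic regression in which confirmed controls and cases carry phenotype 0 or 1, while individuals of uncertain status carry a predicted disease probability between 0 and 1. This soft-label formulation, the function name, and the toy data are assumptions for illustration; the abstract does not specify the authors' estimator, and this should not be read as their method.

```python
import math


def fit_soft_logistic(x, y, iters=25):
    """Fit logit P(case) = b0 + b1 * x by Newton's method.

    x: genotype values for one variant (must not all be equal).
    y: phenotypes in [0, 1]; 0/1 are confirmed controls/cases and
       fractions are predicted probabilities (hypothetical inputs).
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = yi - p           # score residual
            v = p * (1.0 - p)    # variance weight
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g for the 2x2 information matrix.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
    return b0, b1


# Toy data: the last two carriers have uncertain status (probability 0.5).
genotype = [0, 0, 0, 0, 1, 1, 1, 1]
phenotype = [0, 0, 0, 1, 1, 1, 0.5, 0.5]
b0, b1 = fit_soft_logistic(genotype, phenotype)
print(round(b0, 4), round(b1, 4))
```

For this toy data the mean phenotype is 0.25 among non-carriers and 0.75 among carriers, so the fitted log odds ratio `b1` converges to 2·ln 3 ≈ 2.197. The point of the sketch is that uncertain individuals still contribute information instead of being dropped, which is the intuition behind the reported gain in power.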



Citations: 0