Bioinformatics: Latest Articles

IsoFrog: a reversible jump Markov Chain Monte Carlo feature selection-based method for predicting isoform functions.
IF 5.8, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2023-09-02, DOI: 10.1093/bioinformatics/btad530
Yiwei Liu, Changhuo Yang, Hong-Dong Li, Jianxin Wang

Motivation: A single gene may yield several isoforms with different functions through alternative splicing. Continuous efforts are devoted to developing machine-learning methods to predict isoform functions. However, existing methods do not consider the relevance of each feature to specific functions and ignore the noise caused by the irrelevant features. In this case, we hypothesize that constructing a feature selection framework to extract the function-relevant features might help improve the model accuracy in isoform function prediction.

Results: In this article, we present a feature selection-based approach named IsoFrog to predict isoform functions. First, IsoFrog adopts a reversible jump Markov Chain Monte Carlo (RJMCMC)-based feature selection framework to assess the importance of each feature to gene functions. Second, a sequential feature selection procedure is applied to select a subset of function-relevant features. This strategy retains the features relevant to a specific function while eliminating irrelevant ones, improving the effectiveness of the input features. The selected features are then fed into our proposed method, modified domain-invariant partial least squares, which prioritizes the most likely positive isoform for each positive multiple-isoform gene (MIG) and uses domain-invariant partial least squares (diPLS) for isoform function prediction. Tested on three datasets, our method outperforms six state-of-the-art methods, and the RJMCMC-based feature selection framework outperforms three classic feature selection methods. We expect the proposed methodology to promote the identification of isoform functions and to inspire the development of new methods.

Availability and implementation: IsoFrog is freely available at https://github.com/genemine/IsoFrog.
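
The two-stage selection described in the IsoFrog abstract (rank features by importance, then select a subset sequentially) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the importance scores are assumed to be given (in IsoFrog they would come from the RJMCMC stage), and the names `sequential_select` and `score_subset` are hypothetical. The sketch adds features in descending order of importance and keeps the best-scoring prefix.

```python
from typing import Callable, List, Sequence


def sequential_select(
    importance: Sequence[float],
    score_subset: Callable[[List[int]], float],
) -> List[int]:
    """Add features in descending importance and keep the best-scoring prefix.

    `importance` is assumed to be precomputed (hypothetical input);
    `score_subset` is any user-supplied evaluation, e.g. cross-validated AUC.
    """
    # Rank feature indices from most to least important.
    order = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)

    best_subset: List[int] = []
    best_score = float("-inf")
    current: List[int] = []
    for j in order:
        current.append(j)
        score = score_subset(current)
        if score > best_score:  # strict improvement keeps the smaller subset on ties
            best_score = score
            best_subset = list(current)
    return best_subset


if __name__ == "__main__":
    # Toy evaluation: features 1 and 2 help, every other feature adds noise.
    relevant = {1, 2}

    def toy_score(subset: List[int]) -> float:
        return sum(1.0 if j in relevant else -0.5 for j in subset)

    print(sequential_select([0.1, 0.9, 0.5, 0.05], toy_score))  # prints [1, 2]
```

In practice `score_subset` would train and validate a model on the candidate subset; the toy scorer here only makes the behavior easy to follow.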

Citations: 0
DataDTA: a multi-feature and dual-interaction aggregation framework for drug-target binding affinity prediction.
IF 5.8, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2023-09-02, DOI: 10.1093/bioinformatics/btad560
Yan Zhu, Lingling Zhao, Naifeng Wen, Junjie Wang, Chunyu Wang

Motivation: Accurate prediction of drug-target binding affinity (DTA) is crucial for drug discovery. The growing number of published large-scale DTA datasets enables the development of various computational methods for DTA prediction. Numerous deep learning-based methods have been proposed to predict affinities, some of which utilize only original sequence information or complex structures, but the effective combination of various sources of information with protein-binding pockets has not been fully explored. Therefore, a new method that integrates the available key information is urgently needed to predict DTA and accelerate the drug discovery process.

Results: In this study, we propose a novel deep learning-based predictor termed DataDTA to estimate the affinities of drug-target pairs. DataDTA takes as inputs descriptors of predicted pockets and sequences of proteins, as well as low-dimensional molecular features and SMILES strings of compounds. Specifically, the pockets were predicted from the three-dimensional structure of proteins, and their descriptors were extracted as part of the input features for DTA prediction. The molecular representation of compounds based on algebraic graph features was collected to supplement the input information of targets. Furthermore, to ensure effective learning of multiscale interaction features, a dual-interaction aggregation neural network strategy was developed. DataDTA was compared with state-of-the-art methods on different datasets, and the results showed that DataDTA is a reliable prediction tool for affinity estimation. Specifically, on the test dataset DataDTA achieves a concordance index (CI) of 0.806 and a Pearson correlation coefficient (R) of 0.814, both higher than those of other methods.

Availability and implementation: The codes and datasets of DataDTA are available at https://github.com/YanZhu06/DataDTA.
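
The two metrics quoted in the DataDTA abstract, the concordance index (CI) and the Pearson correlation coefficient (R), can be computed as in the following minimal Python sketch. It uses the common pairwise definition of CI (pairs with equal true affinity are skipped, tied predictions count as half); whether DataDTA uses exactly this tie handling is an assumption, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue  # pairs with equal true affinity carry no ordering information
        comparable += 1
        if p1 == p2:
            concordant += 0.5  # tied predictions count as half-concordant
        elif (t1 > t2) == (p1 > p2):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    true_affinity = [5.0, 6.0, 7.0, 8.0]
    predicted = [5.1, 6.2, 6.1, 8.3]
    # Five of the six pairs are ordered correctly, so CI = 5/6.
    print(concordance_index(true_affinity, predicted))
    print(pearson_r(true_affinity, predicted))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration but would need a faster formulation for large test sets.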

Citations: 0
A neighborhood-regularization method leveraging multiview data for predicting the frequency of drug-side effects.
IF 5.8, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2023-09-02, DOI: 10.1093/bioinformatics/btad532
Lin Wang, Chenhao Sun, Xianyu Xu, Jia Li, Wenjuan Zhang

Motivation: A critical issue in drug benefit-risk assessment is determining the frequency of side effects, which is done through randomized controlled trials. Computationally predicted frequencies of drug side effects can be used to effectively guide randomized controlled trials. However, predicting the frequency of drug side effects is more challenging, and thus only a few studies have addressed this problem.

Results: In this work, we propose a neighborhood-regularization method (NRFSE) that leverages multiview data on drugs and side effects to predict the frequency of side effects. First, we adopt a class-weighted non-negative matrix factorization to decompose the drug-side effect frequency matrix, in which a Gaussian likelihood is used to model unknown drug-side effect pairs. Second, we design a multiview neighborhood regularization to integrate three drug attributes and two side effect attributes, respectively, so that the most similar drugs and the most similar side effects have similar latent signatures. The regularization can adaptively determine the weights of different attributes. We conduct extensive experiments on one benchmark dataset, and NRFSE improves the prediction performance compared with five state-of-the-art approaches. An independent test set of post-marketing side effects further validates the effectiveness of NRFSE.

Availability and implementation: Source code and datasets are available at https://github.com/linwang1982/NRFSE or https://codeocean.com/capsule/4741497/tree/v1.

Citations: 1
Joint embedding of biological networks for cross-species functional alignment.
IF 5.8, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2023-09-02, DOI: 10.1093/bioinformatics/btad529
Lechuan Li, Ruth Dannenfelser, Yu Zhu, Nathaniel Hejduk, Santiago Segarra, Vicky Yao

Motivation: Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this cross-species transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein-protein interactions to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem.

Results: We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structure and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within-species and between-species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA's embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential opportunities in drug repurposing and translational studies.

Availability and implementation: https://github.com/ylaboratory/ETNA.

Citations: 0
Correction to: Robust joint clustering of multi-omics single-cell data via multi-modal high-order neighborhood Laplacian matrix optimization.
IF 5.8, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2023-09-02, DOI: 10.1093/bioinformatics/btad554
Citations: 0
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.
IF 4.4, Zone 3 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2023-09-02, DOI: 10.1093/bioinformatics/btad512
Bryce Kille, Erik Garrison, Todd J Treangen, Adam M Phillippy

Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.

Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.

Availability and implementation: MashMap3 is available at https://github.com/marbl/MashMap.
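
To make the terms in the minmer abstract concrete, the following Python sketch computes the exact Jaccard similarity of two k-mer sets and a simple window-based sampling scheme in which `s = 1` corresponds to minimizer winnowing and `s > 1` keeps the `s` smallest-hash k-mers per window. It is a naive illustration under stated assumptions, not MashMap's rolling implementation: the hash function (BLAKE2b), the tie-breaking by position, and the names `kmers`, `jaccard`, and `window_sample` are all choices made here for the example.

```python
import hashlib
from typing import Iterable, List


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two k-mer collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def window_sample(seq: str, k: int, w: int, s: int = 1) -> List[int]:
    """Positions of k-mers that are among the s smallest hashes in some window.

    Each window spans w consecutive k-mers. With s = 1 this is minimizer
    winnowing; with s > 1 several k-mers are kept per window.
    """
    hashes = [kmer_hash(x) for x in kmers(seq, k)]
    sampled = set()
    for start in range(max(1, len(hashes) - w + 1)):
        window = range(start, min(start + w, len(hashes)))
        # Break hash ties by position so the result is deterministic.
        sampled.update(sorted(window, key=lambda i: (hashes[i], i))[:s])
    return sorted(sampled)


if __name__ == "__main__":
    a, b = "ACGTACGT", "ACGTTCGT"
    print(jaccard(kmers(a, 3), kmers(b, 3)))  # 2 shared of 7 distinct 3-mers
    print(window_sample(a, k=3, w=4, s=1))
    print(window_sample(a, k=3, w=4, s=2))
```

Re-sorting every window costs O(w log w) per position; a production implementation would maintain the window incrementally, as the abstract's "rolling minhash" suggests.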

Citations: 0
Correction to: "Retraction of: DeepCRISTL: deep transfer learning to predict CRISPR/Cas9 functional and endogenous on-target editing efficiency".
IF 5.8, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2023-09-02, DOI: 10.1093/bioinformatics/btad562
This is a correction to “Retraction of: DeepCRISTL: deep transfer learning to predict CRISPR/Cas9 functional and endogenous on-target editing efficiency”, Bioinformatics, Volume 39, Issue 7, July 2023, https://doi.org/10.1093/bioinformatics/btad412. The retraction notice text has been updated, because we have subsequently discovered that the authors did not receive the journal’s communications to them asking them to address the flaws. This correction does not change the outcome or decision to retract.
Citations: 0
Machine learning-based quantification for disease uncertainty increases the statistical power of genetic association studies.
IF 5.8, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2023-09-02, DOI: 10.1093/bioinformatics/btad534
Jun Young Park, Jang Jae Lee, Younghwa Lee, Dongsoo Lee, Jungsoo Gim, Lindsay Farrer, Kun Ho Lee, Sungho Won

Motivation: Allowing for increasingly large samples is key to identifying the association of genetic variants with Alzheimer's disease (AD) in genome-wide association studies (GWAS). Accordingly, we aimed to develop a method that incorporates patients with mild cognitive impairment and unknown cognitive status in GWAS using a machine learning-based AD prediction model.

Results: Simulation analyses showed that the weighting imputed phenotypes method increased the statistical power compared to ordinary logistic regression using only AD cases and controls. Applied to real-world data, the penalized logistic method had the highest AUC (0.96) for AD prediction, and the weighting imputed phenotypes method performed well in terms of power. We identified an association (P < 5.0 × 10⁻⁸) of AD with several variants in the APOE region and with rs143625563 in LMX1A. Our method, which allows the inclusion of individuals with mild cognitive impairment, improves the statistical power of GWAS for AD. We discovered a novel association with LMX1A.

Availability and implementation: Simulation codes can be accessed at https://github.com/Junkkkk/wGEE_GWAS.

Citations: 0
RNA 3D structure modeling by fragment assembly with small-angle X-ray scattering restraints.
IF 5.8, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2023-09-02, DOI: 10.1093/bioinformatics/btad527
Grzegorz Chojnowski, Rafał Zaborowski, Marcin Magnus, Sunandan Mukherjee, Janusz M Bujnicki

Summary: Structure determination is a key step in the functional characterization of many non-coding RNA molecules. High-resolution RNA 3D structure determination efforts, however, are not keeping up with the pace of discovery of new non-coding RNA sequences. This increases the importance of computational approaches and low-resolution experimental data, such as those from small-angle X-ray scattering (SAXS) experiments. We present RNA Masonry, a computer program and web service for fully automated modeling of RNA 3D structures. It assembles RNA fragments into geometrically plausible models that meet user-provided secondary structure constraints, restraints on tertiary contacts, and SAXS data. We illustrate the method with detailed benchmarks and with its application to structural studies of viral RNAs with SAXS restraints.

Availability and implementation: The program web server is available at http://iimcb.genesilico.pl/rnamasonry. The source code is available at https://gitlab.com/gchojnowski/rnamasonry.

Citations: 0
Genome-wide multimediator analyses using the generalized Berk-Jones statistics with the composite test.
IF 5.8 Zone 3 (Biology) Q1 Mathematics Pub Date: 2023-09-02 DOI: 10.1093/bioinformatics/btad544
En-Yu Lai, Yen-Tsung Huang

Motivation: Mediation analysis is performed to evaluate the effects of a hypothetical causal mechanism that marks the progression from an exposure, through mediators, to an outcome. In the age of high-throughput technologies, it has become routine to assess numerous potential mechanisms at the genome or proteome scale. With this has come the need to address multiple testing. In a sparse scenario where only a few genes or proteins are causally involved, conventional methods for assessing mediation effects lose statistical power because the composite null distribution behind this experiment cannot be attained. This power loss reduces the number of true mechanisms identified after multiple-testing correction. To fairly delineate a uniform distribution under the composite null, Huang (Genome-wide analyses of sparse mediation effects under composite null hypotheses. Ann Appl Stat 2019a;13:60-84; AoAS) proposed the composite test to provide adjusted P-values for single-mediator analyses.
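
The power loss described above can be illustrated with a small simulation. The sketch below is not the composite test or anything from MACtest; it assumes, for illustration only, that the "conventional method" is the joint-significance (max-P) rule, which declares mediation when both the exposure-to-mediator and the mediator-to-outcome P-values fall below the threshold, and that the two P-values are independent. The function name and parameters are hypothetical. When both paths are null, the maximum of two independent uniform P-values falls below alpha with probability alpha squared, so the rule rejects far less often than the nominal level.

```python
import random


def max_p_rejection_rate(n_tests=200_000, alpha=0.05, seed=0):
    """Fraction of tests rejected by the max-P rule when BOTH paths are null.

    Assumption for illustration: the two P-values are independent and
    Uniform(0, 1), i.e. neither the exposure-to-mediator path nor the
    mediator-to-outcome path carries any effect.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        p_exposure_to_mediator = rng.random()  # null P-value for path 1
        p_mediator_to_outcome = rng.random()   # null P-value for path 2
        # Joint-significance rule: reject only if both P-values are small.
        if max(p_exposure_to_mediator, p_mediator_to_outcome) < alpha:
            rejections += 1
    return rejections / n_tests


rate = max_p_rejection_rate()
print(f"nominal level: 0.05, empirical rejection rate: {rate:.4f}")
```

The empirical rate should land near 0.05 squared (0.0025) rather than 0.05, which is the conservativeness that an adjusted P-value with a uniform distribution under the composite null is meant to remove.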

Results: Our contribution is to extend the method to multimediator analyses, which are commonly encountered in genomic studies and can accommodate various biological interests. Using the generalized Berk-Jones statistics with the composite test, we propose a multivariate approach that favors dense and diverse mediation effects, a decorrelation approach that favors sparse and consistent effects, and a hybrid approach that combines the advantages of both. Our analysis suite has been implemented as an R package, MACtest. Its utility is demonstrated by analyzing the lung adenocarcinoma datasets from The Cancer Genome Atlas and the Clinical Proteomic Tumor Analysis Consortium. We further investigate the genes and networks whose expression may be regulated by smoking-induced epigenetic aberrations.

Availability and implementation: The R package MACtest is available at https://github.com/roqe/MACtest.

Citations: 0