
Statistical Applications in Genetics and Molecular Biology: Latest Articles

Patterns of differential expression by association in omic data using a new measure based on ensemble learning.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2023-11-23 | eCollection Date: 2023-01-01 | DOI: 10.1515/sagmb-2023-0009
Jorge M Arevalillo, Raquel Martin-Arevalillo

The ongoing development of high-throughput technologies allows the simultaneous monitoring of expression levels for hundreds or thousands of biological inputs, and has led to the proliferation of what have been termed omic data sources. One relevant issue when analyzing such data is the detection of differential expression across two experimental conditions, clinical statuses or classes of a biological outcome. While many univariate data analysis approaches have been developed to address this issue, strategies for assessing interaction patterns of differential expression are scarce in the literature and have been limited to ad hoc solutions. This paper addresses the problem by exploiting an ensemble learning algorithm, random forests, to propose a measure that assesses the differential expression explained by the interaction of omic variables, so that subtle biological patterns may be uncovered. The out-of-bag error rate, a by-product of fitting a random forests classifier that estimates its prediction error, is used to construct the new measure. Its performance is studied in synthetic scenarios, and it is also applied to real studies on SARS-CoV-2 and colon cancer data, where it uncovers associations that remain undetected by other methods. Our proposal aims to provide a novel approach that may help experts in the biomedical and life sciences unravel interaction patterns that shed light on the molecular mechanisms underlying biological and clinical outcomes.
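
To make the role of the out-of-bag (OOB) error concrete, here is a minimal sketch. It is not the authors' measure, which the abstract does not define, and it swaps random forests for a much simpler bagged 1-nearest-neighbour ensemble so that it runs on the Python standard library alone. The XOR-style synthetic data, the function names (`make_xor_data`, `oob_error`) and all parameter values are assumptions made for illustration. What it shows: two variables that carry no class signal on their own can still give a low OOB error when used jointly.

```python
import random

def make_xor_data(n, rng):
    """Synthetic data: class is 1 when x1 and x2 share a sign (pure interaction)."""
    X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    y = [1 if a * b > 0 else 0 for a, b in X]
    return X, y

def oob_error(X, y, features, n_learners=30, rng=None):
    """OOB error of a bagged 1-NN ensemble restricted to the given feature indices."""
    rng = rng or random.Random(0)
    n = len(X)
    votes = [[0, 0] for _ in range(n)]  # per-sample vote counts for class 0 / class 1
    for _ in range(n_learners):
        # Bootstrap sample drawn with replacement; keep the distinct indices.
        in_bag = {rng.randrange(n) for _ in range(n)}
        for i in range(n):
            if i in in_bag:
                continue  # only samples left out of this bootstrap get a vote
            nearest = min(
                in_bag,
                key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in features),
            )
            votes[i][y[nearest]] += 1
    # Score only samples that were out of bag at least once.
    scored = [i for i in range(n) if sum(votes[i]) > 0]
    wrong = sum(1 for i in scored if (votes[i][1] > votes[i][0]) != (y[i] == 1))
    return wrong / len(scored)

X, y = make_xor_data(200, random.Random(42))
err_x1 = oob_error(X, y, features=[0], rng=random.Random(1))
err_x2 = oob_error(X, y, features=[1], rng=random.Random(1))
err_pair = oob_error(X, y, features=[0, 1], rng=random.Random(1))
print(f"OOB error  x1 only: {err_x1:.2f}  x2 only: {err_x2:.2f}  x1 and x2: {err_pair:.2f}")
```

In this toy setting each variable alone should score close to chance, while the pair scores far better; a gap of that kind is the sort of signal an OOB-based interaction measure could exploit.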

Citations: 0
Integrated regulatory and metabolic networks of the tumor microenvironment for therapeutic target prioritization.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2023-11-21 | eCollection Date: 2023-01-01 | DOI: 10.1515/sagmb-2022-0054
Tiange Shi, Han Yu, Rachael Hageman Blair

Translation of genomic discovery, such as single-cell sequencing data, to clinical decisions remains a longstanding bottleneck in the field. Meanwhile, computational systems biological models, such as cellular metabolism models and cell signaling pathways, have emerged as powerful approaches to provide efficient predictions in metabolites and gene expression levels, respectively. However, there has been limited research on the integration between these two models. This work develops a methodology for integrating computational models of probabilistic gene regulatory networks with a constraint-based metabolism model. By using probabilistic reasoning with Bayesian Networks, we aim to predict cell-specific changes under different interventions, which are embedded into the constraint-based models of metabolism. Applications to single-cell sequencing data of glioblastoma brain tumors generate predictions about the effects of pharmaceutical interventions on the regulatory network and downstream metabolisms in different cell types from the tumor microenvironment. The model presents possible insights into treatments that could potentially suppress anaerobic metabolism in malignant cells with minimal impact on other cell types' metabolism. The proposed integrated model can guide therapeutic target prioritization, the formulation of combination therapies, and future drug discovery. This model integration framework is also generalizable to other applications, such as different cell types, organisms, and diseases.

Citations: 0
Randomized singular value decomposition for integrative subtype analysis of 'omics data' using non-negative matrix factorization.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2023-11-09 | eCollection Date: 2023-01-01 | DOI: 10.1515/sagmb-2022-0047
Yonghui Ni, Jianghua He, Prabhakar Chalise

Integration of multiple 'omics datasets for differentiating cancer subtypes is a powerful technique that leverages the consistent and complementary information across multi-omics data. Matrix factorization is a common technique used in integrative clustering for identifying latent subtype structure across multi-omics data. High dimensionality of the omics data and long computation time have been common challenges for clustering methods. To address these challenges, we propose randomized singular value decomposition (RSVD) for integrative clustering using non-negative matrix factorization: intNMF-rsvd. The method uses RSVD to reduce the dimensionality by projecting the data into an eigenvector space with a user-specified lower rank. Clustering analysis is then carried out by estimating a common basis matrix across the projected multi-omics datasets. The performance of the proposed method was assessed using simulated datasets and compared with six state-of-the-art integrative clustering methods using real-life datasets from The Cancer Genome Atlas study. intNMF-rsvd was found to work efficiently and competitively compared with standard intNMF and other multi-omics clustering methods. Most importantly, intNMF-rsvd can handle a large number of features and significantly reduces the computation time. The identified subtypes can be used in further clinical association studies to understand the etiology of the disease.
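
The dimension-reduction step can be sketched with a generic randomized SVD in NumPy. This is the textbook random-projection scheme (Gaussian test matrix, QR factorization, SVD of a small matrix), not the intNMF-rsvd implementation; the function name `randomized_svd`, the `oversample` parameter and the toy matrix are assumptions for illustration.

```python
import numpy as np

def randomized_svd(A, rank, oversample=5, seed=0):
    """Approximate the top-`rank` SVD of A via a random projection."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Random Gaussian test matrix; a few extra columns improve the captured range.
    omega = rng.standard_normal((n_cols, rank + oversample))
    # Orthonormal basis Q for the (approximate) column space of A.
    Q, _ = np.linalg.qr(A @ omega)
    # Project A onto that basis and decompose the much smaller matrix.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank, :]

# Toy matrix: 500 features x 40 samples with exact rank 3 (hypothetical data).
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 40))
U, s, Vt = randomized_svd(A, rank=3)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(f"relative reconstruction error: {rel_err:.2e}")
```

The expensive decomposition is applied only to the small projected matrix `B`, which is where the computational saving comes from when the number of features is large.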

Citations: 0
CAT PETR: a graphical user interface for differential analysis of phosphorylation and expression data.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2023-08-21 | eCollection Date: 2023-01-01 | DOI: 10.1515/sagmb-2023-0017
Keegan Flanagan, Steven Pelech, Yossef Av-Gay, Khanh Dao Duc

Antibody microarray data provide a powerful, high-throughput means of monitoring global changes in cellular response to perturbation or genetic manipulation. However, while collecting such data has become increasingly accessible, a lack of specific computational tools has limited their analysis. Here we present CAT PETR, a user-friendly web application for the differential analysis of expression and phosphorylation data collected via antibody microarrays. Our application addresses the limitations of other GUI-based tools by providing various data input options and visualizations. To illustrate its capabilities on real data, we show that CAT PETR both replicates previous findings and reveals additional insights, using its advanced visualization and statistical options.

Citations: 0
Improving the accuracy and internal consistency of regression-based clustering of high-dimensional datasets.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2023-07-25 | eCollection Date: 2023-01-01 | DOI: 10.1515/sagmb-2022-0031
Bo Zhang, Jianghua He, Jinxiang Hu, Prabhakar Chalise, Devin C Koestler

Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is in part due to inherent limitations associated with the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods (i.e. Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minmax Convex Penalty (MCP), and Adaptive-Lasso) within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso within the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.

Citations: 0
A fast and efficient approach for gene-based association studies of ordinal phenotypes.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2023-01-01 | DOI: 10.1515/sagmb-2021-0068
Nanxing Li, Lili Chen, Yajing Zhou, Qianran Wei

Many human disease conditions need to be measured by ordinal phenotypes, so the analysis of ordinal phenotypes is valuable in genome-wide association studies (GWAS). However, existing association methods for dichotomous or quantitative phenotypes are not appropriate for ordinal phenotypes. Therefore, based on an aggregated Cauchy association test, we propose a fast and efficient method to test the association between genetic variants and an ordinal phenotype. To enrich the association signals of rare variants, we first use the burden method to aggregate them. We then separately test the significance of the aggregated rare variants and of the other common variants. Finally, the combination of the transformed variant-level P values is taken as the test statistic, which approximately follows a Cauchy distribution under the null hypothesis. Extensive simulation studies and an analysis of GAW19 data show that our proposed method is powerful and computationally fast as a gene-based method. In particular, when the proportion of causal variants in a gene is extremely low, our method has better performance.
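
The final combination step can be illustrated with the standard Cauchy combination rule, in which each p-value is mapped to a Cauchy variate through tan((0.5 - p)π) and the weighted average is converted back to a p-value. This is a sketch of that general rule only: the equal default weights, the function name `cauchy_combination` and the example p-values are assumptions, and the ordinal-regression tests and burden aggregation that would produce the inputs are not shown.

```python
import math

def cauchy_combination(pvalues, weights=None):
    """Combine p-values (each strictly between 0 and 1) into a single p-value."""
    if weights is None:
        weights = [1.0] * len(pvalues)  # assumption: equal weights
    total = sum(weights)
    # Map each p-value to a standard Cauchy variate and take the weighted mean,
    # which is again approximately standard Cauchy under the null hypothesis.
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)) / total
    # Upper-tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(t) / math.pi

# Hypothetical gene: one burden p-value for the pooled rare variants,
# followed by p-values for three common variants.
p_gene = cauchy_combination([1e-6, 0.5, 0.5, 0.5])
print(f"gene-level p-value: {p_gene:.2e}")
```

Because the tail of the Cauchy distribution is dominated by the largest term, a single very small p-value drives the combined result, which is consistent with the abstract's remark about genes with a very low proportion of causal variants.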

Citations: 0
A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction.
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2023-01-01 | DOI: 10.1515/sagmb-2022-0057
Lavkush Sharma, Akshay Deepak, Ashish Ranjan, Gopalakrishnan Krishnasamy

Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN + BiGRU-Attention model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. On the human dataset, the proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO by 1.9 % for the cellular component prediction task, 3.8 % for the molecular function prediction task and 0.6 % for the biological process prediction task; on the yeast dataset, the improvements are 2.4 %, 5.2 % and 1.2 %, respectively.
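
The abstract reports results as Fmax without defining it. The sketch below assumes the protein-centric definition commonly used in protein function prediction benchmarks: for each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all proteins, and Fmax is the best resulting F-measure. The function name `fmax`, the threshold grid and the toy annotations are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def fmax(true_terms, scores, thresholds):
    """Protein-centric Fmax.

    true_terms: list of non-empty sets of true terms, one set per protein.
    scores:     list of dicts mapping predicted term -> score in [0, 1].
    """
    best = 0.0
    n = len(true_terms)
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for truth, pred_scores in zip(true_terms, scores):
            predicted = {term for term, s in pred_scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:  # precision only counts proteins with a prediction
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(truth)
        if not precisions:
            continue  # nothing predicted at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / n
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

thresholds = [i / 100 for i in range(1, 100)]
true_terms = [{"GO:A", "GO:B"}, {"GO:C"}]  # placeholder term labels
scores = [{"GO:A": 0.9, "GO:B": 0.4, "GO:X": 0.6}, {"GO:C": 0.8, "GO:Y": 0.3}]
print(f"Fmax = {fmax(true_terms, scores, thresholds):.4f}")
```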

Citations: 0
CAT PETR: a graphical user interface for differential analysis of phosphorylation and expression data
IF 0.9 | Zone 4 (Mathematics) | Q3 Mathematics | Pub Date: 2023-01-01 | DOI: 10.1101/2023.01.04.522665
Keegan Flanagan, S. Pelech, Y. Av‐Gay, K. Dao Duc
Antibody microarray data provide a powerful, high-throughput means of monitoring global changes in cellular response to perturbation or genetic manipulation. However, while collecting such data has become increasingly accessible, a lack of specific computational tools has limited their analysis. Here we present CAT PETR, a user-friendly web application for the differential analysis of expression and phosphorylation data collected via antibody microarrays. Our application addresses the limitations of other GUI-based tools by providing various data input options and visualizations. To illustrate its capabilities on real data, we show that CAT PETR both replicates previous findings and reveals additional insights, using its advanced visualization and statistical options.
Citations: 0
Accurate and fast small p-value estimation for permutation tests in high-throughput genomic data analysis with the cross-entropy method.
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date : 2023-01-01 DOI: 10.1515/sagmb-2021-0067
Yang Shi, Weiping Shi, Mengqiao Wang, Ji-Hyun Lee, Huining Kang, Hui Jiang

Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in applying permutation tests in genomic studies is that an enormous number of permutations is often needed to obtain reliable estimates of very small p-values, which makes the computation intensive. To address this issue, we develop algorithms for the accurate and efficient estimation of small p-values in permutation tests for paired and independent two-group genomic data. Our approaches leverage a novel framework that parameterizes the permutation sample spaces of these two types of data using the Bernoulli and conditional Bernoulli distributions, respectively, combined with the cross-entropy method. We demonstrate the performance of the proposed algorithms on two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies, and compare them with existing methods such as crude permutations and SAMC. The results show that our approaches can achieve orders-of-magnitude gains in computational efficiency when estimating small p-values. Our approaches offer promising solutions for improving the computational efficiency of existing permutation test procedures and for developing new permutation-based testing methods in genomic data analysis.
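
To make the computational bottleneck concrete, the following is a minimal sketch of the crude permutation baseline for paired data. It is not the cross-entropy algorithm of the paper. The function name `paired_sign_flip_pvalue`, the sum-of-differences statistic, and the add-one correction are assumptions chosen for illustration. Under the null hypothesis each paired difference is equally likely to carry either sign, so one permutation is a vector of independent Bernoulli(0.5) sign flips.

```python
import random


def paired_sign_flip_pvalue(diffs, n_perm=10_000, seed=0):
    """Crude Monte Carlo permutation p-value for paired differences.

    Hypothetical helper: the statistic (absolute sum of differences) is an
    illustrative choice, not the statistic used in the paper.
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # One permutation: flip the sign of each difference with probability 0.5.
        stat = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Twenty pairs that all shift in the same direction.
    print(paired_sign_flip_pvalue([1.0] * 20, n_perm=1_000))
```

With this estimator the smallest reportable value is 1 / (n_perm + 1), so resolving a p-value near 1e-8 by crude sampling would take on the order of 1e8 or more permutations per test. That floor is the cost that motivates more efficient estimators.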

Citations: 1
A Bayesian model to identify multiple expression patterns with simultaneous FDR control for a multi-factor RNA-seq experiment.
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date : 2023-01-01 DOI: 10.1515/sagmb-2022-0025
Yuanyuan Bian, Chong He, Jing Qiu

It is often of research interest to identify genes that satisfy a particular expression pattern across different conditions such as tissues or genotypes. One common practice is to perform differential expression analysis for each condition separately and then take the intersection of the differentially expressed (DE) genes or non-DE genes under each condition to obtain genes that satisfy a particular pattern. Such a method can lead to many false positives, especially when the desired gene expression pattern involves equivalent expression under one condition. In this paper, we apply a Bayesian partition model to identify genes of all desired patterns while simultaneously controlling their false discovery rates (FDRs). Our simulation studies show that the common practice fails to control group-specific FDRs for patterns involving equivalent expression, while the proposed Bayesian method simultaneously controls group-specific FDRs at all settings studied. In addition, for identifying patterns involving only DE genes, the proposed method is more powerful when the FDR of the common practice is under control. Our simulation studies also show that identifying patterns involving equivalent expression is an inherently more challenging problem than identifying patterns involving only differential expression. Therefore, larger sample sizes are required to reach the same target power for the former type of pattern than for the latter.
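
The common practice criticized above can be sketched as follows. This is an illustration of the intersection approach only, not of the Bayesian partition model. The function names `bh_reject` and `de_in_a_not_in_b`, the use of the Benjamini-Hochberg procedure for each condition, and the toy p-values are all assumptions made for the example.

```python
def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up: indices of hypotheses rejected at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Track the largest rank whose p-value falls under the BH threshold.
        if pvalues[i] <= alpha * rank / m:
            k_max = rank
    return set(order[:k_max])


def de_in_a_not_in_b(p_a, p_b, alpha=0.05):
    """Intersection practice: genes called DE in condition A and non-DE in B."""
    de_a = bh_reject(p_a, alpha)
    de_b = bh_reject(p_b, alpha)
    # "Non-DE in B" here only means "not rejected in B".
    return de_a - de_b


if __name__ == "__main__":
    p_a = [0.001, 0.8, 0.002, 0.9]  # toy per-gene p-values, condition A
    p_b = [0.9, 0.7, 0.001, 0.6]    # toy per-gene p-values, condition B
    print(de_in_a_not_in_b(p_a, p_b))  # → {0}
```

In this sketch a gene lands in the "DE in A, equivalent in B" group simply because its test in B was not rejected. Failing to reject is not evidence of equivalent expression, and nothing in the two separate analyses controls the error rate of the combined pattern call.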

Citations: 0