Statistical Applications in Genetics and Molecular Biology: latest articles (page 2)

Bayesian LASSO for population stratification correction in rare haplotype association studies.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2024-01-19 eCollection Date: 2024-01-01 DOI: 10.1515/sagmb-2022-0034

Zilu Liu, Asuman Seda Turkmen, Shili Lin

Population stratification (PS) is one major source of confounding in both single nucleotide polymorphism (SNP) and haplotype association studies. To address PS, principal component regression (PCR) and linear mixed model (LMM) are the current standards for SNP associations, which are also commonly borrowed for haplotype studies. However, the underfitting and overfitting problems introduced by PCR and LMM, respectively, have yet to be addressed. Furthermore, there have been only a few theoretical approaches proposed to address PS specifically for haplotypes. In this paper, we propose a new method under the Bayesian LASSO framework, QBLstrat, to account for PS in identifying rare and common haplotypes associated with a continuous trait of interest. QBLstrat utilizes a large number of principal components (PCs) with appropriate priors to sufficiently correct for PS, while shrinking the estimates of unassociated haplotypes and PCs. We compare the performance of QBLstrat with the Bayesian counterparts of PCR and LMM and a current method, haplo.stats. Extensive simulation studies and real data analyses show that QBLstrat is superior in controlling false positives while maintaining competitive power for identifying true positives under PS.
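
As a rough illustration of the kind of model the abstract describes, a generic Bayesian LASSO regression with haplotype and PC terms can be written as below. This is a textbook-style sketch, not the QBLstrat specification: the symbols $h_{ij}$ (haplotype covariates), $z_{ik}$ (PC scores), and the separate rate parameters $\lambda_\beta$ and $\lambda_\gamma$ are assumptions introduced here, since the abstract only says that "appropriate priors" are used.

```latex
% Generic Bayesian LASSO sketch (assumed notation, not the paper's exact model)
\begin{align}
  y_i &= \alpha + \sum_{j=1}^{J} \beta_j h_{ij} + \sum_{k=1}^{K} \gamma_k z_{ik} + \varepsilon_i,
        \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \\
  \pi(\beta_j \mid \lambda_\beta) &= \frac{\lambda_\beta}{2} \exp\!\left(-\lambda_\beta \lvert \beta_j \rvert\right), \\
  \pi(\gamma_k \mid \lambda_\gamma) &= \frac{\lambda_\gamma}{2} \exp\!\left(-\lambda_\gamma \lvert \gamma_k \rvert\right).
\end{align}
```

The double-exponential (Laplace) priors concentrate mass near zero, which is what shrinks the coefficients of unassociated haplotypes and PCs toward zero while leaving large effects comparatively untouched.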

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10794901/pdf/

Citations: 0

Mediation analysis method review of high throughput data.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-11-29 eCollection Date: 2023-01-01 DOI: 10.1515/sagmb-2023-0031

Qiang Han, Yu Wang, Na Sun, Jiadong Chu, Wei Hu, Yueping Shen

High-throughput technologies have made high-dimensional settings increasingly common, providing opportunities for the development of high-dimensional mediation methods. We aimed to provide useful guidance for researchers using high-dimensional mediation analysis, and ideas for biostatisticians developing it, by summarizing and discussing recent advances in the field. Mediation analysis still faces many challenges when single and multiple mediation analyses are extended to high-dimensional settings. High-dimensional mediation methods attempt to address these issues by screening true mediators, estimating mediation effects through variable selection, reducing the mediator dimension to resolve correlations between variables, and testing mediation effects with composite null hypothesis tests. Although these problems have been solved to some extent, some challenges remain. First, correlations between mediators are rarely considered when variables are selected for mediation. Second, dimension reduction without incorporating prior biological knowledge makes the results difficult to interpret. In addition, a method of sensitivity analysis for the strict sequential ignorability assumption in high-dimensional mediation analysis is still lacking. Analysts need to consider the applicability of each method before using it, while biostatisticians could consider extensions and improvements to the methodology.
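
For readers unfamiliar with the setup, the standard linear multiple-mediator model behind most of the methods surveyed can be sketched as follows. The notation ($X$ exposure, $M_k$ mediators, $Y$ outcome, coefficients $a_k$, $b_k$, $c'$) is assumed here for illustration and is not taken from the review itself.

```latex
% Linear multiple-mediator model (assumed notation)
\begin{align}
  M_k &= a_k X + e_k, \qquad k = 1, \dots, p, \\
  Y   &= c' X + \sum_{k=1}^{p} b_k M_k + \varepsilon, \\
  \text{indirect effect through } M_k &= a_k b_k, \\
  H_{0k} &: a_k b_k = 0
  \;\Longleftrightarrow\;
  (a_k = 0,\, b_k \neq 0) \;\text{or}\; (a_k \neq 0,\, b_k = 0) \;\text{or}\; (a_k = 0,\, b_k = 0).
\end{align}
```

The null hypothesis is composite because it is a union of three cases, which is why dedicated composite-null tests are needed rather than a single standard test on the product $a_k b_k$.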

Citations: 0

Patterns of differential expression by association in omic data using a new measure based on ensemble learning.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-11-23 eCollection Date: 2023-01-01 DOI: 10.1515/sagmb-2023-0009

Jorge M Arevalillo, Raquel Martin-Arevalillo

The ongoing development of high-throughput technologies allows the simultaneous monitoring of expression levels for hundreds or thousands of biological inputs, leading to the proliferation of what have been termed omic data sources. One relevant issue when analyzing such data sources is the detection of differential expression across two experimental conditions, clinical statuses, or classes of a biological outcome. While many univariate data analysis approaches have been developed to address the issue, strategies for assessing interaction patterns of differential expression are scarce in the literature and have been limited to ad hoc solutions. This paper contributes to the problem by exploiting an ensemble learning algorithm, random forests, to propose a measure that assesses the differential expression explained by the interaction of the omic variables, so that subtle biological patterns may be uncovered. The out-of-bag error rate, which is an estimate of the predictive accuracy of a random forests classifier, is used as a by-product to propose a new measure that assesses interaction patterns of differential expression. Its performance is studied in synthetic scenarios, and it is also applied to real studies on SARS-CoV-2 and colon cancer data, where it uncovers associations that remain undetected by other methods. Our proposal aims to provide a novel approach that may help experts in the biomedical and life sciences unravel insightful interaction patterns and decipher the molecular mechanisms underlying biological and clinical outcomes.

Citations: 0

Integrated regulatory and metabolic networks of the tumor microenvironment for therapeutic target prioritization.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-11-21 eCollection Date: 2023-01-01 DOI: 10.1515/sagmb-2022-0054

Tiange Shi, Han Yu, Rachael Hageman Blair

Translation of genomic discovery, such as single-cell sequencing data, to clinical decisions remains a longstanding bottleneck in the field. Meanwhile, computational systems biological models, such as cellular metabolism models and cell signaling pathways, have emerged as powerful approaches to provide efficient predictions in metabolites and gene expression levels, respectively. However, there has been limited research on the integration between these two models. This work develops a methodology for integrating computational models of probabilistic gene regulatory networks with a constraint-based metabolism model. By using probabilistic reasoning with Bayesian Networks, we aim to predict cell-specific changes under different interventions, which are embedded into the constraint-based models of metabolism. Applications to single-cell sequencing data of glioblastoma brain tumors generate predictions about the effects of pharmaceutical interventions on the regulatory network and downstream metabolisms in different cell types from the tumor microenvironment. The model presents possible insights into treatments that could potentially suppress anaerobic metabolism in malignant cells with minimal impact on other cell types' metabolism. The proposed integrated model can guide therapeutic target prioritization, the formulation of combination therapies, and future drug discovery. This model integration framework is also generalizable to other applications, such as different cell types, organisms, and diseases.
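
The abstract does not say which constraint-based formulation is used. As background only, the most common one, flux balance analysis, is sketched below; the symbols ($S$ stoichiometric matrix, $v$ flux vector, $c$ objective weights, bounds $v^{\min}$, $v^{\max}$) are standard but are assumptions with respect to this paper.

```latex
% Flux balance analysis, a common constraint-based metabolism model (background sketch)
\begin{align}
  \max_{v} \quad & c^{\top} v \\
  \text{subject to} \quad & S v = 0, \\
  & v^{\min} \le v \le v^{\max}.
\end{align}
```

In a formulation of this kind, one plausible way to embed regulatory predictions is to tighten the flux bounds of reactions whose enzymes are predicted to change under an intervention; whether the authors do exactly this is not stated in the abstract.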

Citations: 0

Randomized singular value decomposition for integrative subtype analysis of 'omics data' using non-negative matrix factorization.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-11-09 eCollection Date: 2023-01-01 DOI: 10.1515/sagmb-2022-0047

Yonghui Ni, Jianghua He, Prabhakar Chalise

Integration of multiple 'omics datasets for differentiating cancer subtypes is a powerful technique that leverages the consistent and complementary information across multi-omics data. Matrix factorization is a common technique used in integrative clustering for identifying latent subtype structure across multi-omics data. The high dimensionality of omics data and long computation times have been common challenges for clustering methods. To address these challenges, we propose randomized singular value decomposition (RSVD) for integrative clustering using non-negative matrix factorization: intNMF-rsvd. The method uses RSVD to reduce dimensionality by projecting the data into an eigenvector space with a user-specified lower rank. Clustering analysis is then carried out by estimating a common basis matrix across the projected multi-omics datasets. The performance of the proposed method was assessed using simulated datasets and compared with six state-of-the-art integrative clustering methods using real-life datasets from The Cancer Genome Atlas Study. intNMF-rsvd was found to work efficiently and competitively compared with standard intNMF and other multi-omics clustering methods. Most importantly, intNMF-rsvd can handle a large number of features and significantly reduces the computation time. The identified subtypes can be used in further clinical association studies to understand the etiology of the disease.
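
The RSVD step mentioned above can be summarized by the widely used randomized range-finder scheme sketched below. The notation ($A$ data matrix, target rank $k$, oversampling parameter $p$, Gaussian test matrix $\Omega$) is assumed for illustration; the abstract does not state which RSVD variant intNMF-rsvd implements.

```latex
% Basic randomized SVD (generic sketch, assumed notation)
\begin{align}
  &A \in \mathbb{R}^{m \times n}, \qquad
   \Omega \in \mathbb{R}^{n \times (k+p)}, \quad \Omega_{ij} \sim \mathcal{N}(0, 1), \\
  &Y = A \Omega, \qquad Y = Q R \quad (\text{thin QR, } Q \in \mathbb{R}^{m \times (k+p)}), \\
  &B = Q^{\top} A, \qquad B = \tilde{U} \Sigma V^{\top} \quad (\text{SVD of the small matrix } B), \\
  &A \approx (Q \tilde{U}) \, \Sigma \, V^{\top}, \quad \text{truncated to the leading } k \text{ components}.
\end{align}
```

The expensive SVD is applied only to the $(k+p) \times n$ matrix $B$ rather than to $A$ itself, which is the source of the reduction in computation time when $k + p \ll \min(m, n)$.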

Citations: 0

CAT PETR: a graphical user interface for differential analysis of phosphorylation and expression data.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-08-21 eCollection Date: 2023-01-01 DOI: 10.1515/sagmb-2023-0017

Keegan Flanagan, Steven Pelech, Yossef Av-Gay, Khanh Dao Duc

Antibody microarrays provide a powerful, high-throughput tool for monitoring global changes in cellular response to perturbation or genetic manipulation. However, while collecting such data has become increasingly accessible, a lack of dedicated computational tools has limited their analysis. Here we present CAT PETR, a user-friendly web application for the differential analysis of expression and phosphorylation data collected via antibody microarrays. Our application addresses the limitations of other GUI-based tools by providing various data input options and visualizations. To illustrate its capabilities on real data, we show that CAT PETR both replicates previous findings and reveals additional insights through its advanced visualization and statistical options.

Citations: 0

Improving the accuracy and internal consistency of regression-based clustering of high-dimensional datasets.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-07-25 eCollection Date: 2023-01-01 DOI: 10.1515/sagmb-2022-0031

Bo Zhang, Jianghua He, Jinxiang Hu, Prabhakar Chalise, Devin C Koestler

Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is in part due to inherent limitations associated with the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods (i.e. Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minmax Convex Penalty (MCP), and Adaptive-Lasso) within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso within the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.
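
For reference, the Adaptive-Lasso penalty that performed best in this comparison has the standard form sketched below. The notation (initial estimate $\tilde{\beta}$, exponent $\gamma$, tuning parameter $\lambda$) is the usual textbook one and is assumed here; how the penalty is wired into each mixture component of CSMR is not described in the abstract.

```latex
% Adaptive Lasso (standard form, assumed notation)
\begin{align}
  \hat{\beta} &= \arg\min_{\beta} \;
     \lVert y - X \beta \rVert_2^2 + \lambda \sum_{j=1}^{p} w_j \lvert \beta_j \rvert, \\
  w_j &= \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}}, \qquad \gamma > 0,
\end{align}
```

where $\tilde{\beta}$ is an initial consistent estimate (for example from ridge or ordinary least squares). Features with small initial coefficients receive large weights and are penalized more heavily, which tends to give more stable feature selection than the plain Lasso.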

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10891458/pdf/

Citations: 0

A fast and efficient approach for gene-based association studies of ordinal phenotypes.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-01-01 DOI: 10.1515/sagmb-2021-0068

Nanxing Li, Lili Chen, Yajing Zhou, Qianran Wei

Many human disease conditions need to be measured by ordinal phenotypes, so the analysis of ordinal phenotypes is valuable in genome-wide association studies (GWAS). However, existing association methods for dichotomous or quantitative phenotypes are not appropriate for ordinal phenotypes. Therefore, based on an aggregated Cauchy association test, we propose a fast and efficient method to test the association between genetic variants and an ordinal phenotype. To enrich the association signals of rare variants, we first use the burden method to aggregate them. We then separately test the significance of the aggregated rare variants and of the remaining common variants. Finally, a combination of the transformed variant-level P values is taken as the test statistic, which approximately follows a Cauchy distribution under the null hypothesis. Extensive simulation studies and an analysis of GAW19 data show that the proposed method is powerful and computationally fast as a gene-based method. In particular, our method performs better when the proportion of causal variants in a gene is extremely low.
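
The final combination step can be illustrated with the generic Cauchy combination rule: each P value is mapped to a Cauchy-distributed score with the tangent transform, the scores are averaged with weights, and the result is converted back to a P value. The sketch below is a minimal illustration under that assumption; the function name `cauchy_combination` and the default equal weights are hypothetical, and the paper's own weighting scheme and ordinal-regression tests are not reproduced here.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine P values with the Cauchy combination rule (illustrative sketch).

    Each P value p is transformed to tan((0.5 - p) * pi), which is standard
    Cauchy under the null. The weighted average of these scores is again
    approximately standard Cauchy, so its upper tail gives the combined P value.
    """
    if not p_values:
        raise ValueError("at least one P value is required")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("P values must lie strictly between 0 and 1")
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values) or any(w <= 0 for w in weights):
        raise ValueError("weights must be positive and match the P values")

    total = sum(weights)
    # Weighted average of the Cauchy-transformed P values.
    statistic = sum(
        (w / total) * math.tan((0.5 - p) * math.pi)
        for p, w in zip(p_values, weights)
    )
    # Upper-tail probability of a standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    # One burden-test P value for aggregated rare variants plus two
    # common-variant P values (made-up numbers for illustration).
    print(cauchy_combination([0.01, 0.20, 0.65]))
```

Because the combined P value has a closed form, no permutation or resampling is needed, which is one reason approaches of this kind are computationally fast at the gene level.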

Citations: 0

A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction.

IF 0.9 Zone 4 Mathematics Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-01-01 DOI: 10.1515/sagmb-2022-0057

Lavkush Sharma, Akshay Deepak, Ashish Ranjan, Gopalakrishnan Krishnasamy

Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNNs are good at capturing short-term relationships, GRUs and LSTMs can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN + BiGRU-Attention based model with protein language model embedding that effectively combines the output of the CNN with the output of the BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. Relative to the state-of-the-art model SDN2GO, the proposed hybrid model improves the Fmax value on the human dataset by 1.9 % for the cellular component prediction task, 3.8 % for the molecular function prediction task, and 0.6 % for the biological process prediction task; on the yeast dataset, the corresponding improvements are 2.4 %, 5.2 %, and 1.2 %.
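The abstract reports its results as Fmax values without defining the metric. The sketch below assumes the protein-centric definition commonly used in function-prediction benchmarks: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all annotated proteins, and Fmax is the largest resulting F-measure. The function name `fmax`, the threshold grid, and the toy GO term identifiers are hypothetical, and the paper's own evaluation protocol may differ in detail.

```python
def fmax(true_terms, predicted_scores, thresholds=None):
    """Protein-centric maximum F-measure over score thresholds.

    true_terms:       list of sets, the annotated terms of each protein.
    predicted_scores: list of dicts mapping term -> score in [0, 1].
    thresholds:       score cut-offs to scan (assumed grid 0.01 .. 1.00).
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    n_annotated = sum(1 for truth in true_terms if truth)
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for truth, scores in zip(true_terms, predicted_scores):
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            if truth:  # recall counts every annotated protein
                recall_sum += hits / len(truth)
        if not precisions or n_annotated == 0:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / n_annotated
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example with two proteins and made-up term identifiers.
truth = [{"GO:1", "GO:2"}, {"GO:3"}]
scores = [{"GO:1": 0.9, "GO:2": 0.4, "GO:4": 0.6}, {"GO:3": 0.8}]
toy_fmax = fmax(truth, scores)
```

Under this definition, the improvements quoted above are differences in the best achievable precision-recall trade-off across thresholds, not in accuracy at a single fixed cut-off.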



Citations: 0

CAT PETR: a graphical user interface for differential analysis of phosphorylation and expression data

IF 0.9 Zone 4 Mathematics Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-01-01 DOI: 10.1101/2023.01.04.522665

Keegan Flanagan, S. Pelech, Y. Av‐Gay, K. Dao Duc

Antibody microarray data provide a powerful, high-throughput tool for monitoring global changes in cellular response to perturbation or genetic manipulation. However, while collecting such data has become increasingly accessible, a lack of specific computational tools has limited their analysis. Here we present CAT PETR, a user-friendly web application for the differential analysis of expression and phosphorylation data collected via antibody microarrays. Our application addresses the limitations of other GUI-based tools by providing various data input options and visualizations. To illustrate its capabilities on real data, we show that CAT PETR both replicates previous findings and reveals additional insights, using its advanced visualization and statistical options.


Citations: 0