Pub Date : 2023-11-23eCollection Date: 2023-01-01DOI: 10.1515/sagmb-2023-0009
Jorge M Arevalillo, Raquel Martin-Arevalillo
The ongoing development of high-throughput technologies is allowing the simultaneous monitoring of the expression levels for hundreds or thousands of biological inputs with the proliferation of what has been coined as omic data sources. One relevant issue when analyzing such data sources is concerned with the detection of differential expression across two experimental conditions, clinical status or two classes of a biological outcome. While a great deal of univariate data analysis approaches have been developed to address the issue, strategies for assessing interaction patterns of differential expression are scarce in the literature and have been limited to ad hoc solutions. This paper contributes to the problem by exploiting the facilities of an ensemble learning algorithm like random forests to propose a measure that assesses the differential expression explained by the interaction of the omic variables so subtle biological patterns may be uncovered as a result. The out of bag error rate, which is an estimate of the predictive accuracy of a random forests classifier, is used as a by-product to propose a new measure that assesses interaction patterns of differential expression. Its performance is studied in synthetic scenarios and it is also applied to real studies on SARS-CoV-2 and colon cancer data where it uncovers associations that remain undetected by other methods. Our proposal is aimed at providing a novel approach that may help the experts in biomedical and life sciences to unravel insightful interaction patterns that may decipher the molecular mechanisms underlying biological and clinical outcomes.
{"title":"Patterns of differential expression by association in omic data using a new measure based on ensemble learning.","authors":"Jorge M Arevalillo, Raquel Martin-Arevalillo","doi":"10.1515/sagmb-2023-0009","DOIUrl":"10.1515/sagmb-2023-0009","url":null,"abstract":"<p><p>The ongoing development of high-throughput technologies is allowing the simultaneous monitoring of the expression levels for hundreds or thousands of biological inputs with the proliferation of what has been coined as omic data sources. One relevant issue when analyzing such data sources is concerned with the detection of differential expression across two experimental conditions, clinical status or two classes of a biological outcome. While a great deal of univariate data analysis approaches have been developed to address the issue, strategies for assessing interaction patterns of differential expression are scarce in the literature and have been limited to ad hoc solutions. This paper contributes to the problem by exploiting the facilities of an ensemble learning algorithm like random forests to propose a measure that assesses the differential expression explained by the interaction of the omic variables so subtle biological patterns may be uncovered as a result. The out of bag error rate, which is an estimate of the predictive accuracy of a random forests classifier, is used as a by-product to propose a new measure that assesses interaction patterns of differential expression. Its performance is studied in synthetic scenarios and it is also applied to real studies on SARS-CoV-2 and colon cancer data where it uncovers associations that remain undetected by other methods. Our proposal is aimed at providing a novel approach that may help the experts in biomedical and life sciences to unravel insightful interaction patterns that may decipher the molecular mechanisms underlying biological and clinical outcomes.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138292231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-11-21eCollection Date: 2023-01-01DOI: 10.1515/sagmb-2022-0054
Tiange Shi, Han Yu, Rachael Hageman Blair
Translation of genomic discovery, such as single-cell sequencing data, to clinical decisions remains a longstanding bottleneck in the field. Meanwhile, computational systems biological models, such as cellular metabolism models and cell signaling pathways, have emerged as powerful approaches to provide efficient predictions in metabolites and gene expression levels, respectively. However, there has been limited research on the integration between these two models. This work develops a methodology for integrating computational models of probabilistic gene regulatory networks with a constraint-based metabolism model. By using probabilistic reasoning with Bayesian Networks, we aim to predict cell-specific changes under different interventions, which are embedded into the constraint-based models of metabolism. Applications to single-cell sequencing data of glioblastoma brain tumors generate predictions about the effects of pharmaceutical interventions on the regulatory network and downstream metabolisms in different cell types from the tumor microenvironment. The model presents possible insights into treatments that could potentially suppress anaerobic metabolism in malignant cells with minimal impact on other cell types' metabolism. The proposed integrated model can guide therapeutic target prioritization, the formulation of combination therapies, and future drug discovery. This model integration framework is also generalizable to other applications, such as different cell types, organisms, and diseases.
{"title":"Integrated regulatory and metabolic networks of the tumor microenvironment for therapeutic target prioritization.","authors":"Tiange Shi, Han Yu, Rachael Hageman Blair","doi":"10.1515/sagmb-2022-0054","DOIUrl":"10.1515/sagmb-2022-0054","url":null,"abstract":"<p><p>Translation of genomic discovery, such as single-cell sequencing data, to clinical decisions remains a longstanding bottleneck in the field. Meanwhile, computational systems biological models, such as cellular metabolism models and cell signaling pathways, have emerged as powerful approaches to provide efficient predictions in metabolites and gene expression levels, respectively. However, there has been limited research on the integration between these two models. This work develops a methodology for integrating computational models of probabilistic gene regulatory networks with a constraint-based metabolism model. By using probabilistic reasoning with Bayesian Networks, we aim to predict cell-specific changes under different interventions, which are embedded into the constraint-based models of metabolism. Applications to single-cell sequencing data of glioblastoma brain tumors generate predictions about the effects of pharmaceutical interventions on the regulatory network and downstream metabolisms in different cell types from the tumor microenvironment. The model presents possible insights into treatments that could potentially suppress anaerobic metabolism in malignant cells with minimal impact on other cell types' metabolism. The proposed integrated model can guide therapeutic target prioritization, the formulation of combination therapies, and future drug discovery. This model integration framework is also generalizable to other applications, such as different cell types, organisms, and diseases.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138292230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-11-09eCollection Date: 2023-01-01DOI: 10.1515/sagmb-2022-0047
Yonghui Ni, Jianghua He, Prabhakar Chalise
Integration of multiple 'omics datasets for differentiating cancer subtypes is a powerful technic that leverages the consistent and complementary information across multi-omics data. Matrix factorization is a common technique used in integrative clustering for identifying latent subtype structure across multi-omics data. High dimensionality of the omics data and long computation time have been common challenges of clustering methods. In order to address the challenges, we propose randomized singular value decomposition (RSVD) for integrative clustering using Non-negative Matrix Factorization: intNMF-rsvd. The method utilizes RSVD to reduce the dimensionality by projecting the data into eigen vector space with user specified lower rank. Then, clustering analysis is carried out by estimating common basis matrix across the projected multi-omics datasets. The performance of the proposed method was assessed using the simulated datasets and compared with six state-of-the-art integrative clustering methods using real-life datasets from The Cancer Genome Atlas Study. intNMF-rsvd was found working efficiently and competitively as compared to standard intNMF and other multi-omics clustering methods. Most importantly, intNMF-rsvd can handle large number of features and significantly reduce the computation time. The identified subtypes can be utilized for further clinical association studies to understand the etiology of the disease.
{"title":"Randomized singular value decomposition for integrative subtype analysis of 'omics data' using non-negative matrix factorization.","authors":"Yonghui Ni, Jianghua He, Prabhakar Chalise","doi":"10.1515/sagmb-2022-0047","DOIUrl":"10.1515/sagmb-2022-0047","url":null,"abstract":"<p><p>Integration of multiple 'omics datasets for differentiating cancer subtypes is a powerful technic that leverages the consistent and complementary information across multi-omics data. Matrix factorization is a common technique used in integrative clustering for identifying latent subtype structure across multi-omics data. High dimensionality of the omics data and long computation time have been common challenges of clustering methods. In order to address the challenges, we propose randomized singular value decomposition (RSVD) for integrative clustering using Non-negative Matrix Factorization: <i>intNMF-rsvd</i>. The method utilizes RSVD to reduce the dimensionality by projecting the data into eigen vector space with user specified lower rank. Then, clustering analysis is carried out by estimating common basis matrix across the projected multi-omics datasets. The performance of the proposed method was assessed using the simulated datasets and compared with six state-of-the-art integrative clustering methods using real-life datasets from The Cancer Genome Atlas Study. <i>intNMF-rsvd</i> was found working efficiently and competitively as compared to standard intNMF and other multi-omics clustering methods. Most importantly, <i>intNMF-rsvd</i> can handle large number of features and significantly reduce the computation time. The identified subtypes can be utilized for further clinical association studies to understand the etiology of the disease.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71488028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-21eCollection Date: 2023-01-01DOI: 10.1515/sagmb-2023-0017
Keegan Flanagan, Steven Pelech, Yossef Av-Gay, Khanh Dao Duc
Antibody microarray data provides a powerful and high-throughput tool to monitor global changes in cellular response to perturbation or genetic manipulation. However, while collecting such data has become increasingly accessible, a lack of specific computational tools has made their analysis limited. Here we present CAT PETR, a user friendly web application for the differential analysis of expression and phosphorylation data collected via antibody microarrays. Our application addresses the limitations of other GUI based tools by providing various data input options and visualizations. To illustrate its capabilities on real data, we show that CAT PETR both replicates previous findings, and reveals additional insights, using its advanced visualization and statistical options.
抗体微阵列数据提供了一种强大的高通量工具,可用于监测细胞对扰动或遗传操作反应的整体变化。然而,虽然收集此类数据的途径越来越多,但由于缺乏特定的计算工具,对它们的分析受到了限制。我们在此介绍 CAT PETR,这是一款用户友好型网络应用程序,用于对通过抗体微阵列收集的表达和磷酸化数据进行差异分析。我们的应用程序通过提供各种数据输入选项和可视化效果,解决了其他基于图形用户界面的工具的局限性。为了说明其在真实数据上的能力,我们展示了 CAT PETR 利用其先进的可视化和统计选项,既复制了以前的发现,又揭示了更多的见解。
{"title":"CAT PETR: a graphical user interface for differential analysis of phosphorylation and expression data.","authors":"Keegan Flanagan, Steven Pelech, Yossef Av-Gay, Khanh Dao Duc","doi":"10.1515/sagmb-2023-0017","DOIUrl":"10.1515/sagmb-2023-0017","url":null,"abstract":"<p><p>Antibody microarray data provides a powerful and high-throughput tool to monitor global changes in cellular response to perturbation or genetic manipulation. However, while collecting such data has become increasingly accessible, a lack of specific computational tools has made their analysis limited. Here we present CAT PETR, a user friendly web application for the differential analysis of expression and phosphorylation data collected via antibody microarrays. Our application addresses the limitations of other GUI based tools by providing various data input options and visualizations. To illustrate its capabilities on real data, we show that CAT PETR both replicates previous findings, and reveals additional insights, using its advanced visualization and statistical options.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10238016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-07-25eCollection Date: 2023-01-01DOI: 10.1515/sagmb-2022-0031
Bo Zhang, Jianghua He, Jinxiang Hu, Prabhakar Chalise, Devin C Koestler
Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is in part due to inherent limitations associated with the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods (i.e. Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minmax Convex Penalty (MCP), and Adaptive-Lasso) within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso within the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.
{"title":"Improving the accuracy and internal consistency of regression-based clustering of high-dimensional datasets.","authors":"Bo Zhang, Jianghua He, Jinxiang Hu, Prabhakar Chalise, Devin C Koestler","doi":"10.1515/sagmb-2022-0031","DOIUrl":"10.1515/sagmb-2022-0031","url":null,"abstract":"<p><p>Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is in part due to inherent limitations associated with the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods (i.e. Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minmax Convex Penalty (MCP), and Adaptive-Lasso) within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso within the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10891458/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10227703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many human disease conditions need to be measured by ordinal phenotypes, so analysis of ordinal phenotypes is valuable in genome-wide association studies (GWAS). However, existing association methods for dichotomous or quantitative phenotypes are not appropriate to ordinal phenotypes. Therefore, based on an aggregated Cauchy association test, we propose a fast and efficient association method to test the association between genetic variants and an ordinal phenotype. To enrich association signals of rare variants, we first use the burden method to aggregate rare variants. Then we respectively test the significance of the aggregated rare variants and other common variants. Finally, the combination of transformed variant-level P values is taken as test statistic, that approximately follows Cauchy distribution under the null hypothesis. Extensive simulation studies and analysis of GAW19 show that our proposed method is powerful and computationally fast as a gene-based method. Especially, in the presence of an extremely low proportion of causal variants in a gene, our method has better performance.
{"title":"A fast and efficient approach for gene-based association studies of ordinal phenotypes.","authors":"Nanxing Li, Lili Chen, Yajing Zhou, Qianran Wei","doi":"10.1515/sagmb-2021-0068","DOIUrl":"https://doi.org/10.1515/sagmb-2021-0068","url":null,"abstract":"<p><p>Many human disease conditions need to be measured by ordinal phenotypes, so analysis of ordinal phenotypes is valuable in genome-wide association studies (GWAS). However, existing association methods for dichotomous or quantitative phenotypes are not appropriate to ordinal phenotypes. Therefore, based on an aggregated Cauchy association test, we propose a fast and efficient association method to test the association between genetic variants and an ordinal phenotype. To enrich association signals of rare variants, we first use the burden method to aggregate rare variants. Then we respectively test the significance of the aggregated rare variants and other common variants. Finally, the combination of transformed variant-level <i>P</i> values is taken as test statistic, that approximately follows Cauchy distribution under the null hypothesis. Extensive simulation studies and analysis of GAW19 show that our proposed method is powerful and computationally fast as a gene-based method. Especially, in the presence of an extremely low proportion of causal variants in a gene, our method has better performance.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10825661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein Language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years processing the protein sequences. In this paper, we propose a hybrid CNN + BiGRU - Attention based model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. The proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO for the cellular component prediction task by 1.9 %, for the molecular function prediction task by 3.8 % and for the biological process prediction task by 0.6 % for human dataset and for yeast dataset the cellular component prediction task by 2.4 %, for the molecular function prediction task by 5.2 % and for the biological process prediction task by 1.2 %.
{"title":"A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction.","authors":"Lavkush Sharma, Akshay Deepak, Ashish Ranjan, Gopalakrishnan Krishnasamy","doi":"10.1515/sagmb-2022-0057","DOIUrl":"https://doi.org/10.1515/sagmb-2022-0057","url":null,"abstract":"<p><p>Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein Language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years processing the protein sequences. In this paper, we propose a hybrid CNN + BiGRU - Attention based model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. The proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO for the cellular component prediction task by 1.9 %, for the molecular function prediction task by 3.8 % and for the biological process prediction task by 0.6 % for human dataset and for yeast dataset the cellular component prediction task by 2.4 %, for the molecular function prediction task by 5.2 % and for the biological process prediction task by 1.2 %.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10229953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-01-01DOI: 10.1101/2023.01.04.522665
Keegan Flanagan, S. Pelech, Y. Av‐Gay, K. Dao Duc
Abstract Antibody microarray data provides a powerful and high-throughput tool to monitor global changes in cellular response to perturbation or genetic manipulation. However, while collecting such data has become increasingly accessible, a lack of specific computational tools has made their analysis limited. Here we present CAT PETR, a user friendly web application for the differential analysis of expression and phosphorylation data collected via antibody microarrays. Our application addresses the limitations of other GUI based tools by providing various data input options and visualizations. To illustrate its capabilities on real data, we show that CAT PETR both replicates previous findings, and reveals additional insights, using its advanced visualization and statistical options.
{"title":"CAT PETR: a graphical user interface for differential analysis of phosphorylation and expression data","authors":"Keegan Flanagan, S. Pelech, Y. Av‐Gay, K. Dao Duc","doi":"10.1101/2023.01.04.522665","DOIUrl":"https://doi.org/10.1101/2023.01.04.522665","url":null,"abstract":"Abstract Antibody microarray data provides a powerful and high-throughput tool to monitor global changes in cellular response to perturbation or genetic manipulation. However, while collecting such data has become increasingly accessible, a lack of specific computational tools has made their analysis limited. Here we present CAT PETR, a user friendly web application for the differential analysis of expression and phosphorylation data collected via antibody microarrays. Our application addresses the limitations of other GUI based tools by providing various data input options and visualizations. To illustrate its capabilities on real data, we show that CAT PETR both replicates previous findings, and reveals additional insights, using its advanced visualization and statistical options.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49512934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in the application of permutation tests in genomic studies is that an enormous number of permutations are often needed to obtain reliable estimates of very small p-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small p-values in permutation tests for paired and independent two-group genomic data, and our approaches leverage a novel framework for parameterizing the permutation sample spaces of those two types of data respectively using the Bernoulli and conditional Bernoulli distributions, combined with the cross-entropy method. The performance of our proposed algorithms is demonstrated through the application to two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies and comparisons to existing methods such as crude permutations and SAMC, and the results show that our approaches can achieve orders of magnitude of computational efficiency gains in estimating small p-values. Our approaches offer promising solutions for the improvement of computational efficiencies of existing permutation test procedures and the development of new testing methods using permutations in genomic data analysis.
{"title":"Accurate and fast small <i>p</i>-value estimation for permutation tests in high-throughput genomic data analysis with the cross-entropy method.","authors":"Yang Shi, Weiping Shi, Mengqiao Wang, Ji-Hyun Lee, Huining Kang, Hui Jiang","doi":"10.1515/sagmb-2021-0067","DOIUrl":"https://doi.org/10.1515/sagmb-2021-0067","url":null,"abstract":"<p><p>Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in the application of permutation tests in genomic studies is that an enormous number of permutations are often needed to obtain reliable estimates of very small <i>p</i>-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small <i>p</i>-values in permutation tests for paired and independent two-group genomic data, and our approaches leverage a novel framework for parameterizing the permutation sample spaces of those two types of data respectively using the Bernoulli and conditional Bernoulli distributions, combined with the cross-entropy method. The performance of our proposed algorithms is demonstrated through the application to two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies and comparisons to existing methods such as crude permutations and SAMC, and the results show that our approaches can achieve orders of magnitude of computational efficiency gains in estimating small <i>p</i>-values. Our approaches offer promising solutions for the improvement of computational efficiencies of existing permutation test procedures and the development of new testing methods using permutations in genomic data analysis.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10229904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
It is often of research interest to identify genes that satisfy a particular expression pattern across different conditions such as tissues, genotypes, etc. One common practice is to perform differential expression analysis for each condition separately and then take the intersection of differentially expressed (DE) genes or non-DE genes under each condition to obtain genes that satisfy a particular pattern. Such a method can lead to many false positives, especially when the desired gene expression pattern involves equivalent expression under one condition. In this paper, we apply a Bayesian partition model to identify genes of all desired patterns while simultaneously controlling their false discovery rates (FDRs). Our simulation studies show that the common practice fails to control group specific FDRs for patterns involving equivalent expression while the proposed Bayesian method simultaneously controls group specific FDRs at all settings studied. In addition, the proposed method is more powerful when the FDR of the common practice is under control for identifying patterns only involving DE genes. Our simulation studies also show that it is an inherently more challenging problem to identify patterns involving equivalent expression than patterns only involving differential expression. Therefore, larger sample sizes are required to obtain the same target power to identify the former types of patterns than the latter types of patterns.
{"title":"A Bayesian model to identify multiple expression patterns with simultaneous FDR control for a multi-factor RNA-seq experiment.","authors":"Yuanyuan Bian, Chong He, Jing Qiu","doi":"10.1515/sagmb-2022-0025","DOIUrl":"https://doi.org/10.1515/sagmb-2022-0025","url":null,"abstract":"<p><p>It is often of research interest to identify genes that satisfy a particular expression pattern across different conditions such as tissues, genotypes, etc. One common practice is to perform differential expression analysis for each condition separately and then take the intersection of differentially expressed (DE) genes or non-DE genes under each condition to obtain genes that satisfy a particular pattern. Such a method can lead to many false positives, especially when the desired gene expression pattern involves equivalent expression under one condition. In this paper, we apply a Bayesian partition model to identify genes of all desired patterns while simultaneously controlling their false discovery rates (FDRs). Our simulation studies show that the common practice fails to control group specific FDRs for patterns involving equivalent expression while the proposed Bayesian method simultaneously controls group specific FDRs at all settings studied. In addition, the proposed method is more powerful when the FDR of the common practice is under control for identifying patterns only involving DE genes. Our simulation studies also show that it is an inherently more challenging problem to identify patterns involving equivalent expression than patterns only involving differential expression. Therefore, larger sample sizes are required to obtain the same target power to identify the former types of patterns than the latter types of patterns.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9391532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}