Pub Date : 2022-10-10eCollection Date: 2022-01-01DOI: 10.1515/sagmb-2022-0003
Qing Xia, Jeffrey A Thompson, Devin C Koestler
Batch effect Reduction of mIcroarray data with Dependent samples usinGEmpirical Bayes (BRIDGE) is a recently developed statistical method to address the issue of batch effect correction in batch-confounded microarray studies with dependent samples. The key component of the BRIDGE methodology is the use of samples run as technical replicates in two or more batches, "bridging samples", to inform batch effect correction/attenuation. While previously published results indicate a relationship between the number of bridging samples, M, and the statistical power of downstream statistical testing on the batch-corrected data, there is of yet no formal statistical framework or user-friendly software, for estimating M to achieve a specific statistical power for hypothesis tests conducted on the batch-corrected data. To fill this gap, we developed pwrBRIDGE, a simulation-based approach to estimate the bridging sample size, M, in batch-confounded longitudinal microarray studies. To illustrate the use of pwrBRIDGE, we consider a hypothetical, longitudinal batch-confounded study whose goal is to identify Alzheimer's disease (AD) progression-associated genes from amnestic mild cognitive impairment (aMCI) to AD in human blood after a 5-year follow-up. pwrBRIDGE helps researchers design and plan batch-confounded microarray studies with dependent samples to avoid over- or under-powered studies.
{"title":"<i>pwrBRIDGE</i>: a user-friendly web application for power and sample size estimation in batch-confounded microarray studies with dependent samples.","authors":"Qing Xia, Jeffrey A Thompson, Devin C Koestler","doi":"10.1515/sagmb-2022-0003","DOIUrl":"https://doi.org/10.1515/sagmb-2022-0003","url":null,"abstract":"<p><p><u>B</u>atch effect <u>R</u>eduction of m<u>I</u>croarray data with <u>D</u>ependent samples usin<u>G</u> <u>E</u>mpirical Bayes (<i>BRIDGE</i>) is a recently developed statistical method to address the issue of batch effect correction in batch-confounded microarray studies with dependent samples. The key component of the <i>BRIDGE</i> methodology is the use of samples run as technical replicates in two or more batches, \"bridging samples\", to inform batch effect correction/attenuation. While previously published results indicate a relationship between the number of bridging samples, <i>M</i>, and the statistical power of downstream statistical testing on the batch-corrected data, there is of yet no formal statistical framework or user-friendly software, for estimating <i>M</i> to achieve a specific statistical power for hypothesis tests conducted on the batch-corrected data. To fill this gap, we developed <i>pwrBRIDGE</i>, a simulation-based approach to estimate the bridging sample size, <i>M</i>, in batch-confounded longitudinal microarray studies. To illustrate the use of <i>pwrBRIDGE</i>, we consider a hypothetical, longitudinal batch-confounded study whose goal is to identify Alzheimer's disease (AD) progression-associated genes from amnestic mild cognitive impairment (aMCI) to AD in human blood after a 5-year follow-up. <i>pwrBRIDGE</i> helps researchers design and plan batch-confounded microarray studies with dependent samples to avoid over- or under-powered studies.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9550194/pdf/sagmb-21-1-sagmb-2022-0003.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33519105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-08-02eCollection Date: 2022-01-01DOI: 10.1515/sagmb-2022-0015
Guoyu Wu, Yuchao Li
Correlation analysis is widely used in biological studies to infer molecular relationships within biological networks. Recently, single-cell analysis has drawn tremendous interests, for its ability to obtain high-resolution molecular phenotypes. It turns out that there is little overlap of co-expressed genes identified in single-cell level investigations with that of population level investigations. However, the nature of the relationship of correlations between single-cell and population levels remains unclear. In this manuscript, we aimed to unveil the origin of the differences between the correlation coefficients at the single-cell level and that at the population level, and bridge the gap between them. Through developing formulations to link correlations at the single-cell and the population level, we illustrated that aggregated correlations could be stronger, weaker or equal to the corresponding individual correlations, depending on the variations and the correlations within the population. When the correlation within the population is weaker than the individual correlation, the aggregated correlation is stronger than the corresponding individual correlation. Besides, our data indicated that aggregated correlation is more likely to be stronger than the corresponding individual correlation, and it was rare to find gene-pairs exclusively strongly correlated at the single-cell level. Through a bottom-up approach to model interactions between molecules in a signaling cascade or a multi-regulator-controlled gene expression, we surprisingly found that the existence of interaction between two components could not be excluded simply based on their low correlation coefficients, suggesting a reconsideration of connectivity within biological networks which was derived solely from correlation analysis. We also investigated the impact of technical random measurement errors on the correlation coefficients for the single-cell level and the population level. The results indicate that the aggregated correlation is relatively robust and less affected. Because of the heterogeneity among single cells, correlation coefficients calculated based on data of the single-cell level might be different from that of the population level. Depending on the specific question we are asking, proper sampling and normalization procedure should be done before we draw any conclusions.
{"title":"Distinct characteristics of correlation analysis at the single-cell and the population level.","authors":"Guoyu Wu, Yuchao Li","doi":"10.1515/sagmb-2022-0015","DOIUrl":"10.1515/sagmb-2022-0015","url":null,"abstract":"<p><p>Correlation analysis is widely used in biological studies to infer molecular relationships within biological networks. Recently, single-cell analysis has drawn tremendous interests, for its ability to obtain high-resolution molecular phenotypes. It turns out that there is little overlap of co-expressed genes identified in single-cell level investigations with that of population level investigations. However, the nature of the relationship of correlations between single-cell and population levels remains unclear. In this manuscript, we aimed to unveil the origin of the differences between the correlation coefficients at the single-cell level and that at the population level, and bridge the gap between them. Through developing formulations to link correlations at the single-cell and the population level, we illustrated that aggregated correlations could be stronger, weaker or equal to the corresponding individual correlations, depending on the variations and the correlations within the population. When the correlation within the population is weaker than the individual correlation, the aggregated correlation is stronger than the corresponding individual correlation. Besides, our data indicated that aggregated correlation is more likely to be stronger than the corresponding individual correlation, and it was rare to find gene-pairs exclusively strongly correlated at the single-cell level. Through a bottom-up approach to model interactions between molecules in a signaling cascade or a multi-regulator-controlled gene expression, we surprisingly found that the existence of interaction between two components could not be excluded simply based on their low correlation coefficients, suggesting a reconsideration of connectivity within biological networks which was derived solely from correlation analysis. We also investigated the impact of technical random measurement errors on the correlation coefficients for the single-cell level and the population level. The results indicate that the aggregated correlation is relatively robust and less affected. Because of the heterogeneity among single cells, correlation coefficients calculated based on data of the single-cell level might be different from that of the population level. Depending on the specific question we are asking, proper sampling and normalization procedure should be done before we draw any conclusions.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2022-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40578441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gene selection is one of the key steps for gene expression data analysis. An SVM-based ensemble feature selection method is proposed in this paper. Firstly, the method builds many subsets by using Monte Carlo sampling. Secondly, ranking all the features on each of the subsets and integrating them to obtain a final ranking list. Finally, the optimum feature set is determined by a backward feature elimination strategy. This method is applied to the analysis of 4 public datasets: the Leukemia, Prostate, Colorectal, and SMK_CAN, resulting 7, 10, 13, and 32 features. The AUC obtained from independent test sets are 0.9867, 0.9796, 0.9571, and 0.9575, respectively. These results indicate that the features selected by the proposed method can improve sample classification accuracy, and thus be effective for gene selection from gene expression data.
{"title":"Use of SVM-based ensemble feature selection method for gene expression data analysis.","authors":"Shizhi Zhang, Mingjin Zhang","doi":"10.1515/sagmb-2022-0002","DOIUrl":"https://doi.org/10.1515/sagmb-2022-0002","url":null,"abstract":"<p><p>Gene selection is one of the key steps for gene expression data analysis. An SVM-based ensemble feature selection method is proposed in this paper. Firstly, the method builds many subsets by using Monte Carlo sampling. Secondly, ranking all the features on each of the subsets and integrating them to obtain a final ranking list. Finally, the optimum feature set is determined by a backward feature elimination strategy. This method is applied to the analysis of 4 public datasets: the Leukemia, Prostate, Colorectal, and SMK_CAN, resulting 7, 10, 13, and 32 features. The AUC obtained from independent test sets are 0.9867, 0.9796, 0.9571, and 0.9575, respectively. These results indicate that the features selected by the proposed method can improve sample classification accuracy, and thus be effective for gene selection from gene expression data.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2022-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40515868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to the advancement of genome sequencing techniques, a great stride has been made in exome sequencing such that the association study between disease and genetic variants has become feasible. Some powerful and well-known association tests have been proposed to test the association between a group of genes and the disease of interest. However, some challenges still remain, in particular, many factors can affect the performance of testing power, e.g., the sample size, the number of causal and non-causal variants, and direction of the effect of causal variants. Recently, a powerful test, called TREM , is derived based on a random effects model. TREM has the advantages of being less sensitive to the inclusion of non-causal rare variants or low effect common variants or the presence of missing genotypes. However, the testing power of TREM can be low when a portion of causal variants has effects in opposite directions. To improve the drawback of TREM , we propose a novel test, called TROB , which keeps the advantages of TREM and is more robust than TREM in terms of having adequate power in the case of variants with opposite directions of effect. Simulation results show that TROB has a stable type I error rate and outperforms TREM when the proportion of risk variants decreases to a certain level and its advantage over TREM increases as the proportion decreases. Furthermore, TROB outperforms several other competing tests in most scenarios. The proposed methodology is illustrated using the Shanghai Breast Cancer Study.
{"title":"A robust association test with multiple genetic variants and covariates.","authors":"Jen-Yu Lee, Pao-Sheng Shen, Kuang-Fu Cheng","doi":"10.1515/sagmb-2021-0029","DOIUrl":"https://doi.org/10.1515/sagmb-2021-0029","url":null,"abstract":"<p><p>Due to the advancement of genome sequencing techniques, a great stride has been made in exome sequencing such that the association study between disease and genetic variants has become feasible. Some powerful and well-known association tests have been proposed to test the association between a group of genes and the disease of interest. However, some challenges still remain, in particular, many factors can affect the performance of testing power, e.g., the sample size, the number of causal and non-causal variants, and direction of the effect of causal variants. Recently, a powerful test, called <i>T</i><sub><i>REM</i></sub> , is derived based on a random effects model. <i>T</i><sub><i>REM</i></sub> has the advantages of being less sensitive to the inclusion of non-causal rare variants or low effect common variants or the presence of missing genotypes. However, the testing power of <i>T</i><sub><i>REM</i></sub> can be low when a portion of causal variants has effects in opposite directions. To improve the drawback of <i>T</i><sub><i>REM</i></sub> , we propose a novel test, called <i>T</i><sub><i>ROB</i></sub> , which keeps the advantages of <i>T</i><sub><i>REM</i></sub> and is more robust than <i>T</i><sub><i>REM</i></sub> in terms of having adequate power in the case of variants with opposite directions of effect. Simulation results show that <i>T</i><sub><i>ROB</i></sub> has a stable type I error rate and outperforms <i>T</i><sub><i>REM</i></sub> when the proportion of risk variants decreases to a certain level and its advantage over <i>T</i><sub><i>REM</i></sub> increases as the proportion decreases. Furthermore, <i>T</i><sub><i>ROB</i></sub> outperforms several other competing tests in most scenarios. The proposed methodology is illustrated using the Shanghai Breast Cancer Study.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40515867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RNA-protein interactions have long being recognised as crucial regulators of gene expression. Recently, the development of scalable experimental techniques to measure these interactions has revolutionised the field, leading to the production of large-scale datasets which offer both opportunities and challenges for machine learning techniques. In this brief note, we will discuss some of the major stumbling blocks towards the use of machine learning in computational RNA biology, focusing specifically on the problem of predicting RNA-protein interactions from next-generation sequencing data.
{"title":"Challenges for machine learning in RNA-protein interaction prediction.","authors":"Viplove Arora, Guido Sanguinetti","doi":"10.1515/sagmb-2021-0087","DOIUrl":"https://doi.org/10.1515/sagmb-2021-0087","url":null,"abstract":"<p><p>RNA-protein interactions have long being recognised as crucial regulators of gene expression. Recently, the development of scalable experimental techniques to measure these interactions has revolutionised the field, leading to the production of large-scale datasets which offer both opportunities and challenges for machine learning techniques. In this brief note, we will discuss some of the major stumbling blocks towards the use of machine learning in computational RNA biology, focusing specifically on the problem of predicting RNA-protein interactions from next-generation sequencing data.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2022-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39963066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract We propose two new statistics, V ̂ $hat{V}$ and S ̂ $hat{S}$ , to disentangle the population history of related populations from SNP frequency data. If the populations are related by a tree, we show by theoretical means as well as by simulation that the new statistics are able to identify the root of a tree correctly, in contrast to standard statistics, such as the observed matrix of F 2-statistics (distances between pairs of populations). The statistic V ̂ $hat{V}$ is obtained by averaging over all SNPs (similar to standard statistics). Its expectation is the true covariance matrix of the observed population SNP frequencies, offset by a matrix with identical entries. In contrast, the statistic S ̂ $hat{S}$ is put in a Bayesian context and is obtained by averaging over pairs of SNPs, such that each SNP is only used once. It thus makes use of the joint distribution of pairs of SNPs. In addition, we provide a number of novel mathematical results about old and new statistics, and their mutual relationship.
{"title":"Estimation of the covariance structure from SNP allele frequencies","authors":"J. van Waaij, Zilong Li, C. Wiuf","doi":"10.1515/sagmb-2022-0005","DOIUrl":"https://doi.org/10.1515/sagmb-2022-0005","url":null,"abstract":"Abstract We propose two new statistics, V ̂ $hat{V}$ and S ̂ $hat{S}$ , to disentangle the population history of related populations from SNP frequency data. If the populations are related by a tree, we show by theoretical means as well as by simulation that the new statistics are able to identify the root of a tree correctly, in contrast to standard statistics, such as the observed matrix of F 2-statistics (distances between pairs of populations). The statistic V ̂ $hat{V}$ is obtained by averaging over all SNPs (similar to standard statistics). Its expectation is the true covariance matrix of the observed population SNP frequencies, offset by a matrix with identical entries. In contrast, the statistic S ̂ $hat{S}$ is put in a Bayesian context and is obtained by averaging over pairs of SNPs, such that each SNP is only used once. It thus makes use of the joint distribution of pairs of SNPs. In addition, we provide a number of novel mathematical results about old and new statistics, and their mutual relationship.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43452592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Liyanage, J. Estepp, K. Srivastava, Yun Li, Motomi Mori, G. Kang
Abstract Due to many advantages such as higher statistical power of detecting the association of genetic variants in human disorders and cost saving, extreme phenotype sequencing (EPS) is a rapidly emerging study design in epidemiological and clinical studies investigating how genetic variations associate with complex phenotypes. However, the investigation of the mediation effect of genetic variants on phenotypes is strictly restrictive under the EPS design because existing methods cannot well accommodate the non-random extreme tails sampling process incurred by the EPS design. In this paper, we propose a likelihood approach for testing the mediation effect of genetic variants through continuous and binary mediators on a continuous phenotype under the EPS design (GMEPS). Besides implementing in EPS design, it can also be utilized as a general mediation analysis procedure. Extensive simulations and two real data applications of a genome-wide association study of benign ethnic neutropenia under EPS design and a candidate-gene study of neurocognitive performance in patients with sickle cell disease under random sampling design demonstrate the superiority of GMEPS under the EPS design over widely used mediation analysis procedures, while demonstrating compatible capabilities under the general random sampling framework.
{"title":"GMEPS: a fast and efficient likelihood approach for genome-wide mediation analysis under extreme phenotype sequencing","authors":"J. Liyanage, J. Estepp, K. Srivastava, Yun Li, Motomi Mori, G. Kang","doi":"10.1515/sagmb-2021-0071","DOIUrl":"https://doi.org/10.1515/sagmb-2021-0071","url":null,"abstract":"Abstract Due to many advantages such as higher statistical power of detecting the association of genetic variants in human disorders and cost saving, extreme phenotype sequencing (EPS) is a rapidly emerging study design in epidemiological and clinical studies investigating how genetic variations associate with complex phenotypes. However, the investigation of the mediation effect of genetic variants on phenotypes is strictly restrictive under the EPS design because existing methods cannot well accommodate the non-random extreme tails sampling process incurred by the EPS design. In this paper, we propose a likelihood approach for testing the mediation effect of genetic variants through continuous and binary mediators on a continuous phenotype under the EPS design (GMEPS). Besides implementing in EPS design, it can also be utilized as a general mediation analysis procedure. Extensive simulations and two real data applications of a genome-wide association study of benign ethnic neutropenia under EPS design and a candidate-gene study of neurocognitive performance in patients with sickle cell disease under random sampling design demonstrate the superiority of GMEPS under the EPS design over widely used mediation analysis procedures, while demonstrating compatible capabilities under the general random sampling framework.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46601961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The inference of genetic regulatory networks (GRNs) reveals how genes interact with each other. A few genes can regulate many genes as targets to control cell functions. We present new methods based on the order-1 vector autoregression (VAR1) for inferring GRNs from gene expression time series. The methods use the automatic relevance determination (ARD) to incorporate the regulatory hub structure into the estimation of VAR1 in a Bayesian framework. Several sparse approximation schemes are applied to the estimated regression weights or VAR1 model to generate the sparse weighted adjacency matrices representing the inferred GRNs. We apply the proposed and several widespread reference methods to infer GRNs with up to 100 genes using simulated, DREAM4 in silico and experimental E. coli gene expression time series. We show that the proposed methods are efficient on simulated hub GRNs and scale-free GRNs using short time series simulated by VAR1s and outperform reference methods on small-scale DREAM4 in silico GRNs and E. coli GRNs. They can utilize the known major regulatory hubs to improve the performance on larger DREAM4 in silico GRNs and E. coli GRNs. The impact of nonlinear time series data on the performance of proposed methods is discussed.
{"title":"Inference of genetic regulatory networks with regulatory hubs using vector autoregressions and automatic relevance determination with model selections.","authors":"Chi-Kan Chen","doi":"10.1515/sagmb-2020-0054","DOIUrl":"https://doi.org/10.1515/sagmb-2020-0054","url":null,"abstract":"<p><p>The inference of genetic regulatory networks (GRNs) reveals how genes interact with each other. A few genes can regulate many genes as targets to control cell functions. We present new methods based on the order-1 vector autoregression (VAR1) for inferring GRNs from gene expression time series. The methods use the automatic relevance determination (ARD) to incorporate the regulatory hub structure into the estimation of VAR1 in a Bayesian framework. Several sparse approximation schemes are applied to the estimated regression weights or VAR1 model to generate the sparse weighted adjacency matrices representing the inferred GRNs. We apply the proposed and several widespread reference methods to infer GRNs with up to 100 genes using simulated, DREAM4 in silico and experimental <i>E. coli</i> gene expression time series. We show that the proposed methods are efficient on simulated hub GRNs and scale-free GRNs using short time series simulated by VAR1s and outperform reference methods on small-scale DREAM4 in silico GRNs and <i>E. coli</i> GRNs. They can utilize the known major regulatory hubs to improve the performance on larger DREAM4 in silico GRNs and <i>E. coli</i> GRNs. The impact of nonlinear time series data on the performance of proposed methods is discussed.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39646200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Batch-effects present challenges in the analysis of high-throughput molecular data and are particularly problematic in longitudinal studies when interest lies in identifying genes/features whose expression changes over time, but time is confounded with batch. While many methods to correct for batch-effects exist, most assume independence across samples; an assumption that is unlikely to hold in longitudinal microarray studies. We propose Batch effect Reduction of mIcroarray data with Dependent samples usinGEmpirical Bayes (BRIDGE), a three-step parametric empirical Bayes approach that leverages technical replicate samples profiled at multiple timepoints/batches, so-called "bridge samples", to inform batch-effect reduction/attenuation in longitudinal microarray studies. Extensive simulation studies and an analysis of a real biological data set were conducted to benchmark the performance of BRIDGE against both ComBat and longitudinalComBat. Our results demonstrate that while all methods perform well in facilitating accurate estimates of time effects, BRIDGE outperforms both ComBat and longitudinal ComBat in the removal of batch-effects in data sets with bridging samples, and perhaps as a result, was observed to have improved statistical power for detecting genes with a time effect. BRIDGE demonstrated competitive performance in batch effect reduction of confounded longitudinal microarray studies, both in simulated and a real data sets, and may serve as a useful preprocessing method for researchers conducting longitudinal microarray studies that include bridging samples.
{"title":"Batch effect reduction of microarray data with dependent samples using an empirical Bayes approach (BRIDGE).","authors":"Qing Xia, Jeffrey A Thompson, Devin C Koestler","doi":"10.1515/sagmb-2021-0020","DOIUrl":"10.1515/sagmb-2021-0020","url":null,"abstract":"<p><p>Batch-effects present challenges in the analysis of high-throughput molecular data and are particularly problematic in longitudinal studies when interest lies in identifying genes/features whose expression changes over time, but time is confounded with batch. While many methods to correct for batch-effects exist, most assume independence across samples; an assumption that is unlikely to hold in longitudinal microarray studies. We propose <u>B</u>atch effect <u>R</u>eduction of m<u>I</u>croarray data with <u>D</u>ependent samples usin<u>G</u><u>E</u>mpirical Bayes (<i>BRIDGE</i>), a three-step parametric empirical Bayes approach that leverages technical replicate samples profiled at multiple timepoints/batches, so-called \"bridge samples\", to inform batch-effect reduction/attenuation in longitudinal microarray studies. Extensive simulation studies and an analysis of a real biological data set were conducted to benchmark the performance of <i>BRIDGE</i> against both <i>ComBat</i> and <i>longitudinal</i><i>ComBat</i>. Our results demonstrate that while all methods perform well in facilitating accurate estimates of time effects, <i>BRIDGE</i> outperforms both <i>ComBat</i> and <i>longitudinal ComBat</i> in the removal of batch-effects in data sets with bridging samples, and perhaps as a result, was observed to have improved statistical power for detecting genes with a time effect. <i>BRIDGE</i> demonstrated competitive performance in batch effect reduction of confounded longitudinal microarray studies, both in simulated and a real data sets, and may serve as a useful preprocessing method for researchers conducting longitudinal microarray studies that include bridging samples.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2021-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9617207/pdf/nihms-1843789.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39586240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2021-12-01DOI: 10.1515/sagmb-2021-frontmatter4-6
{"title":"Frontmatter","authors":"","doi":"10.1515/sagmb-2021-frontmatter4-6","DOIUrl":"https://doi.org/10.1515/sagmb-2021-frontmatter4-6","url":null,"abstract":"","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.9,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43944170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}