Pub Date: 2024-09-30. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2023-0041
Wimarsha T Jayanetti, Sinjini Sikdar
In recent years, meta-analyzing summary results from multiple studies has become common practice in genomic research, leading to a significant improvement in statistical detection power compared to an individual genomic study. Meta-analysis methods that combine statistical estimates across studies are known to be more powerful than those combining statistical significance measures. An approach combining effect size estimates based on a fixed-effects model, called METAL, has gained widespread popularity for performing the former type of meta-analysis. In this article, we discuss the limitations of METAL due to its dependence on the theoretical null distribution, which can lead to incorrect significance testing results. Through various simulation studies and a real genomic data application, we show how modifying the z-scores in METAL using an empirical null distribution can significantly improve the results, especially in the presence of hidden confounders. For the estimation of the null distribution, we consider two different approaches and highlight the scenarios in which one null estimation approach outperforms the other. This article will allow researchers to gain insight into the importance of using an empirical null distribution in fixed-effects meta-analysis, as well as into choosing the appropriate empirical null distribution estimation approach.
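The core computation in a METAL-style fixed-effects meta-analysis, and the empirical-null rescaling of its z-scores, can be sketched as follows. This is a minimal illustration assuming a robust median/MAD estimator of the empirical null; the article compares two estimation approaches, and this simple estimator stands in for them rather than reproducing either.

```python
import numpy as np

def metal_z(effects, ses):
    """Inverse-variance weighted fixed-effects meta-analysis z-score."""
    w = 1.0 / np.asarray(ses, float) ** 2
    beta = np.sum(w * np.asarray(effects, float)) / np.sum(w)
    return beta * np.sqrt(np.sum(w))  # beta / se, with se = 1/sqrt(sum of weights)

def empirical_null_adjust(zs):
    """Rescale z-scores by an estimated empirical null N(mu, sigma^2).

    Uses a robust median/MAD estimate so that the (assumed sparse)
    non-null tails do not distort the null parameters.
    """
    zs = np.asarray(zs, float)
    mu = np.median(zs)
    sigma = np.median(np.abs(zs - mu)) / 0.6745  # MAD -> sd for a normal
    return (zs - mu) / sigma
```

Under a correct theoretical null the z-scores would already be N(0, 1); hidden confounders typically shift and inflate their distribution, which the rescaling undoes before significance is assessed.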
"Empirically adjusted fixed-effects meta-analysis methods in genomic studies." Wimarsha T Jayanetti, Sinjini Sikdar. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-09-30. DOI: 10.1515/sagmb-2023-0041.
Understanding a protein's function based solely on its amino acid sequence is a crucial but intricate task in bioinformatics, and it has traditionally proven difficult. However, recent years have witnessed the rise of deep learning as a powerful tool that has achieved significant success in protein function prediction. The strength of these models lies in their ability to automatically learn informative features from protein sequences, which can then be used to predict the protein's function. This study builds upon these advancements by proposing a novel model, CNN-CBAM+BiGRU, which incorporates a Convolutional Block Attention Module (CBAM) alongside bidirectional gated recurrent units (BiGRUs). CBAM acts as a spotlight, guiding the CNN to focus on the most informative parts of the protein data and leading to more accurate feature extraction. BiGRUs, a type of recurrent neural network (RNN), excel at capturing long-range dependencies within the protein sequence, which are essential for accurate function prediction. The proposed model thus integrates the strengths of both CNN-CBAM and BiGRU. The study's findings, validated through experimentation, showcase the effectiveness of this combined approach. On the human dataset, the proposed method outperforms the CNN-BIGRU+ATT model by +1.0 % for cellular components, +1.1 % for molecular functions, and +0.5 % for biological processes; on the yeast dataset, it outperforms the same model by +2.4 % for cellular components, +1.2 % for molecular functions, and +0.6 % for biological processes.
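As a rough illustration of the CBAM idea (not the paper's exact architecture, whose layer sizes and weights are unspecified here), channel attention re-weights CNN feature maps using average- and max-pooled descriptors passed through a shared MLP:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """CBAM-style channel attention for a 1D feature map.

    x  : (channels, length) features from a CNN over the sequence
    w1 : (hidden, channels) and w2 : (channels, hidden) -- a shared two-layer MLP
    """
    avg = x.mean(axis=1)                 # average-pooled channel descriptor
    mx = x.max(axis=1)                   # max-pooled channel descriptor
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    scale = sigmoid(mlp(avg) + mlp(mx))  # per-channel gate in (0, 1)
    return x * scale[:, None]
```

In the model described above, the gated feature map would then feed the BiGRU, which captures long-range dependencies along the sequence axis.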
"A CNN-CBAM-BIGRU model for protein function prediction." Lavkush Sharma, Akshay Deepak, Ashish Ranjan, Gopalakrishnan Krishnasamy. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-07-01. DOI: 10.1515/sagmb-2024-0004.
This article addresses the limitations of existing statistical models in analyzing and interpreting highly skewed miRNA-seq raw read count data that can range from zero to millions. A heavy-tailed model using discrete stable distributions is proposed as a novel approach to better capture the heterogeneity and extreme values commonly observed in miRNA-seq data. Additionally, the parameters of the discrete stable distribution are proposed as an alternative target for differential expression analysis. An R package for computing and estimating the discrete stable distribution is provided. The proposed model is applied to miRNA-seq raw counts from the Norwegian Women and Cancer Study (NOWAC) and the Cancer Genome Atlas (TCGA) databases. The goodness-of-fit is compared with the popular Poisson and negative binomial distributions, and the discrete stable distributions are found to give a better fit for both datasets. In conclusion, the use of discrete stable distributions is shown to potentially lead to more accurate modeling of the underlying biological processes.
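The kind of goodness-of-fit comparison described above can be sketched as follows; fitting the discrete stable distribution itself requires the authors' R package, so this hypothetical snippet only contrasts the Poisson and negative binomial baselines on overdispersed counts.

```python
import numpy as np
from scipy import stats

def fit_compare(counts):
    """Log-likelihoods of an MLE Poisson fit vs a method-of-moments
    negative binomial fit to a vector of raw read counts."""
    counts = np.asarray(counts)
    m, v = counts.mean(), counts.var(ddof=1)
    ll_pois = stats.poisson.logpmf(counts, m).sum()
    if v <= m:                       # no overdispersion: NB degenerates to Poisson
        return ll_pois, ll_pois
    p = m / v                        # scipy nbinom: mean = r(1-p)/p, var = mean/p
    r = m * p / (1.0 - p)
    ll_nb = stats.nbinom.logpmf(counts, r, p).sum()
    return ll_pois, ll_nb
```

On heavy-tailed miRNA-seq counts both baselines are expected to underfit the extreme tail, which is the gap the discrete stable family targets.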
"A heavy-tailed model for analyzing miRNA-seq raw read counts." Annika Krutto, Therese Haugdahl Nøst, Magne Thoresen. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-05-29. DOI: 10.1515/sagmb-2023-0016.
Pub Date: 2024-05-16. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2023-0034
Ragnhild Laursen, Lasse Maretty, Asger Hobolth
Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. Our novel estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting: they provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and their parametrization is more stable than that of the parameter-rich tri-nucleotide interaction signatures. We illustrate our framework in a large simulation study comparing against state-of-the-art methods, and we show results for three data sets of somatic mutation counts from patients with cancer of the breast, liver, and urinary tract.
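For orientation, plain (unparametrized) NMF under a Poisson likelihood — the model that a parametric, EM-based framework like the one above extends — can be fit with the classical multiplicative KL updates. This is a generic sketch, not the authors' estimation procedure:

```python
import numpy as np

def poisson_nmf(V, k, iters=500, seed=0):
    """NMF minimizing KL divergence, i.e. maximizing a Poisson likelihood.

    V : (mutation_types, samples) non-negative count matrix
    Returns signatures W (mutation_types, k) and exposures H (k, samples).
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 0.1
    H = rng.random((k, V.shape[1])) + 0.1
    for _ in range(iters):
        # classical Lee-Seung multiplicative updates for the KL objective
        W *= ((V / (W @ H)) @ H.T) / H.sum(axis=1)
        H *= (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
    return W, H
```

A parametric approach replaces the free columns of W with signatures generated by a log-linear model over nucleotide contexts, which is where the quasi-Poisson regression enters.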
"Flexible model-based non-negative matrix factorization with application to mutational signatures." Ragnhild Laursen, Lasse Maretty, Asger Hobolth. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-05-16. DOI: 10.1515/sagmb-2023-0034.
Pub Date: 2024-05-13. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2023-0038
Anand Hari, Edakkalathoor George Jinto, Divya Dennis, Kumarapillai Mohanan Nair Jagathnath Krishna, Preethi S George, Sivasevan Roshni, Aleyamma Mathew
Longitudinal time-to-event analysis is a statistical method for analyzing data in which covariates are measured repeatedly. In survival studies, the risk of an event is estimated using the Cox proportional hazards model or the extended Cox model for exogenous time-dependent covariates. However, these models are inappropriate for endogenous time-dependent covariates such as longitudinally measured biomarkers, e.g., carcinoembryonic antigen (CEA). Joint models that can simultaneously model the longitudinal covariates and the time-to-event data have been proposed as an alternative. The present study highlights the importance of choosing the baseline hazard to obtain more accurate risk estimates. We used colon cancer patient data to illustrate and compare four joint models that differ in the choice of baseline hazard [piecewise-constant Gauss-Hermite (GH), piecewise-constant pseudo-adaptive GH, Weibull accelerated failure time model with GH, and B-spline GH]. We conducted a simulation study to assess model consistency with varying sample sizes (N = 100, 250, 500) and censoring proportions (20 %, 50 %, 70 %). In the colon cancer patient data, based on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), the piecewise-constant pseudo-adaptive GH was found to be the best-fitting model. Despite the differences in model fit, the hazards obtained from the four models were similar. The study identified composite stage as a prognostic factor for the time-to-event outcome and the longitudinal outcome CEA as a dynamic predictor of overall survival in colon cancer patients. In the simulation study, Piecewise-PH-aGH was found to be the best model, with the lowest AIC and BIC values and the highest coverage probability (CP), while the bias and RMSE of all the models were competitive. Piecewise-PH-aGH showed the least bias and RMSE in most of the settings and took the shortest computation time, demonstrating its computational efficiency. This study is the first of its kind to discuss the choice of baseline hazards.
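As a minimal, self-contained illustration of the first baseline-hazard family (piecewise-constant), the MLE of each interval's hazard is simply events divided by person-time at risk in that interval. This hypothetical sketch covers only the baseline hazard, not the full joint model with its longitudinal submodel:

```python
import numpy as np

def piecewise_constant_hazard(times, events, knots):
    """MLE of a piecewise-constant hazard: events / person-time per interval.

    times  : observed follow-up times
    events : 1 if the event occurred, 0 if censored
    knots  : interior cut points defining the intervals
    """
    times = np.asarray(times, float)
    events = np.asarray(events, bool)
    edges = np.concatenate(([0.0], np.asarray(knots, float), [np.inf]))
    rates = []
    for a, b in zip(edges[:-1], edges[1:]):
        at_risk = times > a
        exposure = (np.minimum(times[at_risk], b) - a).sum()  # person-time in [a, b)
        d = (events & (times >= a) & (times < b)).sum()       # events in [a, b)
        rates.append(d / exposure if exposure > 0 else 0.0)
    return np.array(rates)
```

The GH and pseudo-adaptive GH variants compared in the study concern how the integral over the random effects is approximated, a layer that sits on top of this baseline-hazard choice.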
"Choice of baseline hazards in joint modeling of longitudinal and time-to-event cancer survival data." Anand Hari, Edakkalathoor George Jinto, Divya Dennis, Kumarapillai Mohanan Nair Jagathnath Krishna, Preethi S George, Sivasevan Roshni, Aleyamma Mathew. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-05-13. DOI: 10.1515/sagmb-2023-0038.
Pub Date: 2024-04-03. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2023-0027
Thomas Minotto, Philippe A Robert, Ingrid Hobæk Haff, Geir K Sandve
Simulation frameworks are useful for stress-testing predictive models when data are scarce, or for asserting model sensitivity to specific data distributions. Such frameworks often need to recapitulate several layers of data complexity, including emergent properties that arise implicitly from the interaction between simulation components. Antibody-antigen binding is a complex mechanism by which an antibody sequence wraps itself around an antigen with high affinity. In this study, we use a synthetic simulation framework for antibody-antigen folding and binding on a 3D lattice that includes full details of the spatial conformation of both molecules. We investigate how emergent properties arise in this framework, in particular the physical proximity of amino acids, their presence on the binding interface, and the binding status of a sequence, and relate these to the individual and pairwise contributions of amino acids in statistical models for binding prediction. We show that weights learnt from a simple logistic regression model align with some but not all features of amino acids involved in binding, and that predictive sequence binding patterns can be enriched. In particular, main effects correlated with the capacity of a sequence to bind any antigen, while statistical interactions were related to sequence specificity.
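A toy version of the statistical model in question — logistic regression on per-position amino-acid indicator features, optionally augmented with pairwise products as the "statistical interactions" — fits in a few lines. The encoding and dimensions here are hypothetical, not the study's actual feature set:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, lr=0.5, iters=3000):
    """Plain gradient-descent logistic regression; returns weights and bias."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def with_pairwise(X):
    """Append products of feature pairs (pairwise interaction terms)."""
    n, d = X.shape
    pairs = [X[:, i] * X[:, j] for i in range(d) for j in range(i + 1, d)]
    return np.column_stack([X] + pairs)
```

Comparing the fit with and without `with_pairwise` mirrors the main-effects-versus-interactions distinction the study draws between general binding capacity and sequence specificity.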
"Assessing the feasibility of statistical inference using synthetic antibody-antigen datasets." Thomas Minotto, Philippe A Robert, Ingrid Hobæk Haff, Geir K Sandve. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-04-03. DOI: 10.1515/sagmb-2023-0027.
Pub Date: 2024-02-19. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2022-0061
Md Rejuan Haque, Laura Kubatko
Methods based on the multi-species coalescent have been widely used in phylogenetic tree estimation from genome-scale DNA sequence data to understand the underlying evolutionary relationships between the sampled species. Evolutionary processes such as hybridization, which creates new species through interbreeding between two different species, necessitate inferring a species network instead of a species tree. A species tree is strictly bifurcating and thus fails to incorporate hybridization events, which require an internal node of degree three. Hence, given a DNA sequence data set, it is crucial to decide whether a tree or network analysis should be performed, a decision that depends on whether hybrid species are present among the sampled species. Although many methods have been proposed for hybridization detection, it is rare to find a technique that does so globally while considering a data generation mechanism that allows both hybridization and incomplete lineage sorting. In this paper, we consider hybridization and coalescence in a unified framework and propose a new test that can detect whether there are any hybrid species in a set of species of arbitrary size. Based on this global test of hybridization, one can decide whether a tree or network analysis is appropriate for a given data set.
"A global test of hybrid ancestry from genome-scale data." Md Rejuan Haque, Laura Kubatko. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-02-19. DOI: 10.1515/sagmb-2022-0061.
Pub Date: 2024-02-19. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2019-0050
Henry Linder, Yuping Zhang, Yunqi Wang, Zhengqing Ouyang
Developments in biotechnologies enable multi-platform data collection for functional genomic units apart from the gene. Profiling of non-coding microRNAs (miRNAs) is a valuable tool for understanding the molecular profile of the cell, both in its canonical functions and in the malignant behavior arising from complex diseases. We propose a graphical mixed-effects statistical model incorporating miRNA-gene target relationships. We implement an integrative pathway analysis that leverages measurements of miRNA activity for joint analysis with multimodal observations of gene activity, including gene expression, methylation, and copy number variation. We apply our analysis to a breast cancer dataset and consider differential activity in signaling pathways across breast tumor subtypes. We discuss specific signaling pathways and the effect of miRNA integration, and we publish an interactive data visualization that gives public access to the results of our analysis.
"Integrative pathway analysis with gene expression, miRNA, methylation and copy number variation for breast cancer subtypes." Henry Linder, Yuping Zhang, Yunqi Wang, Zhengqing Ouyang. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-02-19. DOI: 10.1515/sagmb-2019-0050.
Pub Date: 2024-01-19. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2022-0034
Zilu Liu, Asuman Seda Turkmen, Shili Lin
Population stratification (PS) is a major source of confounding in both single nucleotide polymorphism (SNP) and haplotype association studies. To address PS, principal component regression (PCR) and the linear mixed model (LMM) are the current standards for SNP associations, and they are also commonly borrowed for haplotype studies. However, the underfitting and overfitting problems introduced by PCR and LMM, respectively, have yet to be addressed. Furthermore, only a few theoretical approaches have been proposed to address PS specifically for haplotypes. In this paper, we propose a new method under the Bayesian LASSO framework, QBLstrat, to account for PS in identifying rare and common haplotypes associated with a continuous trait of interest. QBLstrat utilizes a large number of principal components (PCs) with appropriate priors to sufficiently correct for PS, while shrinking the estimates of unassociated haplotypes and PCs. We compare the performance of QBLstrat with the Bayesian counterparts of PCR and LMM and with a current method, haplo.stats. Extensive simulation studies and real data analyses show that QBLstrat is superior in controlling false positives while maintaining competitive power for identifying true positives under PS.
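The Bayesian LASSO's MAP estimate coincides with the classical L1-penalized fit, which conveys the shrinkage idea a method like QBLstrat relies on for unassociated haplotypes and PCs. A generic ISTA sketch of that penalized fit (not the authors' implementation, which is fully Bayesian):

```python
import numpy as np

def lasso_ista(X, y, lam, iters=1000):
    """ISTA for (1/2n)||y - Xw||^2 + lam * ||w||_1 (the LASSO / Laplace-prior MAP)."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(iters):
        z = w - X.T @ (X @ w - y) / (n * L)                     # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-threshold
    return w
```

In the haplotype setting, the design matrix would stack haplotype dosages alongside a large number of PCs; the soft-thresholding is what zeroes out the unassociated columns while the retained PCs absorb the stratification.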
Bayesian LASSO for population stratification correction in rare haplotype association studies.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10794901/pdf/
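QBLstrat itself is Bayesian, but its core idea — regressing the trait on haplotypes together with a large number of genotype-derived PCs while shrinking unassociated coefficients toward zero — has a simple frequentist analogue. The sketch below runs a hand-rolled ISTA lasso on invented data; it illustrates PC-plus-shrinkage adjustment only, not the published method or its priors.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n = X.shape[0]
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L for the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(1)
n, n_hap, n_pcs = 200, 8, 20

G = rng.integers(0, 3, size=(n, 40)).astype(float)  # genotype-like matrix
U, s, _ = np.linalg.svd(G - G.mean(axis=0), full_matrices=False)
pcs = U[:, :n_pcs] * s[:n_pcs]  # use many PCs, echoing the QBLstrat idea

hap = rng.integers(0, 2, size=(n, n_hap)).astype(float)  # haplotype dosages
# Only the first haplotype is truly associated; PC 1 acts as a confounder
y = 1.5 * hap[:, 0] + 0.5 * pcs[:, 0] + rng.normal(size=n)

beta = lasso_ista(np.column_stack([hap, pcs]), y, lam=0.1)
print(np.flatnonzero(np.abs(beta[:n_hap]) > 0.5))  # haplotypes retained
```

Including all the PCs without shrinkage would overfit (the LMM-style problem), while too few would underfit (the PCR-style problem); the penalty lets many PCs enter while zeroing out the irrelevant ones.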
High-throughput technologies have made high-dimensional settings increasingly common, providing opportunities for the development of high-dimensional mediation methods. We aim to provide useful guidance for researchers applying high-dimensional mediation analysis, and ideas for biostatisticians developing it, by summarizing and discussing recent advances in the field. Many challenges arise when extending single and multiple mediation analyses to high-dimensional settings. The development of high-dimensional mediation methods attempts to address these issues: screening true mediators, estimating mediation effects by variable selection, reducing the mediation dimension to resolve correlations between variables, and utilizing composite null hypothesis testing. Although these problems have been solved to some extent, some challenges remain. First, the correlation between mediators is rarely considered when variables are selected for mediation. Second, dimension reduction without incorporating prior biological knowledge makes the results difficult to interpret. In addition, a method of sensitivity analysis for the strict sequential ignorability assumption in high-dimensional mediation analysis is still lacking. An analyst needs to consider the applicability of each method when utilizing them, while a biostatistician could consider extensions and improvements to the methodology.
Mediation analysis method review of high throughput data.
Qiang Han, Yu Wang, Na Sun, Jiadong Chu, Wei Hu, Yueping Shen
Pub Date: 2023-11-29 | DOI: 10.1515/sagmb-2023-0031 | Statistical Applications in Genetics and Molecular Biology
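As background for the review above, the single-mediator building block that high-dimensional methods extend is the product-of-coefficients decomposition: regress the mediator on the exposure (path a), the outcome on exposure plus mediator (paths c' and b), and take a*b as the indirect (mediated) effect. A minimal sketch on simulated data, with all coefficients invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)                      # exposure
m = 0.6 * x + rng.normal(size=n)            # mediator; true a = 0.6
y = 0.5 * m + 0.3 * x + rng.normal(size=n)  # outcome; true b = 0.5, c' = 0.3

# Path a: mediator ~ exposure (with intercept)
a = np.linalg.lstsq(np.column_stack([np.ones(n), x]), m, rcond=None)[0][1]

# Paths c' and b: outcome ~ exposure + mediator
coef = np.linalg.lstsq(np.column_stack([np.ones(n), x, m]), y, rcond=None)[0]
c_prime, b = coef[1], coef[2]

indirect = a * b           # mediated effect, approximately 0.6 * 0.5 = 0.3
total = c_prime + indirect # total effect of exposure on outcome
print(round(indirect, 2), round(total, 2))
```

With thousands of candidate mediators, this two-regression scheme breaks down — hence the screening, variable selection, and composite-null testing strategies the review surveys, since the null "a = 0 or b = 0" is a composite hypothesis.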