Pub Date: 2024-09-30. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2023-0041
Wimarsha T Jayanetti, Sinjini Sikdar
In recent years, meta-analyzing summary results from multiple studies has become common practice in genomic research, leading to a significant improvement in statistical detection power compared to an individual genomic study. Meta-analysis methods that combine statistical estimates across studies are known to be more powerful than those combining statistical significance measures. An approach combining effect size estimates based on a fixed-effects model, called METAL, has gained widespread popularity for performing the former type of meta-analysis. In this article, we discuss the limitations of METAL due to its dependence on the theoretical null distribution, which can lead to incorrect significance testing results. Through various simulation studies and a real genomic data application, we show how modifying the z-scores in METAL using an empirical null distribution can significantly improve the results, especially in the presence of hidden confounders. For the estimation of the null distribution, we consider two different approaches and highlight the scenarios in which one null estimation approach outperforms the other. This article will allow researchers to gain insight into the importance of using an empirical null distribution in fixed-effects meta-analysis, as well as into choosing the appropriate empirical null distribution estimation approach.
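The core computation in a METAL-style fixed-effects meta-analysis, and the empirical-null rescaling of its z-scores, can be sketched as follows. This is a minimal illustration assuming a robust median/MAD estimator of the empirical null; the article compares two estimation approaches, and this simple estimator stands in for them rather than reproducing either.

```python
import numpy as np

def metal_z(effects, ses):
    """Inverse-variance weighted fixed-effects meta-analysis z-score."""
    w = 1.0 / np.asarray(ses, float) ** 2
    beta = np.sum(w * np.asarray(effects, float)) / np.sum(w)
    return beta * np.sqrt(np.sum(w))  # beta / se, with se = 1/sqrt(sum of weights)

def empirical_null_adjust(zs):
    """Rescale z-scores by an estimated empirical null N(mu, sigma^2).

    Uses a robust median/MAD estimate so that the (assumed sparse)
    non-null tails do not distort the null parameters.
    """
    zs = np.asarray(zs, float)
    mu = np.median(zs)
    sigma = np.median(np.abs(zs - mu)) / 0.6745  # MAD -> sd for a normal
    return (zs - mu) / sigma
```

Under a correct theoretical null the z-scores would already be N(0, 1); hidden confounders typically shift and inflate their distribution, which the rescaling undoes before significance is assessed.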
"Empirically adjusted fixed-effects meta-analysis methods in genomic studies." Wimarsha T Jayanetti, Sinjini Sikdar. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-09-30. DOI: 10.1515/sagmb-2023-0041.
Understanding a protein's function based solely on its amino acid sequence is a crucial but intricate task in bioinformatics, and it has traditionally proven difficult. However, recent years have witnessed the rise of deep learning as a powerful tool that has achieved significant success in protein function prediction. The strength of these models lies in their ability to automatically learn informative features from protein sequences, which can then be used to predict the protein's function. This study builds upon these advancements by proposing a novel model, CNN-CBAM+BiGRU, which incorporates a Convolutional Block Attention Module (CBAM) alongside bidirectional gated recurrent units (BiGRUs). CBAM acts as a spotlight, guiding the CNN to focus on the most informative parts of the protein data and leading to more accurate feature extraction. BiGRUs, a type of recurrent neural network (RNN), excel at capturing long-range dependencies within the protein sequence, which are essential for accurate function prediction. The proposed model thus integrates the strengths of both CNN-CBAM and BiGRU. The study's findings, validated through experimentation, showcase the effectiveness of this combined approach. On the human dataset, the proposed method outperforms the CNN-BIGRU+ATT model by +1.0 % for cellular components, +1.1 % for molecular functions, and +0.5 % for biological processes; on the yeast dataset, it outperforms the same model by +2.4 % for cellular components, +1.2 % for molecular functions, and +0.6 % for biological processes.
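As a rough illustration of the CBAM idea (not the paper's exact architecture, whose layer sizes and weights are unspecified here), channel attention re-weights CNN feature maps using average- and max-pooled descriptors passed through a shared MLP:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """CBAM-style channel attention for a 1D feature map.

    x  : (channels, length) features from a CNN over the sequence
    w1 : (hidden, channels) and w2 : (channels, hidden) -- a shared two-layer MLP
    """
    avg = x.mean(axis=1)                 # average-pooled channel descriptor
    mx = x.max(axis=1)                   # max-pooled channel descriptor
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    scale = sigmoid(mlp(avg) + mlp(mx))  # per-channel gate in (0, 1)
    return x * scale[:, None]
```

In the model described above, the gated feature map would then feed the BiGRU, which captures long-range dependencies along the sequence axis.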
"A CNN-CBAM-BIGRU model for protein function prediction." Lavkush Sharma, Akshay Deepak, Ashish Ranjan, Gopalakrishnan Krishnasamy. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-07-01. DOI: 10.1515/sagmb-2024-0004.
This article addresses the limitations of existing statistical models in analyzing and interpreting highly skewed miRNA-seq raw read count data that can range from zero to millions. A heavy-tailed model using discrete stable distributions is proposed as a novel approach to better capture the heterogeneity and extreme values commonly observed in miRNA-seq data. Additionally, the parameters of the discrete stable distribution are proposed as an alternative target for differential expression analysis. An R package for computing and estimating the discrete stable distribution is provided. The proposed model is applied to miRNA-seq raw counts from the Norwegian Women and Cancer Study (NOWAC) and the Cancer Genome Atlas (TCGA) databases. The goodness-of-fit is compared with the popular Poisson and negative binomial distributions, and the discrete stable distributions are found to give a better fit for both datasets. In conclusion, the use of discrete stable distributions is shown to potentially lead to more accurate modeling of the underlying biological processes.
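The kind of goodness-of-fit comparison described above can be sketched as follows; fitting the discrete stable distribution itself requires the authors' R package, so this hypothetical snippet only contrasts the Poisson and negative binomial baselines on overdispersed counts.

```python
import numpy as np
from scipy import stats

def fit_compare(counts):
    """Log-likelihoods of an MLE Poisson fit vs a method-of-moments
    negative binomial fit to a vector of raw read counts."""
    counts = np.asarray(counts)
    m, v = counts.mean(), counts.var(ddof=1)
    ll_pois = stats.poisson.logpmf(counts, m).sum()
    if v <= m:                       # no overdispersion: NB degenerates to Poisson
        return ll_pois, ll_pois
    p = m / v                        # scipy nbinom: mean = r(1-p)/p, var = mean/p
    r = m * p / (1.0 - p)
    ll_nb = stats.nbinom.logpmf(counts, r, p).sum()
    return ll_pois, ll_nb
```

On heavy-tailed miRNA-seq counts both baselines are expected to underfit the extreme tail, which is the gap the discrete stable family targets.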
"A heavy-tailed model for analyzing miRNA-seq raw read counts." Annika Krutto, Therese Haugdahl Nøst, Magne Thoresen. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-05-29. DOI: 10.1515/sagmb-2023-0016.
Pub Date: 2024-05-16. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2023-0034
Ragnhild Laursen, Lasse Maretty, Asger Hobolth
Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. Our novel estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting: they provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and their parametrization is more stable than that of the parameter-rich tri-nucleotide interaction signatures. We illustrate our framework in a large simulation study comparing against state-of-the-art methods, and we show results for three data sets of somatic mutation counts from patients with cancer of the breast, liver, and urinary tract.
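For orientation, plain (unparametrized) NMF under a Poisson likelihood — the model that a parametric, EM-based framework like the one above extends — can be fit with the classical multiplicative KL updates. This is a generic sketch, not the authors' estimation procedure:

```python
import numpy as np

def poisson_nmf(V, k, iters=500, seed=0):
    """NMF minimizing KL divergence, i.e. maximizing a Poisson likelihood.

    V : (mutation_types, samples) non-negative count matrix
    Returns signatures W (mutation_types, k) and exposures H (k, samples).
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 0.1
    H = rng.random((k, V.shape[1])) + 0.1
    for _ in range(iters):
        # classical Lee-Seung multiplicative updates for the KL objective
        W *= ((V / (W @ H)) @ H.T) / H.sum(axis=1)
        H *= (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
    return W, H
```

A parametric approach replaces the free columns of W with signatures generated by a log-linear model over nucleotide contexts, which is where the quasi-Poisson regression enters.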
"Flexible model-based non-negative matrix factorization with application to mutational signatures." Ragnhild Laursen, Lasse Maretty, Asger Hobolth. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-05-16. DOI: 10.1515/sagmb-2023-0034.
Pub Date: 2024-05-13. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2023-0038
Anand Hari, Edakkalathoor George Jinto, Divya Dennis, Kumarapillai Mohanan Nair Jagathnath Krishna, Preethi S George, Sivasevan Roshni, Aleyamma Mathew
Longitudinal time-to-event analysis is a statistical method for analyzing data in which covariates are measured repeatedly. In survival studies, the risk of an event is estimated using the Cox proportional hazards model or the extended Cox model for exogenous time-dependent covariates. However, these models are inappropriate for endogenous time-dependent covariates such as longitudinally measured biomarkers, e.g., carcinoembryonic antigen (CEA). Joint models that can simultaneously model the longitudinal covariates and the time-to-event data have been proposed as an alternative. The present study highlights the importance of choosing the baseline hazard to obtain more accurate risk estimates. We used colon cancer patient data to illustrate and compare four joint models that differ in the choice of baseline hazard [piecewise-constant Gauss-Hermite (GH), piecewise-constant pseudo-adaptive GH, Weibull accelerated failure time model with GH, and B-spline GH]. We conducted a simulation study to assess model consistency with varying sample sizes (N = 100, 250, 500) and censoring proportions (20 %, 50 %, 70 %). In the colon cancer patient data, based on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), the piecewise-constant pseudo-adaptive GH was found to be the best-fitting model. Despite the differences in model fit, the hazards obtained from the four models were similar. The study identified composite stage as a prognostic factor for the time-to-event outcome and the longitudinal outcome CEA as a dynamic predictor of overall survival in colon cancer patients. In the simulation study, Piecewise-PH-aGH was found to be the best model, with the lowest AIC and BIC values and the highest coverage probability (CP), while the bias and RMSE of all the models were competitive. Piecewise-PH-aGH showed the least bias and RMSE in most of the settings and took the shortest computation time, demonstrating its computational efficiency. This study is the first of its kind to discuss the choice of baseline hazards.
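As a minimal, self-contained illustration of the first baseline-hazard family (piecewise-constant), the MLE of each interval's hazard is simply events divided by person-time at risk in that interval. This hypothetical sketch covers only the baseline hazard, not the full joint model with its longitudinal submodel:

```python
import numpy as np

def piecewise_constant_hazard(times, events, knots):
    """MLE of a piecewise-constant hazard: events / person-time per interval.

    times  : observed follow-up times
    events : 1 if the event occurred, 0 if censored
    knots  : interior cut points defining the intervals
    """
    times = np.asarray(times, float)
    events = np.asarray(events, bool)
    edges = np.concatenate(([0.0], np.asarray(knots, float), [np.inf]))
    rates = []
    for a, b in zip(edges[:-1], edges[1:]):
        at_risk = times > a
        exposure = (np.minimum(times[at_risk], b) - a).sum()  # person-time in [a, b)
        d = (events & (times >= a) & (times < b)).sum()       # events in [a, b)
        rates.append(d / exposure if exposure > 0 else 0.0)
    return np.array(rates)
```

The GH and pseudo-adaptive GH variants compared in the study concern how the integral over the random effects is approximated, a layer that sits on top of this baseline-hazard choice.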
"Choice of baseline hazards in joint modeling of longitudinal and time-to-event cancer survival data." Anand Hari, Edakkalathoor George Jinto, Divya Dennis, Kumarapillai Mohanan Nair Jagathnath Krishna, Preethi S George, Sivasevan Roshni, Aleyamma Mathew. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-05-13. DOI: 10.1515/sagmb-2023-0038.
Pub Date: 2024-04-03. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2023-0027
Thomas Minotto, Philippe A Robert, Ingrid Hobæk Haff, Geir K Sandve
Simulation frameworks are useful for stress-testing predictive models when data are scarce, or for asserting model sensitivity to specific data distributions. Such frameworks often need to recapitulate several layers of data complexity, including emergent properties that arise implicitly from the interaction between simulation components. Antibody-antigen binding is a complex mechanism by which an antibody sequence wraps itself around an antigen with high affinity. In this study, we use a synthetic simulation framework for antibody-antigen folding and binding on a 3D lattice that includes full details of the spatial conformation of both molecules. We investigate how emergent properties arise in this framework, in particular the physical proximity of amino acids, their presence on the binding interface, and the binding status of a sequence, and relate these to the individual and pairwise contributions of amino acids in statistical models for binding prediction. We show that weights learnt from a simple logistic regression model align with some but not all features of amino acids involved in binding, and that predictive sequence binding patterns can be enriched. In particular, main effects correlated with the capacity of a sequence to bind any antigen, while statistical interactions were related to sequence specificity.
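A toy version of the statistical model in question — logistic regression on per-position amino-acid indicator features, optionally augmented with pairwise products as the "statistical interactions" — fits in a few lines. The encoding and dimensions here are hypothetical, not the study's actual feature set:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, lr=0.5, iters=3000):
    """Plain gradient-descent logistic regression; returns weights and bias."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def with_pairwise(X):
    """Append products of feature pairs (pairwise interaction terms)."""
    n, d = X.shape
    pairs = [X[:, i] * X[:, j] for i in range(d) for j in range(i + 1, d)]
    return np.column_stack([X] + pairs)
```

Comparing the fit with and without `with_pairwise` mirrors the main-effects-versus-interactions distinction the study draws between general binding capacity and sequence specificity.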
"Assessing the feasibility of statistical inference using synthetic antibody-antigen datasets." Thomas Minotto, Philippe A Robert, Ingrid Hobæk Haff, Geir K Sandve. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-04-03. DOI: 10.1515/sagmb-2023-0027.
Pub Date: 2024-02-19. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2022-0061
Md Rejuan Haque, Laura Kubatko
Methods based on the multi-species coalescent have been widely used in phylogenetic tree estimation from genome-scale DNA sequence data to understand the underlying evolutionary relationships between the sampled species. Evolutionary processes such as hybridization, which creates new species through interbreeding between two different species, necessitate inferring a species network instead of a species tree. A species tree is strictly bifurcating and thus fails to incorporate hybridization events, which require an internal node of degree three. Hence, given a DNA sequence data set, it is crucial to decide whether a tree or network analysis should be performed, a decision that depends on whether hybrid species are present among the sampled species. Although many methods have been proposed for hybridization detection, it is rare to find a technique that does so globally while considering a data generation mechanism that allows both hybridization and incomplete lineage sorting. In this paper, we consider hybridization and coalescence in a unified framework and propose a new test that can detect whether there are any hybrid species in a set of species of arbitrary size. Based on this global test of hybridization, one can decide whether a tree or network analysis is appropriate for a given data set.
"A global test of hybrid ancestry from genome-scale data." Md Rejuan Haque, Laura Kubatko. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-02-19. DOI: 10.1515/sagmb-2022-0061.
Pub Date: 2024-02-19. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2019-0050
Henry Linder, Yuping Zhang, Yunqi Wang, Zhengqing Ouyang
Developments in biotechnologies enable multi-platform data collection for functional genomic units apart from the gene. Profiling of non-coding microRNAs (miRNAs) is a valuable tool for understanding the molecular profile of the cell, both in its canonical functions and in the malignant behavior arising from complex diseases. We propose a graphical mixed-effects statistical model incorporating miRNA-gene target relationships. We implement an integrative pathway analysis that leverages measurements of miRNA activity for joint analysis with multimodal observations of gene activity, including gene expression, methylation, and copy number variation. We apply our analysis to a breast cancer dataset and consider differential activity in signaling pathways across breast tumor subtypes. We discuss specific signaling pathways and the effect of miRNA integration, and we publish an interactive data visualization that gives public access to the results of our analysis.
"Integrative pathway analysis with gene expression, miRNA, methylation and copy number variation for breast cancer subtypes." Henry Linder, Yuping Zhang, Yunqi Wang, Zhengqing Ouyang. Statistical Applications in Genetics and Molecular Biology (IF 0.9), Journal Article, published 2024-02-19. DOI: 10.1515/sagmb-2019-0050.
Pub Date: 2024-01-19. eCollection Date: 2024-01-01. DOI: 10.1515/sagmb-2022-0034
Zilu Liu, Asuman Seda Turkmen, Shili Lin
Population stratification (PS) is a major source of confounding in both single nucleotide polymorphism (SNP) and haplotype association studies. To address PS, principal component regression (PCR) and the linear mixed model (LMM) are the current standards for SNP associations, and they are also commonly borrowed for haplotype studies. However, the underfitting and overfitting problems introduced by PCR and LMM, respectively, have yet to be addressed. Furthermore, only a few theoretical approaches have been proposed to address PS specifically for haplotypes. In this paper, we propose a new method under the Bayesian LASSO framework, QBLstrat, to account for PS in identifying rare and common haplotypes associated with a continuous trait of interest. QBLstrat utilizes a large number of principal components (PCs) with appropriate priors to sufficiently correct for PS, while shrinking the estimates of unassociated haplotypes and PCs. We compare the performance of QBLstrat with the Bayesian counterparts of PCR and LMM and with a current method, haplo.stats. Extensive simulation studies and real data analyses show that QBLstrat is superior in controlling false positives while maintaining competitive power for identifying true positives under PS.
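The Bayesian LASSO's MAP estimate coincides with the classical L1-penalized fit, which conveys the shrinkage idea a method like QBLstrat relies on for unassociated haplotypes and PCs. A generic ISTA sketch of that penalized fit (not the authors' implementation, which is fully Bayesian):

```python
import numpy as np

def lasso_ista(X, y, lam, iters=1000):
    """ISTA for (1/2n)||y - Xw||^2 + lam * ||w||_1 (the LASSO / Laplace-prior MAP)."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(iters):
        z = w - X.T @ (X @ w - y) / (n * L)                     # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-threshold
    return w
```

In the haplotype setting, the design matrix would stack haplotype dosages alongside a large number of PCs; the soft-thresholding is what zeroes out the unassociated columns while the retained PCs absorb the stratification.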
Bayesian LASSO for population stratification correction in rare haplotype association studies.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10794901/pdf/
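QBLstrat itself is Bayesian, but its core idea — regressing the trait on haplotypes together with a large number of genotype-derived PCs while shrinking unassociated coefficients toward zero — has a simple frequentist analogue. The sketch below runs a hand-rolled ISTA lasso on invented data; it illustrates PC-plus-shrinkage adjustment only, not the published method or its priors.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n = X.shape[0]
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L for the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(1)
n, n_hap, n_pcs = 200, 8, 20

G = rng.integers(0, 3, size=(n, 40)).astype(float)  # genotype-like matrix
U, s, _ = np.linalg.svd(G - G.mean(axis=0), full_matrices=False)
pcs = U[:, :n_pcs] * s[:n_pcs]  # use many PCs, echoing the QBLstrat idea

hap = rng.integers(0, 2, size=(n, n_hap)).astype(float)  # haplotype dosages
# Only the first haplotype is truly associated; PC 1 acts as a confounder
y = 1.5 * hap[:, 0] + 0.5 * pcs[:, 0] + rng.normal(size=n)

beta = lasso_ista(np.column_stack([hap, pcs]), y, lam=0.1)
print(np.flatnonzero(np.abs(beta[:n_hap]) > 0.5))  # haplotypes retained
```

Including all the PCs without shrinkage would overfit (the LMM-style problem), while too few would underfit (the PCR-style problem); the penalty lets many PCs enter while zeroing out the irrelevant ones.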
High-throughput technologies have made high-dimensional settings increasingly common, providing opportunities for the development of high-dimensional mediation methods. We aim to provide useful guidance for researchers applying high-dimensional mediation analysis, and ideas for biostatisticians developing it, by summarizing and discussing recent advances in the field. Many challenges arise when extending single and multiple mediation analyses to high-dimensional settings. The development of high-dimensional mediation methods attempts to address these issues: screening true mediators, estimating mediation effects by variable selection, reducing the mediation dimension to resolve correlations between variables, and utilizing composite null hypothesis testing. Although these problems have been solved to some extent, some challenges remain. First, the correlation between mediators is rarely considered when variables are selected for mediation. Second, dimension reduction without incorporating prior biological knowledge makes the results difficult to interpret. In addition, a method of sensitivity analysis for the strict sequential ignorability assumption in high-dimensional mediation analysis is still lacking. An analyst needs to consider the applicability of each method when utilizing them, while a biostatistician could consider extensions and improvements to the methodology.
Mediation analysis method review of high throughput data.
Qiang Han, Yu Wang, Na Sun, Jiadong Chu, Wei Hu, Yueping Shen
Pub Date: 2023-11-29 | DOI: 10.1515/sagmb-2023-0031 | Statistical Applications in Genetics and Molecular Biology
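As background for the review above, the single-mediator building block that high-dimensional methods extend is the product-of-coefficients decomposition: regress the mediator on the exposure (path a), the outcome on exposure plus mediator (paths c' and b), and take a*b as the indirect (mediated) effect. A minimal sketch on simulated data, with all coefficients invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)                      # exposure
m = 0.6 * x + rng.normal(size=n)            # mediator; true a = 0.6
y = 0.5 * m + 0.3 * x + rng.normal(size=n)  # outcome; true b = 0.5, c' = 0.3

# Path a: mediator ~ exposure (with intercept)
a = np.linalg.lstsq(np.column_stack([np.ones(n), x]), m, rcond=None)[0][1]

# Paths c' and b: outcome ~ exposure + mediator
coef = np.linalg.lstsq(np.column_stack([np.ones(n), x, m]), y, rcond=None)[0]
c_prime, b = coef[1], coef[2]

indirect = a * b           # mediated effect, approximately 0.6 * 0.5 = 0.3
total = c_prime + indirect # total effect of exposure on outcome
print(round(indirect, 2), round(total, 2))
```

With thousands of candidate mediators, this two-regression scheme breaks down — hence the screening, variable selection, and composite-null testing strategies the review surveys, since the null "a = 0 or b = 0" is a composite hypothesis.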