Genetic Epidemiology: latest articles (page 9)

Gene-based association tests in family samples using GWAS summary statistics

IF 2.1, Zone 4 (Medicine), Q3, GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2024-02-05 DOI: 10.1002/gepi.22548

Peng Wang, Xiao Xu, Ming Li, Xiang-Yang Lou, Siqi Xu, Baolin Wu, Guimin Gao, Ping Yin, Nianjun Liu

Genome-wide association studies (GWAS) have led to rapid growth in the detection of genetic variants associated with various phenotypes. Owing to the large number of publicly accessible GWAS summary statistics and the difficulty of obtaining individual-level genotype data, many existing gene-based association tests have been adapted to require only GWAS summary statistics rather than individual-level data. However, these association tests are restricted to unrelated individuals and thus do not apply directly to family samples. Moreover, due to its flexibility and effectiveness, the linear mixed model has been increasingly utilized in GWAS to handle correlated data, such as family samples. It remains unknown, however, how to perform gene-based association tests in family samples using GWAS summary statistics estimated from the linear mixed model. In this study, we show that, when family size is negligible compared to the total sample size, the block-diagonal structure of the kinship matrix makes it possible to approximate the correlation matrix of marginal Z scores by the linkage disequilibrium (LD) matrix. Based on this result, current summary-statistics methods developed for unrelated individuals can be applied directly to family data without modification. Our simulation results demonstrate that the proposed strategy controls the type 1 error rate well in various situations. Finally, we illustrate the usefulness of the proposed approach with a dental caries GWAS data set.
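
To make the central idea concrete, the following is a minimal sketch of one summary-statistics gene-based test (a sum, or burden-type, test), assuming the correlation matrix of the marginal Z scores can be replaced by the LD matrix as the abstract describes. The function name `burden_test` and the toy numbers are hypothetical; this is not the authors' implementation, and the abstract does not say which gene-based tests were evaluated.

```python
import math

def burden_test(z_scores, ld):
    """Sum (burden-type) gene-based test from marginal Z scores.

    Assumption: under the null, Z ~ N(0, R) with R approximated by the
    LD matrix, so sum(Z) ~ N(0, 1'R1) and T = sum(Z)^2 / (1'R1) follows
    a chi-square distribution with 1 degree of freedom.
    """
    m = len(z_scores)
    total = sum(z_scores)
    variance = sum(ld[i][j] for i in range(m) for j in range(m))
    stat = total * total / variance
    # Survival function of chi-square(1): P(X > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical Z scores for three SNPs in one gene and their LD matrix.
z = [1.8, 2.1, 1.5]
ld = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
stat, p = burden_test(z, ld)
print(f"T = {stat:.3f}, p = {p:.4f}")
```

Ignoring the LD matrix (treating the Z scores as independent) would understate the variance of the sum whenever the SNPs are positively correlated, which is why the choice of correlation matrix matters for type 1 error control.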

Genetic Epidemiology, vol. 48, no. 3, pp. 103-113. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22548

Citations: 0

Mitigating type 1 error inflation and power loss in GxE PRS: Genotype–environment interaction in polygenic risk score models

IF 2.1, Zone 4 (Medicine), Q3, GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2024-02-01 DOI: 10.1002/gepi.22546

Dovini Jayasinghe, Md. Moksedul Momin, Kerri Beckmann, Elina Hyppönen, Beben Benyamin, S. Hong Lee

The use of polygenic risk score (PRS) models has transformed the field of genetics by enabling the prediction of complex traits and diseases based on an individual's genetic profile. However, the impact of genotype–environment interaction (GxE) on the performance and applicability of PRS models remains a crucial aspect to be explored. Existing genotype–environment interaction polygenic risk score (GxE PRS) models are often used inappropriately, which can result in inflated type 1 error rates and compromised results. In this study, we propose novel GxE PRS models that jointly incorporate additive and interaction genetic effects while also including an additional quadratic term for nongenetic covariates, enhancing their robustness against model misspecification. Through extensive simulations, we demonstrate that our proposed models outperform existing models in terms of controlling type 1 error rates and enhancing statistical power. Furthermore, we apply the proposed models to real data and report significant GxE effects. Specifically, we highlight the impact of our models on both quantitative and binary traits. For quantitative traits, we uncover the GxE modulation of genetic effects on body mass index by alcohol intake frequency. In the case of binary traits, we identify the GxE modulation of genetic effects on hypertension by waist-to-hip ratio. These findings underscore the importance of employing a robust model that effectively controls type 1 error rates, thus preventing the occurrence of spurious GxE signals. To facilitate the implementation of our approach, we have developed an R software package called GxEprs, specifically designed to detect and estimate GxE effects. Overall, our study highlights the importance of accurate GxE modeling and its implications for genetic risk prediction, while providing a practical tool to support further research in this area.
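
As a rough illustration of the kind of regression model described above, the sketch below builds the regressors for one plausible form: an intercept, an additive PRS, the environmental covariate E, its square, and an interaction PRS multiplied by E. The exact parameterisation is an assumption drawn only from the abstract's wording; the function name `gxe_design_row` is hypothetical and is not part of the GxEprs package.

```python
def gxe_design_row(prs_add, prs_gxe, env):
    """Regressors [1, PRS_add, E, E^2, PRS_gxe * E] for one individual.

    Assumed model form (not confirmed by the source):
        y = b0 + b1*PRS_add + b2*E + b3*E^2 + b4*(PRS_gxe * E) + error
    The E^2 term is the extra quadratic covariate term the abstract
    mentions as protection against model misspecification.
    """
    return [1.0, prs_add, env, env * env, prs_gxe * env]

# Hypothetical individuals: (additive PRS, interaction PRS, environment).
individuals = [(0.2, -0.1, 1.0), (-0.5, 0.3, 2.0)]
rows = [gxe_design_row(a, g, e) for a, g, e in individuals]
for row in rows:
    print(row)
```

Under this form, a test for GxE would focus on the coefficient of the last column; omitting the E^2 column is one way a nonlinear effect of E could be mistaken for an interaction.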

Genetic Epidemiology, vol. 48, no. 2, pp. 85-100. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22546

Citations: 0

Interval estimate of causal effect in summary data based Mendelian randomization in the presence of winner's curse

IF 2.1, Zone 4 (Medicine), Q3, GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2024-01-28 DOI: 10.1002/gepi.22545

Kai Wang

This research focuses on the interval estimation of the causal effect of an exposure on an outcome using the summary data-based Mendelian randomization (SMR) method while accounting for the winner's curse caused by the selection of single nucleotide polymorphism instruments. This issue is understudied and is important as the point estimate is biased. Since Fieller's theorem and its variations are not suitable for constructing a confidence interval, we use the box method. This box method is known to be conservative and thus provides a lower bound on the coverage level. To assess the performance of the box method, we use simulation studies and compare it with the support interval we proposed earlier and the Wald interval derived from the SMR method. All three methods are applied to a study of causal genes for Alzheimer's disease. Overall, the box method presents an alternative for constructing interval estimates for a causal effect while addressing the winner's curse issue.
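
For orientation, the sketch below computes the naive Wald interval for an SMR-style ratio estimate, using the first-order delta method and assuming the SNP-exposure and SNP-outcome estimates are independent. It does not correct for the winner's curse and is not the box method or the support interval; the function name and input values are hypothetical.

```python
import math

def smr_wald_interval(b_zx, se_zx, b_zy, se_zy, z_crit=1.959964):
    """Ratio estimate b_xy = b_zy / b_zx with a delta-method Wald interval.

    Assumptions: the two summary estimates are independent (two-sample
    setting), and no adjustment is made for instrument selection.
    """
    b_xy = b_zy / b_zx
    # First-order delta-method variance of a ratio of independent estimates.
    var = (b_xy ** 2) * ((se_zx / b_zx) ** 2 + (se_zy / b_zy) ** 2)
    se = math.sqrt(var)
    return b_xy, (b_xy - z_crit * se, b_xy + z_crit * se)

# Hypothetical SNP-exposure and SNP-outcome effect estimates.
estimate, (lower, upper) = smr_wald_interval(0.5, 0.05, 0.1, 0.02)
print(f"b_xy = {estimate:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the instrument is usually chosen for its strong SNP-exposure association, the denominator tends to be overestimated, which is one reason an interval of this naive form can be miscentred.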

Citations: 0

simmrd: An open-source tool to perform simulations in Mendelian randomization

IF 2.1, Zone 4 (Medicine), Q3, GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2024-01-23 DOI: 10.1002/gepi.22544

Noah Lorincz-Comi, Yihe Yang, Xiaofeng Zhu

Mendelian randomization (MR) has become a popular tool for inferring the causal effects of risk factors on disease. There are currently over 45 different methods available to perform MR, reflecting how active this research area is. It would be desirable to have a standard simulation environment in which to objectively evaluate existing and future methods. We present simmrd, open-source software for performing simulations to evaluate the performance of MR methods in a range of scenarios encountered in practice. Researchers can directly modify the simmrd source code so that the research community may arrive at a widely accepted framework for evaluating the performance of different MR methods.

Citations: 0

DYNATE: Localizing rare-variant association regions via multiple testing embedded in an aggregation tree

IF 2.1, Zone 4 (Medicine), Q3, GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2023-11-28 DOI: 10.1002/gepi.22542

Xuechan Li, John Pura, Andrew Allen, Kouros Owzar, Jianfeng Lu, Matthew Harms, Jichun Xie

Rare-variant (RV) genetic association studies enable researchers to uncover the variation in phenotypic traits left unexplained by common variation. Traditional single-variant analysis lacks power; thus, researchers have developed various methods to aggregate the effects of RVs across genomic regions to study their collective impact. Some existing methods utilize a static delineation of genomic regions, often resulting in suboptimal effect aggregation, as neutral subregions within the test region attenuate the signal. Other methods use varying windows to search for signals but often yield long regions containing many neutral RVs. To pinpoint short genomic regions enriched for disease-associated RVs, we developed a novel method, DYNamic Aggregation TEsting (DYNATE). DYNATE dynamically and hierarchically aggregates smaller genomic regions into larger ones and performs multiple testing for disease associations with a controlled weighted false discovery rate. DYNATE's main advantage lies in its strong ability to identify short genomic regions highly enriched for disease-associated RVs. Extensive numerical simulations demonstrate the superior performance of DYNATE under various scenarios compared with existing methods. We applied DYNATE to an amyotrophic lateral sclerosis study and identified a new gene, EPG5, harboring possibly pathogenic mutations.

Genetic Epidemiology, vol. 48, no. 1, pp. 42-55.

Citations: 0

Bias and mean squared error in Mendelian randomization with invalid instrumental variables

IF 2.1, Zone 4 (Medicine), Q3, GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2023-11-16 DOI: 10.1002/gepi.22541

Lu Deng, Sheng Fu, Kai Yu

Mendelian randomization (MR) is a statistical method that utilizes genetic variants as instrumental variables (IVs) to investigate causal relationships between risk factors and outcomes. Although MR has gained popularity in recent years due to its ability to analyze summary statistics from genome-wide association studies (GWAS), it requires a substantial number of single nucleotide polymorphisms (SNPs) as IVs to ensure sufficient power for detecting causal effects. Unfortunately, the complex genetic heritability of many traits can lead to the use of invalid IVs that affect both the risk factor and the outcome directly or through an unobserved confounder. This can result in biased and imprecise estimates, as reflected by a larger mean squared error (MSE). In this study, we focus on the widely used two-stage least squares (2SLS) method and derive formulas for its bias and MSE when estimating causal effects using invalid IVs. Using those formulas, we identify conditions under which the 2SLS estimate is unbiased and reveal how the independent or correlated pleiotropic effects influence the accuracy and precision of the 2SLS estimate. We validate these formulas through extensive simulation studies and demonstrate the application of those formulas in an MR study to evaluate the causal effect of the waist-to-hip ratio on various sleeping patterns. Our results can aid in designing future MR studies and serve as benchmarks for assessing more sophisticated MR methods.
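
The effect of an invalid instrument on 2SLS can be seen in a small deterministic example. The sketch below implements 2SLS for a single instrument and applies it to toy data in which the true causal effect is 0.3; adding a direct (pleiotropic) effect of the instrument on the outcome shifts the estimate by the ratio of the direct effect to the instrument strength. The data and function names are hypothetical and this single-instrument case is far simpler than the multi-SNP setting analysed in the paper.

```python
def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def two_stage_least_squares(g, x, y):
    """2SLS with one instrument g, exposure x, and outcome y."""
    # Stage 1: regress the exposure on the instrument.
    slope1 = cov(g, x) / cov(g, g)
    g_bar, x_bar = mean(g), mean(x)
    x_hat = [x_bar + slope1 * (gi - g_bar) for gi in g]
    # Stage 2: regress the outcome on the fitted exposure.
    return cov(x_hat, y) / cov(x_hat, x_hat)

# Toy data: u is an unobserved confounder, uncorrelated with g by design.
g = [0, 1, 2, 0, 1, 2]
u = [1, -1, 0, -1, 1, 0]
x = [0.5 * gi + ui for gi, ui in zip(g, u)]              # instrument strength 0.5
y_valid = [0.3 * xi + ui for xi, ui in zip(x, u)]         # true causal effect 0.3
y_invalid = [yi + 0.2 * gi for yi, gi in zip(y_valid, g)]  # direct effect 0.2

est_valid = two_stage_least_squares(g, x, y_valid)
est_invalid = two_stage_least_squares(g, x, y_invalid)
print(round(est_valid, 6), round(est_invalid, 6))
```

With the valid instrument the estimate recovers 0.3 despite confounding; with the invalid instrument it equals 0.3 + 0.2 / 0.5 = 0.7, so the bias grows as the instrument becomes weaker.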

Genetic Epidemiology, vol. 48, no. 1, pp. 27-41.

Citations: 0

Limitation of permutation-based differential correlation analysis

IF 2.1, Zone 4 (Medicine), Q3, GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2023-11-10 DOI: 10.1002/gepi.22540

Hoseung Song, Michael C. Wu

The comparison of biological systems, through the analysis of molecular changes under different conditions, has played a crucial role in the progress of modern biological science. Specifically, differential correlation analysis (DCA) has been employed to determine whether relationships between genomic features differ across conditions or outcomes. Because ascertaining the null distribution of test statistics that capture variations in correlation is challenging, several DCA methods utilize permutation, which can relax parametric (e.g., normality) assumptions. However, permutation is often problematic for DCA because it violates the assumption that samples are exchangeable under the null. Here, we examine the limitations of permutation-based DCA and investigate instances where it exhibits poor performance. Experimental results show that permutation-based DCA often fails to control the type I error under the null hypothesis of equal correlation structures.
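
The generic procedure under discussion can be sketched as follows: compute the correlation between two features within each condition, take the absolute difference as the test statistic, and compare it with the values obtained after shuffling condition labels. This is a minimal illustration with simulated toy data and hypothetical function names, not the authors' experimental code. Pooling and reshuffling samples is exactly the step that presumes exchangeability, which can fail when the two conditions share a correlation but differ in other respects such as variances.

```python
import math
import random

def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def permutation_dca(group_a, group_b, n_perm=500, seed=0):
    """Label-permutation p-value for |r_a - r_b|."""
    rng = random.Random(seed)
    observed = abs(pearson(group_a) - pearson(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign condition labels at random
        stat = abs(pearson(pooled[:n_a]) - pearson(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data: correlated features in condition A, independent in condition B.
data_rng = random.Random(1)
group_a = []
for _ in range(30):
    x_val = data_rng.gauss(0, 1)
    group_a.append((x_val, x_val + data_rng.gauss(0, 0.5)))
group_b = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(30)]

p_value = permutation_dca(group_a, group_b)
print(f"permutation p-value: {p_value:.4f}")
```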

Citations: 0

Correction to “Abstracts”

IF 2.1, Zone 4 (Medicine), Q3, GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2023-11-09 DOI: 10.1002/gepi.22543

(2023), Abstracts. Genetic Epidemiology, 47: 520–581. https://doi.org/10.1002/gepi.22539

In the originally published Abstracts, there were authors missing for “Two-sample Mendelian Randomization Study of Circulating Metabolites and Prostate Cancer Risk in Hispanic Populations” (abstract 49). The correct authors and affiliations appear below and have been updated on the online version of the abstracts.

Harriett Fuller¹, Rebecca Rohde², Heather Highland², Jiayi Shen³, Bing Yu⁴, Eric Boerwinkle⁴, Megan Grove⁴, Kari E. North², David V. Conti³, Christopher A. Haiman³, Kristin Young², Burcu F. Darst¹

¹Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, Washington, USA

²Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA

³Department of Population and Public Health Sciences, Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, California, USA

⁴School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas, USA

We apologize for this error.

Genetic Epidemiology, vol. 47, no. 8, p. 642. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22543

Citations: 0

Haplotype reconstruction for genetically complex regions with ambiguous genotype calls: Illustration by the KIR gene region

IF 2.1, Zone 4 (Medicine), Q3, GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2023-10-13 DOI: 10.1002/gepi.22538

Lars L. J. van der Burg, Liesbeth C. de Wreede, Henning Baldauf, Jürgen Sauter, Johannes Schetelig, Hein Putter, Stefan Böhringer

Advances in DNA sequencing technologies have enabled genotyping of complex genetic regions exhibiting copy number variation and high allelic diversity, yet it is impossible to derive exact genotypes in all cases, often resulting in ambiguous genotype calls, that is, partially missing data. An example of such a gene region is the killer-cell immunoglobulin-like receptor (KIR) genes. These genes are of special interest in the context of allogeneic hematopoietic stem cell transplantation. For such complex gene regions, current haplotype reconstruction methods are not feasible as they cannot cope with the complexity of the data. We present an expectation–maximization (EM) algorithm to estimate haplotype frequencies (HTFs) that deals with the missing data components and takes into account linkage disequilibrium (LD) between genes. To cope with the exponential increase in the number of haplotypes as genes are added, we add three components to a standard EM-algorithm implementation. First, reconstruction is performed iteratively, adding one gene at a time. Second, after each step, haplotypes with frequencies below a threshold are collapsed into a rare haplotype group. Third, the HTF of the rare haplotype group is profiled in subsequent iterations to improve estimates. A simulation study evaluates the effect of combining information from multiple genes on the estimates of these frequencies. We show that estimated HTFs are approximately unbiased. Our simulation study shows that the EM algorithm is able to combine information from multiple genes when LD is high, whereas increased ambiguity levels increase bias. Linear regression models based on this EM show that a large number of haplotypes can be problematic for unbiased effect size estimation and that models need to be sparse. In a real data analysis of KIR genotypes, we compare HTFs to those obtained in an independent study. Our new EM-algorithm-based method is the first to account for the full genetic architecture of complex gene regions, such as the KIR gene region. This algorithm can handle the numerous observed ambiguities and allows for the collapsing of haplotypes to perform implicit dimension reduction. Combining information from multiple genes improves haplotype reconstruction.
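
For readers unfamiliar with EM-based haplotype frequency estimation, the sketch below shows the textbook case of two biallelic loci, where the only ambiguous genotype is the double heterozygote (its phase is either AB/ab or Ab/aB). This is a deliberately simplified stand-in: it does not handle copy number variation, multi-allelic genes, stepwise gene addition, or collapsing of rare haplotypes as described in the abstract, and the function name and counts are hypothetical.

```python
def em_two_locus(known, n_double_het, n_iter=200, tol=1e-12):
    """EM estimate of haplotype frequencies for two biallelic loci.

    known: counts of haplotypes that can be read off unambiguously.
    n_double_het: number of individuals heterozygous at both loci.
    """
    total = sum(known.values()) + 2 * n_double_het
    freqs = {h: 0.25 for h in ("AB", "Ab", "aB", "ab")}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AB/ab (cis).
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        weight = cis / (cis + trans) if cis + trans > 0 else 0.5
        exp_cis = n_double_het * weight
        exp_trans = n_double_het - exp_cis
        # M-step: update frequencies from known plus expected counts.
        new = {
            "AB": (known["AB"] + exp_cis) / total,
            "ab": (known["ab"] + exp_cis) / total,
            "Ab": (known["Ab"] + exp_trans) / total,
            "aB": (known["aB"] + exp_trans) / total,
        }
        delta = max(abs(new[h] - freqs[h]) for h in new)
        freqs = new
        if delta < tol:
            break
    return freqs

# Hypothetical counts: 80 unambiguous haplotypes and 10 double heterozygotes.
known_counts = {"AB": 40, "Ab": 5, "aB": 5, "ab": 30}
freqs = em_two_locus(known_counts, n_double_het=10)
for hap, f in freqs.items():
    print(hap, round(f, 4))
```

When LD is strong, as in these toy counts, most of the ambiguous individuals are assigned to the AB/ab phase, which mirrors the abstract's point that information from linked genes helps resolve ambiguity.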

DNA测序技术的进步使得能够对表现出拷贝数变异和高等位基因多样性的复杂遗传区域进行基因分型，但不可能在所有情况下都得出确切的基因型，这往往导致基因型调用不明确，即部分缺失数据。这种基因区域的一个例子是杀伤细胞免疫球蛋白样受体（KIR）基因。这些基因在异基因造血干细胞移植中具有特殊的意义。对于这样复杂的基因区域，目前的单倍型重建方法是不可行的，因为它们无法应对数据的复杂性。我们提出了一种期望最大化（EM）算法来估计单倍型频率（HTF），该算法处理缺失的数据成分，并考虑基因之间的连锁不平衡（LD）。为了应对基因添加后单倍型数量的指数增长，我们在标准EM算法实现中添加了三个组件。首先，重复进行重建，一次添加一个基因。其次，在每一步之后，频率低于阈值的单倍型在一个罕见的单倍型组中崩溃。第三，在随后的迭代中对罕见单倍型组的HTF进行了分析，以改进估计。一项模拟研究评估了组合多个基因的信息对这些频率估计的影响。我们证明估计的传热函数是近似无偏的。我们的模拟研究表明，当LD高时，EM算法能够组合来自多个基因的信息，而模糊度的增加会增加偏差。基于该EM的线性回归模型表明，大量单倍型对于无偏效应大小估计可能存在问题，并且模型需要稀疏。在KIR基因型的真实数据分析中，我们将HTFs与独立研究中获得的HTFs进行了比较。我们新的基于EM算法的方法是第一个考虑复杂基因区域（如KIR基因区域）的完整遗传结构的方法。该算法可以处理大量观察到的模糊性，并允许单倍型的折叠来执行隐式降维。结合来自多个基因的信息可以改善单倍型的重建。

Genetic Epidemiology, volume 48, issue 1, pages 3-26. Pub Date : 2023-10-13 DOI: 10.1002/gepi.22538 PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22538

Citations: 0

Data-adaptive and pathway-based tests for association studies between somatic mutations and germline variations in human cancers

IF 2.1 Zone 4 (Medicine) Q3 GENETICS & HEREDITY

Genetic Epidemiology

Pub Date : 2023-10-11 DOI: 10.1002/gepi.22537

Zhongyuan Chen, Han Liang, Peng Wei

Cancer is a disease driven by a combination of inherited genetic variants and somatic mutations. Recently available large-scale sequencing data of cancer genomes have provided an unprecedented opportunity to study the interactions between them. However, previous studies on this topic have been limited by simple, low statistical power tests such as Fisher's exact test. In this paper, we design data-adaptive and pathway-based tests based on the score statistic for association studies between somatic mutations and germline variations. Previous research has shown that two single-nucleotide polymorphism (SNP)-set-based association tests, adaptive sum of powered score (aSPU) and data-adaptive pathway-based (aSPUpath) tests, increase the power in genome-wide association studies (GWASs) with a single disease trait in a case–control study. We extend aSPU and aSPUpath to multi-traits, that is, somatic mutations of multiple genes in a cohort study, allowing extensive information aggregation at both SNP and gene levels. $p$-values from different parameters assuming varying genetic architecture are combined to yield data-adaptive tests for somatic mutations and germline variations. Extensive simulations show that, in comparison with some commonly used methods, our data-adaptive somatic mutations/germline variations tests can be applied to multiple germline SNPs/genes/pathways, and generally have much higher statistical powers while maintaining the appropriate type I error. The proposed tests are applied to a large-scale real-world International Cancer Genome Consortium whole genome sequencing data set of 2583 subjects, detecting more significant and biologically relevant associations compared with the other existing methods on both gene and pathway levels.
Our study has systematically identified the associations between various germline variations and somatic mutations across different cancer types, which potentially provides valuable utility for cancer risk prediction, prognosis, and therapeutics.
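
For readers unfamiliar with the single-trait aSPU test that the abstract builds on, the following is a minimal permutation-based sketch. It is not the authors' multi-trait extension, and it makes several assumptions that the abstract does not state: a quantitative or binary trait with no covariates, the score vector taken as $U_j = \sum_i (y_i - \bar{y})\,x_{ij}$, and the power set $\gamma \in \{1, 2, 3, 4, \infty\}$. The function name `aspu_test` and all parameter names are hypothetical.

```python
import numpy as np

def aspu_test(X, y, gammas=(1, 2, 3, 4, np.inf), n_perm=999, seed=0):
    """Permutation sketch of SPU(gamma) and aSPU for one trait, no covariates.

    X: (n, p) genotype matrix; y: (n,) trait vector.
    Returns (p_spu, p_aspu): one p-value per gamma, plus the adaptive p-value.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)  # centre each SNP column

    def spu(yv):
        # Score vector under the null, then one SPU statistic per gamma.
        U = Xc.T @ (yv - yv.mean())
        return np.array([np.max(np.abs(U)) if np.isinf(g) else np.sum(U ** g)
                         for g in gammas])

    # Absolute values give two-sided tests (odd gammas can be negative).
    t_obs = np.abs(spu(y))
    t_null = np.abs(np.array([spu(rng.permutation(y)) for _ in range(n_perm)]))

    # Permutation p-value of each SPU(gamma) test.
    p_spu = (1 + (t_null >= t_obs).sum(axis=0)) / (n_perm + 1)

    # p-value of every null replicate relative to the other replicates,
    # reusing the same permutations instead of a second permutation layer.
    others_ge = (t_null[None, :, :] >= t_null[:, None, :]).sum(axis=1) - 1
    p_null = (1 + others_ge) / n_perm

    # aSPU: take the smallest p-value over gamma and calibrate it against
    # the null distribution of that minimum.
    min_obs = p_spu.min()
    min_null = p_null.min(axis=1)
    p_aspu = (1 + (min_null <= min_obs).sum()) / (n_perm + 1)
    return p_spu, p_aspu
```

Small values of $\gamma$ behave like burden-type sums and large values approach a maximum-statistic test, so taking the minimum p-value over $\gamma$ is what lets the test adapt to sparse or dense signal; the second calibration step accounts for having selected the best $\gamma$.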

Genetic Epidemiology, volume 47, issue 8, pages 617-636.

Citations: 0