Adjusting for principal components can induce collider bias in genome-wide association studies.

Impact factor: 4 | Zone 2 (Biology) | Q1 (Genetics & Heredity) | PLoS Genetics | Pub Date: 2024-12-16 | eCollection Date: 2024-12-01 | DOI: 10.1371/journal.pgen.1011242
Kelsey E Grinde, Brian L Browning, Alexander P Reiner, Timothy A Thornton, Sharon R Browning
PLoS Genetics 20(12): e1011242. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684764/pdf/
Citations: 0

Abstract

Principal component analysis (PCA) is widely used to control for population structure in genome-wide association studies (GWAS). Top principal components (PCs) typically reflect population structure, but challenges arise in deciding how many PCs are needed and ensuring that PCs do not capture other artifacts such as regions with atypical linkage disequilibrium (LD). In response to the latter, many groups suggest performing LD pruning or excluding known high LD regions prior to PCA. However, these suggestions are not universally implemented and the implications for GWAS are not fully understood, especially in the context of admixed populations. In this paper, we investigate the impact of pre-processing and the number of PCs included in GWAS models in African American samples from the Women's Health Initiative SNP Health Association Resource and two Trans-Omics for Precision Medicine Whole Genome Sequencing Project contributing studies (Jackson Heart Study and Genetic Epidemiology of Chronic Obstructive Pulmonary Disease Study). In all three samples, we find the first PC is highly correlated with genome-wide ancestry whereas later PCs often capture local genomic features. The pattern of which, and how many, genetic variants are highly correlated with individual PCs differs from what has been observed in prior studies focused on European populations and leads to distinct downstream consequences: adjusting for such PCs yields biased effect size estimates and elevated rates of spurious associations due to the phenomenon of collider bias. Excluding high LD regions identified in previous studies does not resolve these issues. LD pruning proves more effective, but the optimal choice of thresholds varies across datasets. Altogether, our work highlights unique issues that arise when using PCA to control for ancestral heterogeneity in admixed populations and demonstrates the importance of careful pre-processing and diagnostics to ensure that PCs capturing multiple local genomic features are not included in GWAS models.
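
The collider-bias mechanism described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions and not the authors' analysis. It does not run PCA on real genotypes. Instead, a hypothetical covariate `pc` is built directly as a noisy sum of two independent variants, standing in for a PC that captures multiple local genomic features. The allele frequency (0.3), effect size (0.5), noise scale, and the helper `ols_coefs` are all illustrative choices. Variant `g1` affects the trait and variant `g2` does not, so any nonzero estimate for `g2` is spurious.

```python
import numpy as np

def ols_coefs(y, *covariates):
    """Return OLS coefficients (intercept first) of y on the given covariates."""
    X = np.column_stack([np.ones_like(y), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 20_000

# Two independent variants coded as 0/1/2 allele counts (assumed frequency 0.3).
g1 = rng.binomial(2, 0.3, size=n).astype(float)
g2 = rng.binomial(2, 0.3, size=n).astype(float)

# The trait depends on g1 only; g2 has no true effect.
y = 0.5 * g1 + rng.normal(size=n)

# Hypothetical "PC" loaded on both variants: a common effect (collider) of g1 and g2.
pc = g1 + g2 + rng.normal(scale=0.5, size=n)

# Estimated effect of the null variant g2, without and with adjustment for pc.
beta_unadjusted = ols_coefs(y, g2)[1]
beta_adjusted = ols_coefs(y, g2, pc)[1]

print(f"g2 effect, no PC adjustment: {beta_unadjusted:+.3f}")
print(f"g2 effect, adjusted for pc:  {beta_adjusted:+.3f}")
```

In this toy setup the unadjusted estimate for `g2` stays close to zero, while conditioning on `pc` links `g2` to `g1` and therefore to the trait, giving a clearly negative estimate for a variant with no true effect.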

Source journal: PLoS Genetics (Genetics & Heredity)
Self-citation rate: 2.20%
Articles published: 438
About the journal: PLOS Genetics is run by an international Editorial Board, headed by the Editors-in-Chief, Greg Barsh (HudsonAlpha Institute of Biotechnology, and Stanford University School of Medicine) and Greg Copenhaver (The University of North Carolina at Chapel Hill). Articles published in PLOS Genetics are archived in PubMed Central and cited in PubMed.