
Genetic Epidemiology: Latest Articles

Integrative Harmonization of Phenotypic and Genomic Data Improves Bone Mineral Density Prediction in Multi-Study Osteoporosis Research
IF 3.8, Zone 4 (Medicine), Q3 GENETICS & HEREDITY, Pub Date: 2026-02-01, DOI: 10.1002/gepi.70028
Anqi Liu, Jianing Liu, Lang Wu, Qing Wu

Harmonizing osteoporosis-related data across multiple data sets is essential for improving the accuracy and generalizability of bone mineral density (BMD) assessments. This study developed a harmonization framework to standardize phenotypic and genomic variables across three major US osteoporosis data sets: GDBF, GWAS, and NHANES. We standardized key phenotypic variables (BMD, body mass index (BMI), age, sex, and race/ethnicity) using cohort-specific data dictionaries and applied multiple imputation by chained equations (MICE) to manage missing data. Genomic data were harmonized using principal component analysis (PCA)-based batch effect corrections. Residual regression methods were applied to standardize BMD values. The effectiveness of harmonization on BMD prediction was evaluated using generalized estimating equations (GEEs) and mixed-effects models. Post-harmonization, inter-study variability in BMI was significantly reduced (Ω² = 0.0028), and BMD associations with covariates remained consistent across data sets. Harmonized models showed improved predictive performance, with explained variance in BMD increasing (R² = 0.14%). PCA confirmed the effective alignment of genetic data, reducing batch effects and improving cross-study compatibility. This study demonstrates the feasibility and effectiveness of harmonizing phenotypic and genomic data for osteoporosis research. The harmonization framework enhances BMD prediction accuracy, supports more inclusive osteoporosis risk assessment, and improves the integration of multi-cohort data sets for future research. These findings highlight the potential of data harmonization in advancing precision medicine for osteoporosis prevention and management.
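
The residual-regression step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a single covariate (age) and a fit within one study; the abstract does not say which covariates or software were used, so the function name `residualize` and the toy values are hypothetical.

```python
from statistics import mean, pstdev

def residualize(y, x):
    """Regress y on x by ordinary least squares and return z-scored residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed value minus the value predicted from the covariate.
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sd = pstdev(resid)
    return [r / sd for r in resid]

# Hypothetical toy data for one study: age (years) and BMD (g/cm^2).
age = [50, 55, 60, 65, 70, 75]
bmd = [1.02, 0.99, 0.97, 0.90, 0.91, 0.85]
z = residualize(bmd, age)
```

Applying the same function separately within each study would put the adjusted BMD values on a common scale (mean 0, standard deviation 1), which is one plausible reading of "standardize BMD values".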

Genetic Epidemiology 50(1), e70028. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12836449/pdf/
Citations: 0
Extending the Use of Mendelian Randomisation With Non-Inherited Variants to Assess Socially Transmitted Parental Exposures Under Assortative Mating
IF 3.8, Zone 4 (Medicine), Q3 GENETICS & HEREDITY, Pub Date: 2026-01-21, DOI: 10.1002/gepi.70031
Benjamin Woolf, Amy Mason, Chin Yang Shapland, Hyunseung Kang, Hannah M. Sallis, Stephen Burgess, Marcus R. Munafò

A longstanding aim of developmental psychology and epidemiology is to understand the causal effects of parental phenotypes on offspring outcomes. Traditional approaches often fail to account for confounding and reverse causation. We evaluate the use of Mendelian randomisation with non-inherited variants (MR-NIV) to address these limitations. MR-NIV leverages non-inherited genetic variants to instrument the parental phenotype independent of the offspring's genotype. We used Directed Acyclic Graphs and simulations to validate MR-NIV and explore robustness to assortative mating. In contrast to an alternative MR method which adjusts the parental genotype for offspring genotype, MR-NIV can be robust to assortative mating when used without trio data. In settings without trio data, MR-NIV outperformed the adjustment method. The adjustment method outperformed MR-NIV in settings with trio data. Applying MR-NIV to the Avon Longitudinal Study of Parents and Children, we assessed the causal effect of parental smoking on offspring smoking initiation at age 16. Results were consistent with observational studies, suggesting a meaningful increase in the risk of offspring smoking due to parental smoking. However, larger sample sizes will be necessary to provide a precise answer. MR-NIV offers a promising extension of Mendelian randomisation for studying the developmental environment.

Genetic Epidemiology 50(1). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12820921/pdf/
Citations: 0
Cross-Ancestry Polygenic Prediction: Comparing Methods and Assessing Transferability Across Traits
IF 3.8, Zone 4 (Medicine), Q3 GENETICS & HEREDITY, Pub Date: 2026-01-21, DOI: 10.1002/gepi.70029
Md. Moksedul Momin, Xuan Zhou, Muktar Ahmed, Elina Hyppönen, Beben Benyamin, S. Hong Lee

Accurate prediction of disease risk and other complex traits across different populations is essential for clinical and research purposes. However, genetic differences among ancestries, such as allelic frequencies and genetic architecture, can affect the performance of polygenic risk score (PGS) methods in cross-ancestry prediction. To address this issue, we conducted a formal test of seven polygenic prediction methods applicable across ancestries for five traits (BMI, standing height, LDL-, HDL- and total-cholesterol) from the UK Biobank dataset. We demonstrate that GBLUP and PRS-CSx outperformed other methods for highly polygenic traits like height and BMI. In contrast, PRSice and PolyPred performed best for less polygenic traits like cholesterol, with PRS-CSx being comparable with larger sample sizes. We also observed that utilizing concordant SNPs, which have the same effect direction across diverse ancestries, can improve the accuracy of cross-ancestry PGS models. Furthermore, we found that the transferability of PGS across ancestries varied depending on the trait. Understanding the strengths and limitations of different methods and approaches is important for future methodological development and improvement, enabling better interpretation and application of PGS results in clinical and research settings.
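
The idea of restricting a score to concordant SNPs can be sketched as follows. This is only an illustration: it assumes concordance means matching sign of the effect estimate in two ancestry-specific GWAS, and the function names, SNP IDs, and effect sizes are hypothetical rather than taken from the paper.

```python
def concordant_snps(beta_a, beta_b):
    """Return SNP IDs whose effect estimates share a sign in both ancestries."""
    return [snp for snp in beta_a
            if snp in beta_b and beta_a[snp] * beta_b[snp] > 0]

def polygenic_score(dosages, weights, snps):
    """Weighted sum of allele dosages (0, 1 or 2) over the chosen SNPs."""
    return sum(weights[s] * dosages[s] for s in snps)

# Hypothetical per-SNP effect sizes from two ancestry-specific GWAS.
beta_eur = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.02}
beta_afr = {"rs1": 0.08, "rs2": 0.03, "rs3": 0.01}

keep = concordant_snps(beta_eur, beta_afr)  # rs2 is dropped (opposite signs)
person = {"rs1": 2, "rs2": 1, "rs3": 0}     # allele dosages for one individual
score = polygenic_score(person, beta_eur, keep)
```

A real pipeline would also need allele alignment between the two GWAS before signs can be compared, which this sketch takes for granted.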

Genetic Epidemiology 50(1). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12820924/pdf/
Citations: 0
MELODY: Mediation Analysis in Logistic Regression for High-Dimensional Mediators and a Binary Outcome
IF 3.8, Zone 4 (Medicine), Q3 GENETICS & HEREDITY, Pub Date: 2026-01-20, DOI: 10.1002/gepi.70033
Sunyi Chi, Xingyu Li, Peng Wei, Xuelin Huang

Mediation analysis is a pivotal tool for elucidating the indirect effect of an environmental factor or treatment on disease through potentially high-dimensional omics data, such as gene expression profiles. However, traditional mediation analysis methods tailored for binary outcomes often rely on the rare disease assumption in logistic regression and provide inadequate measures of total mediation effect when multiple mediators have effects in different directions. In this paper, we develop a MEdiation analysis framework in LOgistic regression for high-Dimensional mediators and a binarY outcome (MELODY). It leverages a second-moment-based measure analogous to the $R^{2}$ for linear models to quantify the total mediation effect. We also develop a variable selection procedure for high-dimensional data to reduce bias introduced by non-mediators. Our comprehensive simulations demonstrate the superior performance of MELODY in scenarios with non-rare disease binary outcomes and high-dimensional mediators. We apply MELODY to the Framingham Heart Study of over 5000 individuals to analyze the mediation effects of metabolomics and transcriptomics data on the pathways from sex to incident coronary heart disease.

Genetic Epidemiology 50(1). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12820538/pdf/
Citations: 0
A Linear Mixed Model With Measurement Error Correction (LMM-MEC): A Method for Summary-Data-Based Multivariable Mendelian Randomization
IF 3.8, Zone 4 (Medicine), Q3 GENETICS & HEREDITY, Pub Date: 2026-01-19, DOI: 10.1002/gepi.70032
Ming Ding, Fei Zou

Summary-data-based multivariable Mendelian randomization (MVMR) methods, such as MVMR-Egger, MVMR-IVW, MVMR median-based, and MVMR-PRESSO, are used to assess the causal effects of multiple risk factors on disease. However, accounting for variances in the summary statistics of risk factors remains a challenge. We propose a linear mixed model with measurement error correction (LMM-MEC) that accounts for the variance in summary statistics for both disease outcomes and risk factors. First, under the NOME assumption, we apply a linear mixed model to account for variance in disease summary statistics by treating it as fixed- or random-effects, depending on whether there is heterogeneity in the effect sizes of the genetic variants on the disease outcome. Next, we relax the NOME assumption and further take the estimation error (or variance) in the summary statistics of risk factors into consideration by measurement models through a regression calibration approach. In a simulation study, using independent genetic variants as instrumental variables (IV), our method showed comparable performance to existing MVMR methods under conditions of no pleiotropy or with balanced pleiotropy on the disease outcome, and it achieved slightly improved coverage rates and power under directional pleiotropy. When genetic variants are in low to moderate linkage disequilibrium (LD) (0 < $\rho^{2}$ ≤ 0.3), our method showed comparable performance to MVMR-Egger, although both methods showed reduced coverage rates and power compared to situations where genetic variants as IVs are in LD. In the application study, we examined causal associations between correlated cholesterol biomarkers and longevity. By including 739 genetic variants selected based on p values < 5 × 10⁻⁵ from GWAS and allowing for low LD ($\rho^{2}$ ≤ 0.1), our method identified that large LDL-c levels were causally associated with a lower likelihood of achieving longevity.
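
For context, the baseline that LMM-MEC is compared against can be sketched. The block below is a minimal two-exposure MVMR-IVW estimator, not the authors' LMM-MEC: it regresses the variant-outcome associations on the variant-exposure associations with weights 1/se², without an intercept, and treats the exposure associations as error-free (the NOME assumption that the abstract says LMM-MEC relaxes). The function name and all numbers are hypothetical.

```python
def mvmr_ivw(bx1, bx2, by, se_y):
    """Two-exposure IVW estimate: weighted least squares of by on (bx1, bx2), no intercept."""
    w = [1.0 / s ** 2 for s in se_y]  # inverse-variance weights from the outcome GWAS
    s11 = sum(wi * a * a for wi, a in zip(w, bx1))
    s22 = sum(wi * b * b for wi, b in zip(w, bx2))
    s12 = sum(wi * a * b for wi, a, b in zip(w, bx1, bx2))
    s1y = sum(wi * a * y for wi, a, y in zip(w, bx1, by))
    s2y = sum(wi * b * y for wi, b, y in zip(w, bx2, by))
    # Solve the 2x2 weighted normal equations in closed form.
    det = s11 * s22 - s12 ** 2
    theta1 = (s22 * s1y - s12 * s2y) / det
    theta2 = (s11 * s2y - s12 * s1y) / det
    return theta1, theta2

# Hypothetical summary statistics for four variants and two exposures.
bx1 = [0.10, 0.20, 0.05, 0.30]
bx2 = [0.05, -0.10, 0.20, 0.10]
by = [0.5 * a - 0.2 * b for a, b in zip(bx1, bx2)]  # noise-free toy outcome effects
se_y = [0.010, 0.020, 0.015, 0.010]
theta1, theta2 = mvmr_ivw(bx1, bx2, by, se_y)
```

Because the toy outcome effects are built without noise, the estimator recovers the generating coefficients (0.5 and -0.2); with real summary statistics, ignoring the uncertainty in `bx1` and `bx2` is exactly the limitation the abstract highlights.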

Genetic Epidemiology 50(1). https://doi.org/10.1002/gepi.70032
Citations: 0
Large-Scale Genotype-Based Trait Imputation With Multi-Ancestry GWAS Data
IF 3.8, Zone 4 (Medicine), Q3 GENETICS & HEREDITY, Pub Date: 2026-01-15, DOI: 10.1002/gepi.70030
Jingchen Ren, Wei Pan

Genome-wide association studies (GWAS) have been instrumental in identifying genetic variants associated with complex traits and diseases, including Alzheimer's disease (AD). However, traditional GWAS approaches often focus on European populations, which may lead to loss of power and limit the generalizability of findings across diverse ancestries. On the other hand, LS-Imputation, a nonparametric trait imputation method, leverages GWAS summary statistics and genotype data to impute missing traits, which can then be used for GWAS and other downstream analyses. Although LS-Imputation has been applied successfully to European populations, its performance in non-European populations would be hindered by smaller sample sizes, leading to reduced imputation accuracy. To address these limitations, we propose two novel variants of LS-Imputation, LS-Imputation-Combined and LS-Imputation-Transfer, designed to integrate multi-ancestry GWAS data and enhance imputation performance. LS-Imputation-Combined optimally combines GWAS summary statistics from multiple ancestries, while LS-Imputation-Transfer sequentially refines imputed trait values across ancestries using stochastic gradient descent. We evaluate these methods using data from the UK Biobank and the Alzheimer's Disease Sequencing Project (ADSP), first applying them to high-density lipoprotein (HDL) cholesterol levels as a proof-of-concept before focusing on imputing AD status in Black individuals for genetic association analysis. Our results demonstrate that integrating multi-ancestry GWAS data improves trait imputation accuracy, with LS-Imputation-Transfer achieving the highest performance.

Genetic Epidemiology 50(1). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805644/pdf/
Citations: 0
Identifying Causal Genotype–Phenotype Relationships for Population-Sampled Parent–Child Trios
IF 3.8, Zone 4 (Medicine), Q3 GENETICS & HEREDITY, Pub Date: 2026-01-11, DOI: 10.1002/gepi.70027
Yushi Tang, Irineo Cabreros, John D. Storey

The process by which genes are transmitted from parent to child provides a source of randomization preceding all other factors that may causally influence any particular child phenotype. Because of this, it is natural to consider genetic transmission as a source of experimental randomization. In this work, we show how parent–child trio data can be leveraged to identify causal genetic loci by modeling the randomization during genetic transmission. We develop a new test, the transmission mean test (TMT), together with its unbiased estimator of the average causal effect, and derive its causal properties within the potential outcomes framework. We also prove that the transmission disequilibrium test (TDT) is a test of causality as a complementary case of the TMT for the affected-only design. The TMT and the TDT differ in the types of traits that they can handle and the study designs for which they are appropriate. The TMT handles arbitrarily distributed traits and is appropriate when trios are randomly sampled; the TDT handles dichotomous traits and is appropriate when sampling is based on a child's trait status. We compare the transmission-based methods with established approaches for genotype–phenotype analyses to clarify conditions appropriate for each method, what conclusions can be drawn by each one, and how these methods can be used together.
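
For readers unfamiliar with the TDT, its classical form can be sketched in a few lines. The abstract does not spell out the statistic, so this block assumes the standard McNemar-type version computed from heterozygous parents; the function name and the counts are hypothetical, and the new TMT is not reproduced here because the abstract does not define it.

```python
import math

def tdt(b, c):
    """Classical TDT statistic from heterozygous-parent transmission counts.

    b: number of times the test allele was transmitted to an affected child.
    c: number of times it was not transmitted.
    Returns the statistic and its p-value under a chi-square with 1 df.
    """
    stat = (b - c) ** 2 / (b + c)
    # For 1 df, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: 60 transmissions versus 40 non-transmissions.
stat, p = tdt(b=60, c=40)
```

Under Mendelian transmission each heterozygous parent passes the test allele with probability 1/2, so a marked imbalance between `b` and `c` is what the test detects.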

Genetic Epidemiology 50(1). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12793723/pdf/
引用次数: 0
Moderating the Heritability of Body Mass Index by Age and Sex With Genomic Data
IF 3.8 Zone 4 (Medicine) Q3 GENETICS & HEREDITY Pub Date: 2026-01-06 DOI: 10.1002/gepi.70026
Sarah E. Benstock, Elizabeth Prom-Wormley, Brad Verhulst

Environmental contexts may increase or decrease the heritability of a phenotype. Or equivalently, people with some genotypes may be more (or less) sensitive to environmental differences. Such genetic sensitivity to environmental contexts is called gene-environment interaction (G × E). While G × E has been robustly detected in twin and model organism studies for numerous phenotypes, there is a lingering perception that existing genome-wide G × E methods struggle to identify substantively significant and replicable interactions. We propose a novel method for examining G × E heritability using genetic marginal effects from genome-wide G × E analyses and Linkage Disequilibrium Score Regression (LDSC). We demonstrate the effectiveness of our method for body mass index (BMI) using biological sex (binary) and age (continuous) as moderators. Using the same procedures for both binary and continuous moderators, we detect robust evidence for G × E. Our results are consistent with findings from twin G × E studies of BMI and are more sensitive to environmental moderation than other LDSC-based methods. We conclude that BMI heritability is substantially more sensitive to variation in sex and age than is currently appreciated. Extending this method to other phenotype-moderator combinations has the potential to reveal G × E across numerous outcomes and moderators.

Archipelago Method for Variant Set Association Test Statistics
IF 3.8 Zone 4 (Medicine) Q3 GENETICS & HEREDITY Pub Date: 2026-01-06 DOI: 10.1002/gepi.70025
Dylan Lawless, Ali Saadat, Mariam Ait Oumelloul, Luregn J. Schlapbach, Jacques Fellay

Variant set association tests (VSAT), especially those incorporating rare variants via variant collapse, are invaluable in genetic studies. However, unlike Manhattan plots for single-variant tests, VSAT statistics lack intrinsic genomic coordinates, hindering visual interpretation. To overcome this, we developed the Archipelago method, which assigns a meaningful genomic coordinate to VSAT P values so that both set-level and individual variant associations can be visualised together. This results in an intuitive and information-rich illustration akin to an archipelago of clustered islands, enhancing the understanding of both collective and individual impacts of variants. We conducted three validation studies spanning simulated and real datasets across small and biobank-scale cohorts, from 504 individuals up to 490,640 UK Biobank participants. We integrated single-variant genome-wide association studies (GWAS) with gene- and protein pathway-level rare-variant collapse. These studies included the 1KG GWAS cohort, the Pan-UK Biobank GWAS with DeepRVAT WES gene-level study, and the UKBB WGS gene-level UTR collapsing PheWAS. The Archipelago plot is applicable in any genetic association study that uses variant collapse to evaluate both individual variants and variant sets, and its customisability facilitates clear communication of complex genetic data. Integrating at least two dimensions of genetic data into a single visualisation makes VSAT results easy to read and aids the identification of potential causal variants in variant sets such as protein pathways.

Variant Classification Using Proteomics-Informed Large Language Models Increases Power of Rare Variant Association Studies and Enhances Target Discovery
IF 3.8 Zone 4 (Medicine) Q3 GENETICS & HEREDITY Pub Date: 2025-11-03 DOI: 10.1002/gepi.70023
Christopher E. Gillies, Joelle Mbatchou, Lukas Habegger, Michael D. Kessler, Suying Bao, Suganthi Balasubramanian, Olivier Delaneau, Jack A. Kosmicki, Cristen J. Willer, Hyun Min Kang, Aris Baras, Jeffrey G. Reid, Jonathan Marchini, Gonçalo R. Abecasis, Maya Ghoussaini

Rare variant association analysis, which assesses the aggregate effect of rare damaging variants within a gene, is a powerful strategy for advancing knowledge of human biology. Numerous models have been proposed to identify damaging coding variants, with the most recent ones employing deep learning and large language models (LLMs) to predict the impact of changes in coding sequences. Here, we use newly available proteomics data on 2898 proteins across 46,665 individuals to evaluate and refine LLM predictors of damaging variants. Using one of these refined models, we evaluate the association between rare damaging variants and human phenotypes at 241 positive control gene-trait pairs. Among these gene-trait pairs, our proteomics-guided model outperforms an ensemble of conventional approaches including PolyPhen2, MutationTaster, SIFT, and LRT, as well as newer machine learning approaches for identifying damaging missense variants, such as CADD, ESM-1v, ESM-1b, and AlphaMissense. When attempting to recover known associations by correctly separating damaging singleton missense variants from other singleton variants, our approach recapitulates 36.5% of gene-trait pairs with known associations, exceeding all the alternatives we considered. Furthermore, when we apply our model to 10 example traits from the UK Biobank, we identify 177 gene–trait associations—again exceeding all other approaches. Our results demonstrate that summary statistics from large-scale human proteomics data enable evaluation and refinement of coding variant classification LLMs, improving discovery potential in human genetic studies.
