Improving genetic variant identification for quantitative traits using ensemble learning-based approaches.

Impact factor: 3.7 | Zone 2 (Biology) | JCR Q2, Biotechnology & Applied Microbiology | BMC Genomics | Pub Date: 2025-03-12 | DOI: 10.1186/s12864-025-11443-x

Jyoti Sharma, Vaishnavi Jangale, Rajveer Singh Shekhawat, Pankaj Yadav

BMC Genomics, volume 26, issue 1, article 237. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899862/pdf/


Abstract

Background: Genome-wide association studies (GWAS) are rapidly advancing due to the improved resolution and completeness provided by Telomere-to-Telomere (T2T) and pangenome assemblies. While recent advancements in GWAS methods have primarily focused on identifying genetic variants associated with discrete phenotypes, approaches for quantitative traits (QTs) remain underdeveloped. This has often led to significant variants being overlooked due to biases from genotype multicollinearity and strict p-value thresholds.
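To make the point about strict p-value thresholds concrete, here is a minimal sketch of the Bonferroni arithmetic behind the conventional genome-wide significance level. The function names and the figure of one million independent tests are illustrative assumptions, not values taken from the paper.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float, n_tests: int) -> bool:
    """True when a single-variant p-value survives the Bonferroni correction."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Assumption: roughly one million independent tests, the usual rationale
# for the conventional 5e-8 genome-wide threshold.
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"per-SNP threshold: {threshold:.1e}")

# A variant with p = 1e-6 would look convincing in a single test,
# yet it is discarded under the corrected threshold.
print(is_significant(1e-6, 0.05, 1_000_000))
```

A variant with a modest but real effect can easily fall between the nominal and the corrected level, which is the kind of loss the abstract attributes to strict thresholds.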

Results: We propose an enhanced ensemble learning approach for QT analysis that integrates regularized variant selection with machine learning-based association methods, validated through comprehensive biological enrichment analysis. We benchmarked four widely recognized single nucleotide polymorphism (SNP) feature selection methods (least absolute shrinkage and selection operator, ridge regression, elastic-net, and mutual information) alongside four association methods (linear regression, random forest, support vector regression (SVR), and XGBoost). Our approach was evaluated on simulated datasets and validated on a subset of the real PennCATH dataset, including imputed versions, with low-density lipoprotein (LDL) cholesterol level as the QT. The combination of elastic-net with SVR outperformed the other methods across all datasets. Functional annotation of the top 100 SNPs identified by this best-performing combination revealed their expression in tissues involved in LDL cholesterol regulation. We also confirmed the involvement of six known genes (APOB, TRAPPC9, RAB2A, CCL24, FCHO2, and EEPD1) in cholesterol-related pathways and identified potential drug targets, including APOB, PTK2B, and PTPN12.
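The regularized selection stage can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it runs a plain coordinate-descent elastic-net on simulated, independent genotypes, and every name, sample size, effect size, and penalty value below is an assumption chosen for illustration. The downstream association model (SVR in the best-performing combination) is not shown; it would be fit on the columns kept here.

```python
import random


def simulate_genotypes(n_samples, n_snps, maf, rng):
    """Additive genotype codes (0, 1, 2): two allele draws per SNP with frequency maf."""
    return [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
            for _ in range(n_samples)]


def center_columns(X):
    """Subtract each column's mean so that no intercept term is needed."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; exactly zero inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def elastic_net(X, y, alpha, l1_ratio, n_iter=200, tol=1e-8):
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + alpha*l1_ratio*||b||_1 + 0.5*alpha*(1 - l1_ratio)*||b||^2."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    col_ms = [sum(v * v for v in col) / n for col in cols]  # mean square per column
    l1, l2 = alpha * l1_ratio, alpha * (1.0 - l1_ratio)
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb; equals y while all coefficients are zero
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_ms[j] == 0.0:  # monomorphic SNP carries no signal
                continue
            col = cols[j]
            # Covariance of SNP j with the residual that excludes its own contribution.
            rho = sum(c * r for c, r in zip(col, resid)) / n + col_ms[j] * beta[j]
            new_beta = soft_threshold(rho, l1) / (col_ms[j] + l2)
            delta = new_beta - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new_beta
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


rng = random.Random(0)
n_samples, n_snps = 150, 40
causal_effects = {3: 1.0, 17: -0.8}  # hypothetical causal SNP indices and effects

G = simulate_genotypes(n_samples, n_snps, maf=0.3, rng=rng)
trait = [sum(effect * row[j] for j, effect in causal_effects.items()) + rng.gauss(0.0, 0.5)
         for row in G]

X = center_columns(G)
trait_mean = sum(trait) / n_samples
y = [t - trait_mean for t in trait]

beta = elastic_net(X, y, alpha=0.2, l1_ratio=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected SNP indices:", selected)
```

The L1 part of the penalty sets most coefficients exactly to zero, which is what makes the fit usable as a variant filter, while the L2 part stabilizes estimates when SNPs are correlated. Pure-Python loops like these are only practical for toy data; a real analysis would rely on an optimized library.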

Conclusions: Our ensemble learning approach effectively identifies variants associated with QTs, and we expect its performance to improve further with the integration of T2T and pangenome references in future GWAS.


Source journal: BMC Genomics (Biology: Biotechnology & Applied Microbiology)

CiteScore: 7.40

Self-citation rate: 4.50%

Articles published: 769

Review time: 6.4 months

About the journal: BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics. BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.