Improving genetic variant identification for quantitative traits using ensemble learning-based approaches.

IF 3.7 | Zone 2 (Biology) | Q2 BIOTECHNOLOGY & APPLIED MICROBIOLOGY | BMC Genomics | Pub Date: 2025-03-12 | DOI: 10.1186/s12864-025-11443-x
Jyoti Sharma, Vaishnavi Jangale, Rajveer Singh Shekhawat, Pankaj Yadav
BMC Genomics 26(1):237. Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899862/pdf/
Citations: 0

Abstract

Background: Genome-wide association studies (GWAS) are rapidly advancing due to the improved resolution and completeness provided by Telomere-to-Telomere (T2T) and pangenome assemblies. While recent advancements in GWAS methods have primarily focused on identifying genetic variants associated with discrete phenotypes, approaches for quantitative traits (QTs) remain underdeveloped. This has often led to significant variants being overlooked due to biases from genotype multicollinearity and strict p-value thresholds.

Results: We propose an enhanced ensemble learning approach for QT analysis that integrates regularized variant selection with machine learning-based association methods, validated through comprehensive biological enrichment analysis. We benchmarked four widely recognized single nucleotide polymorphism (SNP) feature selection methods (least absolute shrinkage and selection operator, ridge regression, elastic-net, and mutual information) alongside four association methods (linear regression, random forest, support vector regression (SVR), and XGBoost). The approach was evaluated on simulated datasets and validated on a subset of the real PennCATH dataset, including imputed versions, with low-density lipoprotein (LDL) cholesterol level as the QT. The combination of elastic-net with SVR outperformed the other methods across all datasets. Functional annotation of the top 100 SNPs identified by this best-performing combination revealed their expression in tissues involved in LDL cholesterol regulation. We also confirmed the involvement of six known genes (APOB, TRAPPC9, RAB2A, CCL24, FCHO2, and EEPD1) in cholesterol-related pathways and identified potential drug targets, including APOB, PTK2B, and PTPN12.

Conclusions: Our ensemble learning approach effectively identifies variants associated with QTs, and we expect its performance to improve further as T2T and pangenome references are integrated into future GWAS.
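
The best-performing combination reported above (elastic-net variant selection followed by SVR) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give their code, hyperparameters, or simulation design, so the use of scikit-learn, the toy additive genotype simulation, the parameter values, and the helper names `simulate_genotypes`, `build_pipeline`, and `selected_snp_indices` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def simulate_genotypes(n_samples=300, n_snps=500, n_causal=10, seed=0):
    """Toy additive model (assumed, not the paper's simulation):
    genotypes coded 0/1/2, a few causal SNPs, Gaussian noise."""
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.1, 0.5, size=n_snps)  # minor allele frequency per SNP
    X = rng.binomial(2, maf, size=(n_samples, n_snps)).astype(float)
    causal = rng.choice(n_snps, size=n_causal, replace=False)
    beta = np.zeros(n_snps)
    beta[causal] = rng.normal(0.0, 1.0, size=n_causal)
    y = X @ beta + rng.normal(0.0, 1.0, size=n_samples)
    return X, y, causal


def build_pipeline(alpha=0.1, l1_ratio=0.5, max_snps=100):
    """Elastic-net ranks SNPs by |coefficient|; the top `max_snps` feed an SVR.
    All hyperparameter values here are illustrative placeholders."""
    selector = SelectFromModel(
        ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000),
        threshold=-np.inf,      # select purely by rank, capped by max_features
        max_features=max_snps,
    )
    return make_pipeline(StandardScaler(), selector, SVR(kernel="linear", C=1.0))


def selected_snp_indices(fitted_pipeline):
    """Column indices of the SNPs kept by the elastic-net selection step."""
    support = fitted_pipeline.named_steps["selectfrommodel"].get_support()
    return np.flatnonzero(support)


if __name__ == "__main__":
    X, y, causal = simulate_genotypes()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    model = build_pipeline().fit(X_train, y_train)
    kept = selected_snp_indices(model)
    print("SNPs kept:", kept.size)
    print("causal SNPs recovered:", np.intersect1d(kept, causal).size)
    print("held-out R^2:", model.score(X_test, y_test))
```

Placing the selector inside the pipeline means variant selection is refit on each training split, which avoids leaking held-out samples into the selection step. Swapping `ElasticNet` or `SVR` for another estimator gives the other selection and association combinations the abstract lists.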

Source journal
BMC Genomics (Biology: Biotechnology & Applied Microbiology)
CiteScore: 7.40
Self-citation rate: 4.50%
Articles published: 769
Review time: 6.4 months
About the journal: BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics. BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.