Variable Selection in Heterogeneous Datasets: A Truncated-rank Sparse Linear Mixed Model with Applications to Genome-wide Association Studies.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine Pub Date : 2017-11-01 Epub Date: 2017-12-18 DOI:10.1109/BIBM.2017.8217687

Haohan Wang, Bryon Aragam, Eric P Xing

Volume 2017, pages 431-438. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5889139/pdf/nihms874620.pdf


Abstract

A fundamental and important challenge in modern datasets of ever increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of individual relationships in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our model.


