Empirical Performance of Cross-Validation With Oracle Methods in a Genomics Context.

IF 1.8 · Region 4 (Mathematics) · JCR Q1, Statistics & Probability · American Statistician · Pub Date: 2011-11-01 · DOI: 10.1198/tas.2011.11052
Josue G Martinez, Raymond J Carroll, Samuel Müller, Joshua N Sampson, Nilanjan Chatterjee
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3281424/pdf/nihms355303.pdf

Abstract

When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
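The variability described in the abstract can be illustrated with a small simulation. What follows is a minimal sketch, not the authors' procedure. It uses the plain Lasso (a non-oracle method, to which the abstract says similar remarks apply) fitted by a hand-written coordinate descent, and reruns m-fold cross-validation several times on one fixed data set with different random fold assignments. The data generator, the effect size, the minor-allele frequency, the lambda grid, and all function names (`simulate`, `lasso_cd`, `cv_num_selected`, `repeated_cv_counts`) are illustrative assumptions and do not come from the paper.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(cols, y, lam, max_sweeps=200, tol=1e-6):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by coordinate descent.

    `cols` stores X column-wise; no intercept is fitted (assumes centred data).
    """
    n, p = len(y), len(cols)
    beta = [0.0] * p
    resid = list(y)
    col_sq = [sum(v * v for v in c) / n for c in cols]
    for _ in range(max_sweeps):
        max_delta = 0.0
        for j, c in enumerate(cols):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            rho = sum(v * r for v, r in zip(c, resid)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for r, v in zip(resid, c)]
                beta[j] = new
                max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return beta

def cv_num_selected(cols, y, lambdas, m, rng):
    """One run of m-fold CV: choose lam by held-out squared error,
    refit on all data, and return the number of nonzero coefficients."""
    n = len(y)
    idx = list(range(n))
    rng.shuffle(idx)  # the random fold assignment is the only randomness here
    folds = [idx[k::m] for k in range(m)]
    cv_err = []
    for lam in lambdas:
        sse = 0.0
        for fold in folds:
            held = set(fold)
            train = [i for i in range(n) if i not in held]
            beta = lasso_cd([[c[i] for i in train] for c in cols],
                            [y[i] for i in train], lam)
            for i in fold:
                pred = sum(b * c[i] for b, c in zip(beta, cols))
                sse += (y[i] - pred) ** 2
        cv_err.append(sse / n)
    best = lambdas[cv_err.index(min(cv_err))]
    return sum(1 for b in lasso_cd(cols, y, best) if b != 0.0)

def centre(values):
    """Subtract the sample mean so that no intercept is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def simulate(n=80, p=20, n_signal=3, effect=0.3, maf=0.3, seed=1):
    """Hypothetical SNP-like data: minor-allele counts (0/1/2), a few weak effects."""
    rng = random.Random(seed)
    raw = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
           for _ in range(p)]
    cols = [centre(c) for c in raw]
    y = [sum(effect * cols[j][i] for j in range(n_signal)) + rng.gauss(0, 1)
         for i in range(n)]
    return cols, centre(y)

def repeated_cv_counts(cols, y, m=10, repeats=5, n_lambda=6, seed=0):
    """Rerun m-fold CV on the same data with fresh random fold assignments."""
    n = len(y)
    lam_max = max(abs(sum(v * t for v, t in zip(c, y))) / n for c in cols)
    # Geometric grid from lam_max (empty model) down to 5% of lam_max.
    lambdas = [lam_max * 0.05 ** (k / (n_lambda - 1)) for k in range(n_lambda)]
    rng = random.Random(seed)
    return [cv_num_selected(cols, y, lambdas, m, rng) for _ in range(repeats)]
```

Calling `repeated_cv_counts(*simulate())` returns one count of selected variables per cross-validation run. Because the data are held fixed, any spread in those counts comes only from the random fold assignment, which is the source of variation the abstract draws attention to. How large the spread is depends on the assumed signal strength and sample size.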
