When is the first spurious variable selected by sequential regression procedures?

Cited by: 9
Author
Su, Weijie J. [1]
Affiliation
[1] Univ Penn, Dept Stat, 472 John M Huntsman Hall, 3730 Walnut St, Philadelphia, PA 19104 USA
Funding
National Science Foundation (USA);
Keywords
False variable; Familywise error rate; Forward stepwise regression; Lasso; Least angle regression; FALSE DISCOVERY RATE; LASSO; CONSISTENCY; KNOCKOFFS; RECOVERY; PATH;
DOI
10.1093/biomet/asy032
Chinese Library Classification number
Q [Biological sciences];
Discipline classification code
07; 0710; 09;
Abstract
Applied statisticians use sequential regression procedures to rank explanatory variables and, in settings of low correlations between variables and strong true effect sizes, expect that variables at the top of this ranking are truly relevant to the response. In a regime of certain sparsity levels, however, we show that the lasso, forward stepwise regression, and least angle regression include the first spurious variable unexpectedly early. We derive a sharp prediction of the rank of the first spurious variable for these three procedures, demonstrating that it occurs earlier and earlier as the regression coefficients become denser. This phenomenon persists for statistically independent Gaussian random designs and arbitrarily large true effects. We gain insight by identifying the underlying cause and then introduce a simple visualization tool termed the double-ranking diagram to improve on these methods. We obtain the first result establishing the exact equivalence between the lasso and least angle regression in the early stages of solution paths beyond orthogonal designs. This equivalence implies that many important model selection results concerning the lasso can be carried over to least angle regression.
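The setting described in the abstract can be explored with a small simulation. The following is a minimal sketch, not the paper's code and not its sharp prediction: it runs plain forward stepwise regression (one of the three procedures named above) on an independent Gaussian design and reports the step at which the first null variable enters. The function name `first_spurious_rank`, its parameters (`n`, `p`, `k`, `effect`, `seed`), the column scaling, and the unit noise variance are all assumptions made for illustration.

```python
import math
import random


def first_spurious_rank(n, p, k, effect, seed=0):
    """Return the 1-based forward-stepwise step at which the first null
    (spurious) variable is selected, or None if none is selected.

    Hypothetical setup: i.i.d. Gaussian design with columns of roughly unit
    norm, the first k coefficients equal to `effect`, and N(0, 1) noise.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n)
    # Design matrix stored column-wise; entries are N(0, 1/n).
    cols = [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(p)]
    # Response y = X beta + noise; the residual starts out equal to y.
    resid = [
        effect * sum(cols[j][i] for j in range(k)) + rng.gauss(0.0, 1.0)
        for i in range(n)
    ]
    active = set()
    for step in range(1, min(n, p) + 1):
        # Pick the variable whose orthogonalized column best explains the
        # residual, i.e. the one giving the largest drop in residual sum of squares.
        best_j, best_score = None, -1.0
        for j in range(p):
            if j in active:
                continue
            norm2 = sum(v * v for v in cols[j])
            if norm2 < 1e-12:
                continue
            score = abs(sum(r * v for r, v in zip(resid, cols[j]))) / math.sqrt(norm2)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            return None
        if best_j >= k:
            return step  # indices >= k are null variables
        # Gram-Schmidt update: project the selected direction out of the
        # residual and out of every remaining column.
        norm = math.sqrt(sum(v * v for v in cols[best_j]))
        q = [v / norm for v in cols[best_j]]
        active.add(best_j)
        r_dot = sum(r * v for r, v in zip(resid, q))
        resid = [r - r_dot * v for r, v in zip(resid, q)]
        for j in range(p):
            if j in active:
                continue
            c_dot = sum(c * v for c, v in zip(cols[j], q))
            cols[j] = [c - c_dot * v for c, v in zip(cols[j], q)]
    return None


if __name__ == "__main__":
    # Compare the rank of the first spurious variable across sparsity levels.
    for k in (5, 10, 20):
        ranks = [first_spurious_rank(n=80, p=80, k=k, effect=50.0, seed=s) for s in range(3)]
        print(k, ranks)
```

Varying `k` while holding `effect` large gives a rough feel for how the rank of the first spurious variable moves with sparsity. A pure standard-library implementation is used here only to keep the sketch self-contained; it is slow for large `n` and `p`.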
Pages: 517-527
Page count: 11