Stepwise regression (along with stepwise discriminant
function analysis, stepwise logistic regression, and
all-possible-subsets selection) has been
beaten to death in discussions on the .stats
Usenet groups: a lot of posts, strongly negative,
from multiple points of view.
This comes as a surprise to users whose introduction
has come partly from stat-packages which make the
options seem easy and appealing - if these users
have not been at it long enough to become disillusioned.
Included below are three posts that cover
a good range of objections and offer pertinent
references. Much more could be said. In fact, I
will say more in the Comments section of the FAQ,
which will include some mention of when it actually
can work (high R-squared, no inference-making).
The References portion of the FAQ includes some on
alpha-inflation. If more references are desired on
the subject, or more Postings from other users (I
find it convincing to hear it in multiple voices),
please let me know.
//Rich Ulrich// Jan 24, 1997
======================Frank Harrell Jr, 19 Feb 1996======ssc
Frank E Harrell Jr feh@biostat.mc.duke.edu
Associate Professor of Biostatistics
Division of Biometry Duke University Medical Center
----------------------------------------------------------------------
Subject: Reasons not to do stepwise (or all possible regressions)
Message-ID: <4gailc$cc0@news.duke.edu>
I post this every few months. I hope it helps.
Here are SOME of the problems with stepwise variable selection.
1. It yields R-squared values that are badly biased high
2. The F and chi-squared tests quoted next to each variable on the
printout do not have the claimed distribution
3. The method yields confidence intervals for effects and predicted
values that are falsely narrow (see Altman and Andersen, Stat in Med)
4. It yields P-values that do not have the proper meaning and the
proper correction for them is a very difficult problem
5. It gives biased regression coefficients that need shrinkage
(the coefficients for remaining variables are too large;
see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity
7. It is based on methods (e.g. F tests for nested models) that were
intended to be used to test pre-specified hypotheses.
8. Increasing the sample size doesn't help very much (see
Derksen and Keselman)
9. It allows us to not think about the problem
10. It uses a lot of paper
Note that 'all possible subsets' regression does not solve any of these
problems.
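Points 1, 2 and 4 can be seen in a small simulation. The following is a minimal sketch, not taken from the original post: it runs a plain forward-selection loop on pure noise, so the true R-squared is zero and no predictor is "authentic". The F-to-enter cutoff of 4.0, the sample size, the number of candidates, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-squared from an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward selection: at each step add the candidate that raises
    R-squared the most, and stop when its partial F falls below f_to_enter
    (an assumed cutoff, roughly the nominal 5% level for one pre-specified test)."""
    n, p = X.shape
    chosen, r2_now = [], 0.0
    while len(chosen) < p:
        candidates = [j for j in range(p) if j not in chosen]
        r2_by_j = {j: r_squared(X[:, chosen + [j]], y) for j in candidates}
        best_j = max(r2_by_j, key=r2_by_j.get)
        best_r2 = r2_by_j[best_j]
        df_resid = n - (len(chosen) + 1) - 1
        f_stat = (best_r2 - r2_now) / ((1.0 - best_r2) / df_resid)
        if f_stat < f_to_enter:
            break
        chosen.append(best_j)
        r2_now = best_r2
    return chosen, r2_now

def simulate(n_reps=30, n=100, p=50, seed=0):
    """Average number of noise variables selected and average apparent R-squared."""
    rng = np.random.default_rng(seed)
    counts, r2s = [], []
    for _ in range(n_reps):
        X = rng.normal(size=(n, p))
        y = rng.normal(size=n)  # independent of X, so the true R-squared is 0
        chosen, r2 = forward_select(X, y)
        counts.append(len(chosen))
        r2s.append(r2)
    return float(np.mean(counts)), float(np.mean(r2s))

mean_selected, mean_r2 = simulate()
print("mean number of noise variables selected:", mean_selected)
print("mean apparent R-squared (true value is 0):", mean_r2)
```

Every variable that enters here does so with a partial F above the cutoff, which is why the F tests and P-values printed next to the "winners" cannot be read as if each had been a single pre-specified test.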
References
----------
@article{alt89,
author = "Altman, D. G. and Andersen, P. K.",
journal = "Statistics in Medicine",
pages = "771-783",
title = "Bootstrap investigation of the stability of a {C}ox
regression model",
volume = "8",
year = "1989"
Shows that stepwise methods yield confidence limits that are far too narrow.
}
@article{der92bac,
author = {Derksen, S. and Keselman, H. J.},
journal = {British Journal of Mathematical and Statistical Psychology},
pages = {265-282},
title = {Backward, forward and stepwise automated subset selection algorithms: {F}requency of obtaining authentic and noise variables},
volume = {45},
year = {1992},
annote = {variable selection}
Conclusions:
``The degree of correlation between the predictor variables affected
the frequency with which authentic predictor variables found their way
into the final model.
The number of candidate predictor variables affected the number of
noise variables that gained entry to the model.
The size of the sample was of little practical importance in
determining the number of authentic variables contained in the final
model.
The population multiple coefficient of determination could be
faithfully estimated by adopting a statistic that is adjusted by
the total number of candidate predictor variables rather than the
number of variables in the final model''.
}
@article{roe91pre,
author = {Roecker, Ellen B.},
journal = {Technometrics},
pages = {459-468},
title = {Prediction error and its estimation for subset--selected models},
volume = {33},
year = {1991}
Shows that all-possible regression can yield models that are "too small".
}
@article{man70why,
author = {Mantel, Nathan},
journal = {Technometrics},
pages = {621-625},
title = {Why stepdown procedures in variable selection},
volume = {12},
year = {1970},
annote = {variable selection; collinearity}
}
@article{hur90,
author = "Hurvich, C. M. and Tsai, C. L.",
journal = "American Statistician",
pages = "214-217",
title = "The impact of model selection on inference in linear regression",
volume = "44",
year = "1990"
}
@article{cop83reg,
author = {Copas, J. B.},
journal = "Journal of the Royal Statistical Society B",
pages = {311-354},
title = {Regression, prediction and shrinkage (with discussion)},
volume = {45},
year = {1983},
annote = {shrinkage; validation; logistic model}
Shows why the number of CANDIDATE variables and not the number in the
final model is the number of d.f. to consider.
}
@article{tib96reg,
author = {Tibshirani, Robert},
journal = "Journal of the Royal Statistical Society B",
pages = {267-288},
title = {Regression shrinkage and selection via the lasso},
volume = {58},
year = {1996},
annote = {shrinkage; variable selection; penalized MLE; ridge regression}
}
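The annotations on Derksen and Keselman and on Copas both say that the relevant count is the number of CANDIDATE variables, not the number that survive selection. As a hedged illustration, the sketch below applies the usual adjusted R-squared formula twice, once with each count. Treating this textbook formula as the statistic those authors studied is an assumption, and the numbers (n = 200, 20 candidates, 5 selected, apparent R-squared of 0.15) are made up for the example.

```python
def adjusted_r2(r2, n, k):
    """Usual adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where k is the number of predictors charged against the fit."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

n_obs = 200          # hypothetical sample size
n_candidates = 20    # hypothetical number of variables offered to the procedure
n_selected = 5       # hypothetical number left in the final model
apparent_r2 = 0.15   # hypothetical R-squared printed for the final model

# Charging only the selected variables, as the printout usually does.
adj_by_selected = adjusted_r2(apparent_r2, n_obs, n_selected)
# Charging every candidate that was screened.
adj_by_candidates = adjusted_r2(apparent_r2, n_obs, n_candidates)

print("adjusted by number selected:  ", round(adj_by_selected, 4))
print("adjusted by number of candidates:", round(adj_by_candidates, 4))
```

The second figure is always the smaller of the two whenever more variables were screened than kept, which is the direction of correction these references argue for.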
==========================Ira Bernstein, 29 Apr 1996===========sse
Message-ID:
From: "IRA H BERNSTEIN"
Subject: Re: When should Stepwise reg be used?
Mark Eakin (who is also a distinguished
colleague of mine at UT-Arlington) stated:
> I just reviewed another text that included stepwise regression. Under
> what conditions should stepwise be used? It is not a confirmatory analysis
> technique as far as I am concerned. If I wish to explore a set of data, then
> I would use all possible regressions. If there are too many variables for my
> PC to calculate all possible regressions, I would still like to see as many
> models as possible. (For example, the MAXR option in SAS would attempt to give
> me the best one-variable, the best two-variable model,..., to the best
> p-variable model.) What am I missing in my logic? Why is stepwise so popular?
I think that there are two distinct questions here: (a) _when_ is
stepwise selection appropriate and (b) _why_ is it so popular.
Since I have seen some variation in usage of the term "stepwise", I
define it as any of a number of _data_ driven variable selection
schemes used in regression and discriminant analysis, among other
applications. Some, inappropriately IMHO (since there is no official
body to define "appropriate"), use it to describe what I would call
hierarchical (_hypothesis_ driven) selection. Like many others, I
assume, I would discourage stepwise selection and encourage
hierarchical selection. I, of course, assume the researcher does
not "cheat" by defining his/her "hierarchy" given the data but does
so by considering alternatives in advance of analysis and,
preferably, replicates the study (dream on).
I would probably only argue slightly with "never" as an answer to the
use of stepwise selection since I don't know what knowledge we would
lose if all papers using stepwise regression were to vanish from
journals at the same time programs providing their use were to become
terminally virus-laden. However, I have been in situations that
looked like "I have good reason to look at variables A, B, and C;
then look at D, and E, but I have no basis to favor F over G or vice
versa past that point." Older versions of SPSS (I haven't used newer
versions since switching to SAS a decade ago) allowed this mixture,
and I would personally not object to it as long as the strategy were
defined in advance and made clear to readers.
As to part (b), I think that there are two groups that are inclined
to favor its usage. One consists of individuals with little formal
training in data analysis who confuse knowledge of data analysis
with knowledge of the syntax of SAS, SPSS, etc. They seem to figure
that "if it's there in a program, it's gotta be good and better than
actually thinking about what my data might look like". They are
fairly easy to spot and to condemn in a right-thinking group of
well-trained data analysts (like ourselves). However, there is also
a second group who are often well trained (and may be here in this
group ready to flame me). They believe in statistics uber
alles--given any properly obtained data base, a suitable computer
program can objectively make substantive inferences without active
consideration of the underlying hypotheses. If stepwise selection
is the parent of this line of blind data analysis, then automatic
variable respecification in confirmatory factor analysis is the
child.
Ira H. Bernstein
Professor of Psychology
UT-Arlington
P. O. Box 19528
Arlington, TX 76019-0528
(817) 272-3183
===========================Ronan M Conroy 7/5/96============sse
Message-ID:
From: rconroy@rcsi.ie (Ronan M Conroy)
Subject: Re: When should Stepwise reg be used?
>> What am I missing in my logic? Why is stepwise so popular?
Joelle>
> "You use stepwise regression to evaluate the effect of one
> variable, provided that another one is controlled. Suppose you
> want to know how reading skills influence a test of verbal memory.
> You need to control the effect of age before; otherwise you won't
> know if you evaluate reading skill or age effects. Then, you enter
> age as first variable, reading skills in second; the
> regression weight you get for reading skill can not be due to age.
> This is the point.
> "Hope I was clear!"
I should point out that this misunderstands a) linear models and b) stepwise
methods.
Confounding variables, such as age, are controlled in a linear model. What
Joelle seems to be advocating is type I sums of squares, not a stepwise
model.
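The distinction can be made concrete with simulated data. This is a minimal sketch with invented variables and effect sizes (age, reading skill, and a verbal memory score), not anything from the original thread. It shows that the coefficient for reading in a model containing both predictors is already adjusted for age whatever the order of entry, while the sequential (type I) sum of squares credited to reading does depend on the order.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.normal(10, 2, n)                       # hypothetical ages
reading = 0.8 * age + rng.normal(0, 1, n)        # reading skill rises with age
memory = 1.0 * age + 0.5 * reading + rng.normal(0, 1, n)

def fit(cols, y):
    """OLS with an intercept; returns coefficients and residual sum of squares."""
    A = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return beta, rss

# Same model, two orders of entry: the reading coefficient does not change.
beta_age_first, rss_full = fit([age, reading], memory)
beta_reading_first, _ = fit([reading, age], memory)
coef_reading_a = beta_age_first[2]
coef_reading_b = beta_reading_first[1]

# Sequential (type I) sums of squares for reading under each order.
_, rss_null = fit([], memory)
_, rss_age = fit([age], memory)
_, rss_reading = fit([reading], memory)
ss_reading_entered_first = rss_null - rss_reading
ss_reading_after_age = rss_age - rss_full

print("reading coefficient, age entered first:    ", coef_reading_a)
print("reading coefficient, reading entered first:", coef_reading_b)
print("type I SS for reading, entered first:", ss_reading_entered_first)
print("type I SS for reading, after age:    ", ss_reading_after_age)
```

Neither calculation involves any data-driven choice of which variables to keep, which is why fixing the order of entry in advance is not what is usually meant by a stepwise method.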
I am struck by the fact that Judd and McClelland in their excellent book
"Data Analysis: A Model Comparison Approach" (Harcourt Brace Jovanovich,
ISBN 0-15-516765-0) devote less than 2 pages to stepwise methods. What they
do say, however, is worth repeating:
1. Stepwise methods will not necessarily produce the best model if there
are redundant predictors (common problem).
2. All-possible-subset methods produce the best model for each possible
number of terms, but larger models need not necessarily be subsets of
smaller ones, causing serious conceptual problems about the underlying
logic of the investigation.
3. Models identified by stepwise methods have an inflated risk of
capitalising on chance features of the data. They frequently fail when
applied to new datasets. They are rarely tested in this way.
4. Since the interpretation of coefficients in a model depends on the other
terms included, "it seems unwise," to quote J and McC, "to let an automatic
algorithm determine the questions we do and do not ask about our data". RC
adds that stepwise methods abusers frequently would rather not think about
their data, for reasons that are funny to describe over a second Guinness.
5. I quote this last point directly, as it is sane and succinct:
"It is our experience and strong belief that better models and a better
understanding of one's data result from focussed data analysis, guided by
substantive theory." (p 204)
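Points 1 and 2 can be illustrated with a redundant predictor. The sketch below is built on invented data, not on anything in Judd and McClelland: x3 is constructed as a noisy copy of x1 + x2, so it is the best single predictor, yet the best two-variable model leaves it out. The variable names, noise levels, and seed are assumptions.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.3 * rng.normal(size=n)   # redundant: a noisy sum of x1 and x2
y = x1 + x2 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def r_squared(cols):
    """R-squared of y regressed on the given columns of X plus an intercept."""
    A = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# All-possible-subsets: best model of each size (columns indexed 0, 1, 2).
best = {size: max(combinations(range(3), size), key=r_squared)
        for size in (1, 2, 3)}

# A forward step from the best single predictor can only reach pairs containing it.
forward_pairs = [c for c in combinations(range(3), 2) if best[1][0] in c]
best_forward_pair = max(forward_pairs, key=r_squared)

for size, cols in best.items():
    print("best subset of size", size, ":", cols, "R-squared:", r_squared(cols))
print("best pair reachable by forward selection:", best_forward_pair,
      "R-squared:", r_squared(best_forward_pair))
```

Here the best one-variable model is not a subset of the best two-variable model, and a forward procedure that starts from the best single predictor never reaches the best pair.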
They end with a quote from Henderson and Velleman's paper "Building
multiple regression models interactively". Biometrics 1981;37:391-411
"The data analyst knows more than the computer"
and add "failure to use that knowledge produces inadequate data analysis."
Personally, I would no more let an automatic routine select my model than I
would let some best-fit procedure pack my suitcase.
Ronan M Conroy
Lecturer in Biostatistics
Royal College of Surgeons
Dublin 2, Ireland
+353 1 402 2431 fax 402 2329
all things are pure to the pure: all things are simple to the stupid
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html