- file 96stepw.html ->
more words on Stepwise (1996).
******************* in addition to 3 originals **********
Stepwise? (numerous comments)
=========================Rich Ulrich, 15 May 1995============(spss)
Not long ago, I wrote, on the subject of Stepwise Regression -
: The manual makes reference to a lot of
: options, and yes, the good advice is, `DON'T USE THEM!' I don't know if
: the suggestion about a hazard warning was meant to be tongue-in-cheek,
: but it seems to me to be about the proper level of discouragement that
: is DESERVING for stepwise techniques. Seriously, folks.
Since I don't think the manuals are going to take this step that a few
people would be offended by, I would like to offer suggestions that
may be more constructive:
1) Let the manuals give an emphasis to showing how to ENTER sets of
variables as blocks, obtaining a test statistic on the variance that is
added. One version of this is simply keeping track of a categorical
variable that is dummy-coded - I think that BMDP does that. More simply,
the manuals could promote the ENTER strategy as one that is standard for
obtaining certain ANOVA tests. Also,
2) Show explicitly how to look at the `partial contribution' of several
variables WITHOUT entering them into the equation which is accounting for
design factors, etc.
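As a rough illustration of suggestion 1: the test statistic for a block of variables entered together is the usual F test on the change in R-squared. The sketch below is only that arithmetic, written in Python. The function name and the example numbers are hypothetical, and it does not reproduce the output of SPSS or BMDP.

```python
def f_for_added_block(r2_before, r2_after, n_cases, n_predictors_after, n_added):
    """F statistic for the variance added by a block of n_added predictors.

    r2_before          -- R-squared of the model without the block
    r2_after           -- R-squared of the model with the block entered
    n_predictors_after -- predictors in the larger model (intercept not counted)
    """
    df_num = n_added
    df_den = n_cases - n_predictors_after - 1
    if df_den <= 0:
        raise ValueError("no residual degrees of freedom left")
    f_stat = ((r2_after - r2_before) / df_num) / ((1.0 - r2_after) / df_den)
    return f_stat, df_num, df_den


# Hypothetical numbers: 3 design variables give R-squared = 0.30, and
# entering a block of 2 more raises it to 0.40, with 100 cases.
f_stat, df_num, df_den = f_for_added_block(
    0.30, 0.40, n_cases=100, n_predictors_after=5, n_added=2
)
print(f"F({df_num}, {df_den}) = {f_stat:.2f}")  # prints F(2, 94) = 7.83
```

The resulting F is referred to an F distribution with (df_num, df_den) degrees of freedom, which tests the whole block at once rather than one variable at a time.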
=======================John P. Ball, 23 Apr 1996=============ssc,sse
From: jpb@szooek.slu.se
Subject: Re: Subject/variable ratio for regression?
Message-ID: <4lihb4$1d1@populus.slu.se>
>I tried to do some regression analysis of the data we collected a while
>ago. The experiment includes 2 sections: 1. Physical measurements:
>total 20 measurements; 2. Dynamic data collection looking at the effect
>of walking speed on foot plantar pressure distribution. In section 2, we
>collected 3 trials under each walking speed. The total subject number
>was 20. My question is how many variables can be included in the
>step-wise regression analysis? Also, if there were missing trials, say 5
>subjects did not have certain measurement, how many variables can be
>included in the step-wise regression?
>I am not a regular reader of this group, please reply to my email
>address. I will compile the replies if others are interested.
I'll send you an email copy, but I'd humbly suggest that perhaps you
_should_ consider reading the group regularly... you would learn some
things. I still do.
Regarding the number of variables that you can include in stepwise
regression in your case: ZERO (well, OK, one -- but then it is not
stepwise regression but simple correlation). Sorry to be the
bearer of bad news. I'm afraid that to have any real hope of a
"reliable" answer you need to go get a LOT more subjects if you really
want to go on a fishing expedition and examine anything like 20
variables.
I recently gave a seminar to graduate students on pretty much exactly
your question, but I won't bother with all the multitude of
references here. In the interests of brevity, I refer you to
Tabachnick and Fidell, 1983, Using Multivariate Statistics. Harper
and Row, New York, pages 91-92, where it is put forth quite clearly:
"Ideally one would have 20 times more cases than variables. If
stepwise regression is to be used, a procedure that is notorious for
capitalizing on chance, a case-to-variable ratio of 40 to 1 would be
appropriate. "
Years ago, (as a grad student) I was assigned a similar problem:
assess the needed case-to-variable ratio for "reliable" stepwise
multiple regression. My computer simulations indicated about 50 was
the minimum for the best algorithm, and I was suitably impressed at
how bad all stepwise algorithms were (I even used
all-possible-subsets). Even with 80 to 1 ratios, the results were not
particularly heartening. Today, I occasionally use multiple
regression, but all the variables _stay_ in the model (NO stepwise
ever, of any algorithm). Obviously, the degrees of freedom then
directly limit the number of variables that you can include in the
general linear model/regression model. Try too many and your SS will
be undefined (or zero, depending on your stat program).
Hope that this is useful information (even if it is NOT what you
wanted to hear). Sorry for the typos -- I'm rushing today...
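The arithmetic behind the "ZERO" above can be sketched in a few lines of Python. It assumes the 20:1 and 40:1 ratios from the Tabachnick and Fidell quotation and the 20 subjects (15 with complete data) of the original question. The function names are made up for illustration.

```python
def max_variables(n_cases, cases_per_variable):
    """Largest whole number of variables that keeps the stated case-to-variable ratio."""
    return n_cases // cases_per_variable


def residual_df(n_cases, n_predictors):
    """Residual degrees of freedom for a regression that includes an intercept."""
    return n_cases - n_predictors - 1


print(max_variables(20, 40))  # 0: the stepwise rule of thumb allows no variables
print(max_variables(20, 20))  # 1: the "ideal" ratio allows a single variable
print(max_variables(15, 20))  # 0: with 5 subjects missing, not even one
print(residual_df(20, 20))    # -1: all 20 measurements at once leave no error df
```

A residual degrees-of-freedom value of zero or less is the situation in which the error sum of squares is zero or undefined.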
========================Paul Velleman, 8 April 1996===========ssc
From: pfv2@cornell.edu (Paul Velleman)
Subject: Re: Variable Reduction
In article <4ka18h$40h@news1.h1.usa.pipeline.com>, jzhong@usa.pipeline.com
wrote:
> I am building a predictive model. The dependent variable is binary. I have
> about 400 block group level census variables as predictor variables. I plan
> to perform principal components analysis on the predictor variables and then
> apply logistic regression on a few selected principal components.
>
> Among the 400 predictor variables, many of them are not significantly
> related to the dependent variable. So I would like to eliminate these
> insignificant variables before I do principal components analysis. What
> would be a good way to do this?
If you know that these variables are not significantly related to the
dependent variable, and if your goal is prediction, what is wrong with
simply deciding to omit the variables?
You *are* allowed to apply thought to statistical analyses; it needn't be
all automated computation.
===================Jesse A. Canchola, 15 Apr 1996=============ssc
From: adminjc@psg.ucsf.edu (Jesse A. Canchola)
Subject: Re: Variable Reduction
Message-ID: <4ku0p6$10eo@itssrv1.ucsf.edu>
On the subject of using stepwise regression for selecting variables
for you, check out Flack and Chang's article, "Frequency of
Selecting Noise Variables in Subset Regression Analysis: A
Simulation Study" found in _The American Statistician_, February
1987, Volume 41, No. 1. It pretty much substantiates Rich Ulrich's
response.
They mention how you should not use the results of such a stepwise
regression as the basis of any conclusions with respect to the subject
matter at hand. However, if the results of the subset selection are
"confirmed" and/or "validated" by other data sets, you will probably
be ok. They go on to say that, "Such confirmation and validation are
especially important when the number of candidate variables is large
and a priori knowledge about their relationships to the response
variables are not clear."
You might also want to look at David Freedman's "A Note on Screening
Regression Equations", _The American Statistician_, May 1983, Vol 37,
No. 2.
Good luck!
==========================William Ware, 30 Apr 1996===========sse
Message-ID:
From: "William B. Ware" <wbware@unc.edu>
Subject: Re: When should Stepwise reg be used?
On Mon, 29 Apr 1996, IRA H BERNSTEIN wrote:
> I think that there are two distinct questions here: (a) _when_ is
> stepwise selection appropriate and (b) _why_ is it so popular.
I agree with most of what Professor Bernstein wrote in his original
message. However, I do think that "stepwise" regression (in which the
computer algorithm selects the variables) does have a place in our
statistics tool kits. However, that place is extremely limited!
When we are willing to throw all rights to interpretation to the winds
and when our primary goal is to develop an efficient predictive
mechanism using a small set of variables selected from some larger set,
then stepwise techniques are OK. The principal place in which this
limited application might be appropriate is in personnel selection
(e.g., college admissions).
=====================Ira H Bernstein, 30 Apr 1996============sse
Message-ID:
From: "IRA H BERNSTEIN"
Subject: Re: When should Stepwise reg be used?
"William B. Ware" noted, in response to my
original (largely negative) posting about stepwise regression:
> I agree with most of what Professor Bernstein wrote in his original
> message. However, I do think that "stepwise" regression (in which the
> computer algorithm selects the variables) does have a place in our
> statistics tool kits. However, that place is extremely limited!
>
> When we are willing to throw all rights to interpretation to the winds
> and when our primary goal is to develop an efficient predictive
> mechanism using a small set of variables selected from some larger set,
> then stepwise techniques are OK. The principal place in which this
> limited application might be appropriate is in personnel selection
> (e.g., college admissions).
Note that I had said the following in my original posting:
>I would probably only argue slightly with "never" as an answer to the
>use of stepwise selection since I don't know what knowledge we would
>lose if all papers using stepwise regression were to vanish from
>journals at the same time programs providing their use were to become
>terminally virus-laden. However, I have been in situations that
>looked like "I have good reason to look at variables A, B, and C;
>then look at D and E, but I have no basis to favor F over G or vice
>versa past that point." Older versions of SPSS (I haven't used newer
>versions since switching to SAS a decade ago) allowed this mixture,
>and I would personally not object to it as long as the strategy were
>defined in advance and made clear to readers.
I therefore don't think that Prof. Ware and I are in any disagreement
as I believe we are both saying "not often, but sometimes".
Ira H. Bernstein
Professor of Psychology
UT-Arlington
P. O. Box 19528
Arlington, TX 76019-0528
(817) 272-3183
==========================Jerry Dallal, 30 Apr 1996===========sse
From: jerry@mint.hnrc.tufts.edu (Jerry Dallal)
Message-ID: <1996Apr30.103720@mint.hnrc.tufts.edu>
In article <960429.144610.CDT.B118MEE@UTARLVM1>, Mark Eakin writes:
> Why is stepwise so popular?
Because it gives the appearance of objectivity.
(Please do not interpret this comment as a statement for or against
the use of the technique.)
==========================Kent Campbell, 30 Apr 1996=========sse
From: campbell@acs.ryerson.ca (Kent Campbell)
Subject: Re: When should Stepwise reg be used?
Message-ID: <4m5lh0$i7g@ns2.ryerson.ca>
Hi -
try generating some random data sets and then analyzing them
with stepwise regression. It is quite likely that you will discover all
sorts of "significant" relationships. I have done this in a controlled
manner and found that the type 1 error (using the default settings in
spss) is much higher than 5%. So one reason why stepwise is so popular
is that it produces statistically significant results when fed garbage.
Best wishes,
Kent.
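The experiment described above can be approximated with a short Python sketch. It is a simplification, not a reproduction of the SPSS run: it simulates only the first step of forward selection (pick the noise predictor most correlated with a noise response, then test it at a nominal 5%), and the sample size, number of candidates, seed, and hardcoded critical value are all assumptions chosen for illustration.

```python
import math
import random

N_CASES = 50        # assumed sample size
N_CANDIDATES = 20   # assumed number of pure-noise candidate predictors
# Two-sided 5% critical value of Student's t with N_CASES - 2 = 48 df,
# hardcoded to avoid a statistics library; change it if N_CASES changes.
T_CRIT = 2.0106


def pearson_r(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_step_is_significant(rng):
    """True if the best of the noise predictors passes the nominal 5% test."""
    y = [rng.gauss(0, 1) for _ in range(N_CASES)]
    best = 0.0
    for _ in range(N_CANDIDATES):
        x = [rng.gauss(0, 1) for _ in range(N_CASES)]
        best = max(best, abs(pearson_r(x, y)))
    t = best * math.sqrt((N_CASES - 2) / (1.0 - best ** 2))
    return t > T_CRIT


def false_positive_rate(n_sims=500, seed=1996):
    """Share of all-noise data sets in which the first step looks significant."""
    rng = random.Random(seed)
    hits = sum(first_step_is_significant(rng) for _ in range(n_sims))
    return hits / n_sims


rate = false_positive_rate()
print(f"share of noise-only data sets with a 'significant' first step: {rate:.2f}")
```

With 20 candidates the share should land far above the nominal 0.05. If the 20 tests were fully independent it would be about 1 - 0.95^20, or roughly 0.64, and later steps of a real stepwise run can only add to the count.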
============================Carl Huberty, 13 Feb 1996=========ssc
Message-ID: <960213.083427.EST.CHUBERTY@UGA.CC.UGA.EDU>
From: carl huberty
Subject: Re: When are stepwise and backward regression methods appropriate?
About the only time stepwise methods are remotely appropriate is
when you have a large number of variables and you want to do some "pre
screening" of the variable set -- and you would need a "large" N/p
ratio to do such an analysis then. There are MUCH better ways to
assess variable ordering and to determine good variable subsets. DOWN
WITH STEPWISE!!
============================Rich Ulrich, 19 Feb 1996==========ssc
From: wpilib+@pitt.edu (Richard F Ulrich)
Subject: Re: When are stepwise and backward regression methods appropriate?
Message-ID: <4ga6c7$2de@usenet.srv.cis.pitt.edu>
David H. Uthe (uthed@clark.net) wrote:
< stuff deleted... >
: I've found stepwise methods useful when I have lots of variables, some of
: which I'm not really sure contribute THAT much to the solution. Some of
: these regressors may be deteriorating the solution by attracting bogus
: coefficients. The stepwise methods eliminate weak regressors to the
benefit of stronger ones, thereby strengthening the solution.
I like the first part of this statement, which echoes the doubts
of several people.
"...deteriorating the solution by attracting bogus coefficients" is
perfectly apt, since variables that are only *randomly* correlated
will survive the step-wise paring, which has the clear function of
eliminating redundancy. Useful variables in my area typically do have
redundancy, so eliminating them provides an ordinary preference for
BOGUS ones.
But "eliminate weak regressors to the benefit of stronger ones" is
hopeful thinking, if it is not *wishful* thinking. - That would
justify step-wise, sure; but, I think, you should only do the
stepwise to get a valid statement (of any kind - for testing, or
for concise description of a sample) when you can tell the difference
between `attracting bogus coefficients' and benefitting the strong ones.
You are safe, if the set of possible predictors does not include any
predictors that might happen to be bogus.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html