more on Stepwise (1997)
There are several files in this FAQ with more
general comments on stepwise selection of variables
for regression, discriminant function, etc.
In this file:
(summing up);
all possible subsets;
R^2 > .99.
(summing up)
=======================Donald F Burrill, 23 Mar 1997==========sse
Message-ID:
From: "Donald F. Burrill"
Subject: Re: Bad statistical practice Was: Re: Meta-Analysis with heterogeneous
On Sat, 22 Mar 1997, Herman Rubin wrote, inter alia:
> As for stepwise algorithms, they are never theoretically sound,
> except in situations where they are unnecessary.
Lovely! Thank you, Herman, that was _very_ nicely put!
*--------
all-possible subsets
=======================Rich Ulrich, 28 Jan 1997==========ssc
Subject: Re: All possible subset
Message-ID: <5clo8l$242@usenet.srv.cis.pitt.edu>
John P. Ball (john.ball@REMOVE.THIS.szooek.slu.se) wrote:
<< deleted, some of the stuff ... concerning stepwise solutions >>
: .... In ecology, I see that interpretations of causality are
: routinely made, hardly anyone performs cross-validation, few use
: sophisticated methods like Akaike's Information Criterion or Mallows
: Cp plots to determine how many variables to include in the final
: regression model, and nobody seems to realize the optimistic P-levels
: that emerge from any stepwise procedure, etc. Still, if one has
: insufficient replication to LEAVE all of your potential independent
: variables in a (for example) GLM, then stepwise regression may be the
: most appropriate choice (it is more data-exploration than hypothesis
: testing though!). Of all the stepwise algorithms, at least
: all-possible subsets ensures (computationally) that you get the best
: regression model (which forward-stepping, backward stepping, stepwise
: algorithms, and others cannot guarantee). So, if you HAVE to use
: stepwise regression for some other reason, at least
: all-possible-subsets ensures that you DO get the best one.
-- yes, as they say, "You might as well be hung for a sheep, as a
goat."
One vital qualifier before doing stepwise, etc. -- this is close
to being an absolute, unless you are absolutely EXPLORATORY -- is
the condition that you KNOW that all the variables are potentially
useful, and you just want a shorter, more elegant, or cheaper equation.
Then "all-possible regressions" is what does it. (Well, you can
accomplish a bit by cross-validation, but, as John says, who bothers?)
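
[Editorial illustration, not part of the original post.] What
"all-possible regressions" does computationally can be sketched as
follows: fit every non-empty subset of the candidate predictors and keep
the lowest residual sum of squares at each size, with Mallows' Cp as one
way to compare sizes. This is only a minimal sketch under stated
assumptions: the data are simulated, the function names (rss,
all_subsets, mallows_cp) are invented for the example, and the
least-squares solver is a bare normal-equations routine that assumes the
predictors are not collinear.

```python
import itertools
import random

def rss(columns, y):
    """Residual sum of squares from OLS of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting (assumes X'X is nonsingular).
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def all_subsets(predictors, y):
    """Fit every non-empty subset of predictors; return {index tuple: RSS}."""
    p = len(predictors)
    return {combo: rss([predictors[j] for j in combo], y)
            for size in range(1, p + 1)
            for combo in itertools.combinations(range(p), size)}

def mallows_cp(rss_sub, n_terms, rss_full, n_terms_full, n):
    """Mallows' Cp; n_terms counts the intercept as well as the predictors."""
    s2 = rss_full / (n - n_terms_full)
    return rss_sub / s2 - n + 2 * n_terms

# Simulated example (an assumption for the demo): 5 candidates, 2 of them real.
random.seed(1997)
n = 40
predictors = [[random.gauss(0, 1) for _ in range(n)] for _ in range(5)]
y = [2 * predictors[0][i] - 3 * predictors[2][i] + random.gauss(0, 0.1)
     for i in range(n)]

fits = all_subsets(predictors, y)
full = tuple(range(5))
for size in range(1, 6):
    combo = min((c for c in fits if len(c) == size), key=fits.get)
    cp = mallows_cp(fits[combo], size + 1, fits[full], 6, n)
    print(size, combo, round(fits[combo], 3), round(cp, 2))
```

Note that "best" here means best fit to this sample only. With p
candidates there are 2^p - 1 fits, and the search says nothing about how
the chosen equation will hold up on new data.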
This is FAR DIFFERENT from the condition that: you scanned 200 tests
of various stuff, and took the 10 or 15 "significant" variables to
put into a stat package. Doing the latter is called: "Using 200
variables to overcapitalize on chance." In fact, if your "real"
predictors tend to be inter-correlated, and not much stronger in
prediction than the "random" predictors, it is easy to see how the
"random" predictors will be selected *more often* than the "real"
ones, being preferred since they are uncorrelated. Then "stepwise"
will give you a *worse* equation than you would have gotten by
picking a few of your variables to use, by chance; and
"all-possible" will be among the worst-possible solutions.
: ... I stand
: by, awaiting correction by all the REAL statisticians reading this
: list! :-]
-- well, you did seem a little enthusiastic. I thought we had
previously stomped out all pro-stepwise spirit among our readers.
I will send you some other advice, from previous posts.
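
[Editorial illustration, not part of the original post.] The "scanned
200 tests" scenario above is easy to simulate. The sketch below screens
200 predictors that are pure noise against an outcome that is also pure
noise, and counts how many pass a two-tailed .05 test on the
correlation. Everything in it is an assumption made for the demo: the
sample size of 50, the normal random data, and the names pearson_r and
R_CRIT. The critical value 0.2787 is the .05 two-tailed cutoff for r
with 48 degrees of freedom.

```python
import math
import random

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(200)
n_cases, n_tests = 50, 200
R_CRIT = 0.2787  # two-tailed .05 critical value of r for n = 50 (df = 48)

# Outcome and all 200 "tests" are independent noise: nothing real to find.
outcome = [random.gauss(0, 1) for _ in range(n_cases)]
noise = [[random.gauss(0, 1) for _ in range(n_cases)] for _ in range(n_tests)]

hits = [j for j, x in enumerate(noise) if abs(pearson_r(x, outcome)) > R_CRIT]
print(len(hits), "of", n_tests, "pure-noise predictors pass the .05 screen")
```

On average about 5% of the noise predictors, roughly 10 of the 200,
should pass the screen, which is the kind of "significant" list that
then gets handed to a stepwise or all-possible-subsets routine.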
*--------
R^2 > .99 (stepwise?)
=======================Rich Ulrich, 27 Jan 1997==========spss
Subject: Re: HELP - Q : compute
Message-ID: <5cin0d$gn7@usenet.srv.cis.pitt.edu>
Here is one of the few applications where STEPWISE REGRESSION
might actually be of good service. (...as compared to the ordinary
world of research, where Stepwise is just a naive error.)
If your R-squared is approaching 1.0, then you don't worry too much
about having "irrelevant" predictors. Your professor could try to
predict, or fit, the NEW_VAR by using stepwise regression (or, best
subset) using variables that are likely candidates to have been
included. When you hit 100% perfect prediction, then you have the
original variables, or something that gives exactly the same scores
as they did.
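
[Editorial illustration, not part of the original post.] In the case
asked about, the form of the formula is known (the MEAN of four
variables), so the "perfect prediction" idea can be sketched with a
brute-force check in place of the regression search described above:
try every set of four candidate variables and keep any set whose mean
reproduces new_var exactly. This is a swapped-in shortcut, not the
stepwise or best-subset fit itself. The data, the variable names, and
the function find_mean_sources are all hypothetical, and the sketch
assumes no missing values.

```python
import itertools
import random

def find_mean_sources(candidates, target, size=4, tol=1e-9):
    """Return every `size`-subset of candidate names whose row-wise mean equals target."""
    n = len(target)
    matches = []
    for combo in itertools.combinations(sorted(candidates), size):
        if all(abs(sum(candidates[name][i] for name in combo) / size - target[i]) <= tol
               for i in range(n)):
            matches.append(combo)
    return matches

random.seed(612)
# Hypothetical data file: 10 candidate variables scored 1-7 on 30 cases.
data = {f"var{j}": [random.randint(1, 7) for _ in range(30)] for j in range(1, 11)}
# The "lost" computation, known only to this demo.
new_var = [(data["var2"][i] + data["var5"][i] + data["var6"][i] + data["var9"][i]) / 4
           for i in range(30)]

matches = find_mean_sources(data, new_var)
print(matches)
```

If more than one set matches, the data cannot distinguish them, which is
the "something that gives exactly the same scores" caveat above.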
=========================in response to ...
Ling Ting (ting@COMP.UARK.EDU) wrote:
<< start, deleted >>
: A professor used the "COMPUTE" command of SPSS for Windows (6.1.2) to calculate
: a new variable say new_var. What she did is
: click on
: Transform -> Compute -> ... etc to get the new variable
: in term of syntax would be
: COMPUTE new_var=MEAN(var1, var2, var3, var4)
: The problem is she doesn't have any information about which variables
: were used to compute the new_var. Now, she needs to know what these 4
: variables are. Is there any way to find out what they are?
<< rest, deleted >>
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html