- file stat 97regrn.html ->
Regression: N vs. p vars
Three postings and references on the question
of the sample size, N, needed for regression
when compared to the number of predictors, p
-- and "micronumerosity".
N for regression (1)
=======================Klajmang, 12 May 1997==========sse
Message-ID: <199705130301.XAA20889@jse.stat.ncsu.edu>
From: klajmang@alpha.montclair.edu
Subject: Re: Cases-to-predictors ratio in multiple regression?
Mike Palij, responding to a query from Ronan Conroy about a minimum ratio of
cases to variables, quoted Tabachnick and Fidell, and went on to say
>I do not have T&F's latest, 3rd edition text handy, so I
>don't know if they continue to give this advice. Note,
>they do not provide a reference for the rule but I am
>pretty certain that I had read the rule elsewhere as
>well.
>
>-Mike Palij/Psychology Dept/New York University
Mike Palij cited the first edition of Tabachnick & Fidell (_Using Multivariate
Statistics_). I have the third edition of _Using Multivariate Statistics_
(1996). They write:
Required sample size depends on a number of issues, including the
desired power, alpha level, number of predictors, and expected effect
size. Green (1991) provides a thorough discussion of these issues and
some procedures to help decide how many cases are necessary. The
simplest rules of thumb are N >= 50 + 8m (m is the number of IVs) for
testing the multiple correlation and N >= 104 + m for testing individual
predictors. (p. 132) ...
They go on to discuss the increased number of cases demanded by skewed DV,
small effect size, and so on.
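The two rules of thumb quoted above are easy to tabulate. The following is only a minimal sketch in Python; the function names are hypothetical (they do not come from the text), and taking the larger of the two values when both kinds of test are wanted is a convenience assumed here.

```python
def min_n_multiple_correlation(m: int) -> int:
    """Rule of thumb N >= 50 + 8m for testing the multiple correlation."""
    if m < 1:
        raise ValueError("m must be at least 1 predictor")
    return 50 + 8 * m


def min_n_individual_predictors(m: int) -> int:
    """Rule of thumb N >= 104 + m for testing individual predictors."""
    if m < 1:
        raise ValueError("m must be at least 1 predictor")
    return 104 + m


def min_n_both(m: int) -> int:
    """Assumption: when both tests are of interest, use the larger N."""
    return max(min_n_multiple_correlation(m), min_n_individual_predictors(m))


if __name__ == "__main__":
    for m in (1, 5, 10, 20):
        print(m, min_n_multiple_correlation(m),
              min_n_individual_predictors(m), min_n_both(m))
```

With few predictors the 104 + m rule is the larger one; from m = 8 upward the 50 + 8m rule takes over. Neither rule accounts for effect size, skew in the DV, or desired power, which is why the quoted passage calls them only the simplest rules.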
Equally valuable, in my view, is their terse warning (p. 133):
It is possible to have too many cases. As the number of cases becomes
quite large, almost any multiple correlation will depart significantly
from zero, even one that predicts negligible variance in the DV. For
both statistical and practical reasons, then, one wants to measure the
smallest number of cases that has a decent chance of revealing a
relationship of a specified size.
They refer to:
Green, S. B. (1991) How many subjects does it take to do a regression
analysis? _Multivariate Behavioral Research_, 26, 499-510.
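The "too many cases" warning quoted above can be illustrated with the usual overall F statistic for the multiple correlation, F = (R^2 / m) / ((1 - R^2) / (N - m - 1)). This is a rough sketch only; the function name and the example values (R^2 = 0.01, m = 5) are assumptions chosen for illustration, not figures from the text.

```python
def f_statistic(r_squared: float, n: int, m: int) -> float:
    """Overall F for H0: all m slopes are zero, given R^2 from N cases."""
    df_error = n - m - 1
    if df_error <= 0:
        raise ValueError("need N > m + 1 for the test to be defined")
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return (r_squared / m) / ((1.0 - r_squared) / df_error)


if __name__ == "__main__":
    # Hypothetical example: a model that explains only 1% of the variance.
    for n in (100, 1_000, 10_000):
        print(n, round(f_statistic(0.01, n, 5), 2))
```

For a fixed R^2 the statistic grows in proportion to N - m - 1. The 5% critical value of F with 5 numerator d.f. is roughly 2.2 to 2.3 at these sample sizes, so the same negligible R^2 moves from clearly non-significant to clearly significant as N rises from 100 to 10,000.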
N for regression (2)
=======================Rich Ulrich, 12 May 1997==========ssc
Subject: Re: Rule of thumb in regression
Message-ID: <5l7vs5$5r6@usenet.srv.cis.pitt.edu>
Michael Fahey (michael@*junk*east.ncc.go.jp) wrote:
: Ziegler Andreas Dr. wrote:
:
: > in the last weeks I have heard the following rule of thumb several
: > times at different universities: In a regression you need
: > approximately 100 observations for every variable to be estimated.
: An alternative is that the number of predictor degrees of freedom should
: not exceed m/10, where m is the number of uncensored events.
: > Is there a good reason for this rule of thumb? Can anyone point me
: > to a reference?
: See Harrell FE et al, Stat Med, 15: 361-87, 1996.
-- I never heard of the 100 obs.-per-var. rule.
-- Making the distinction on the number of "events" is the proper
starting point, for considering logistic regression, or any other
use of dichotomous variables.
-- 10-20 continuous cases per d.f. is enough to get you fairly
good *testing* for large effects. But what are you interested in?
And how good is your prediction? A minimal test of whether everything
is zero takes far less data than trying to find an accurate set
of regression coefficients. And it takes far less data if your
(adjusted) R-squared is .95 or .99, than if you have softer data
that have 10 or 50 times the error.
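The two counting rules mentioned in this exchange (predictor d.f. not exceeding m/10 for m uncensored events, and 10-20 cases per d.f. for continuous outcomes) can be sketched as follows. The function names and the example counts are hypothetical, and the sketch says nothing about the accuracy of the resulting coefficients, which is the caveat raised above.

```python
def max_predictor_df(n_events: int, events_per_df: int = 10) -> int:
    """Largest predictor d.f. allowed under the events/10 rule."""
    if n_events < 0 or events_per_df < 1:
        raise ValueError("counts must be non-negative, divisor positive")
    return n_events // events_per_df


def cases_for_testing(predictor_df: int, low: int = 10, high: int = 20):
    """Range of N implied by 'low' to 'high' cases per d.f."""
    if predictor_df < 1:
        raise ValueError("need at least one predictor d.f.")
    return low * predictor_df, high * predictor_df


if __name__ == "__main__":
    # Hypothetical study: 600 subjects, of whom 87 had the event.
    print(max_predictor_df(87))
    # Hypothetical continuous-outcome model with 6 predictor d.f.
    print(cases_for_testing(6))
```

Note that the first rule counts events, not subjects: in the hypothetical study the 600 subjects are irrelevant to the limit, and only the 87 events matter.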
*--------
N for regression (3)
=======================William F Barker, 12 May 1997==========sse
Message-ID: <01IIS9R2SX4Y8WW1D5@grove.iup.edu>
From: BARKER@grove.iup.edu
Subject: Re: Regression minimum sample size references
>On 12-MAY-1997 13:12:54.62 Ronan Conroy wrote
>
> "Subject: Cases-to-predictors ratio in multiple regression?
>
> ...
>
> Are any psychology people out there (or anyone else) familiar with
> 'rules' about the ratio of cases-to-variables for multiple regression,
> and can anyone provide a reference and/or rationale?
>
> ..."
Some references I have collected:
GENERAL (including Regression):
Cohen, J. (1988). "Statistical power analysis for the
behavioral sciences" 2nd Edition. Hillsdale, NJ: Lawrence
Erlbaum Associates, Publishers.
Summary: n = Fn(alpha, # predictors, effect size, power)
pp.444-465
REGRESSION SPECIFIC:
Green, S. B. (1991). How many subjects does it take to do a
regression analysis? "Multivariate Behavioral Research", 26, 499-510.
Summary: n = 50 + 8*P where P = # of Predictor variables.
CORRELATION:
Borg, W.R. & Gall, M.D. (1989). "Educational research: An
introduction", 5th Edition. White Plains, NY: Longman, Inc.
Summary: Minimum sample size Rule of Thumb: n = 30 .
PARAMETER ESTIMATES:
McMillan, J. H., & Schumacher, S. (1984). "Research in
education: A conceptual introduction". Boston, MA: Little, Brown
and Company. (pp 120-123)
Summary: Minimum sample size for a correlation study should
be 30 (lots of ifs).
Raykov, T., & Widaman, K. F. (1995). Issues in applied
structural equation modeling research. "Structural Equation
Modeling", 2, 289-318. (p. 296)
Summary: Minimum sample size 5 subjects per free parameter
(lots of ifs).
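The Cohen summary above, n = Fn(alpha, # predictors, effect size, power), can be written out a little more explicitly. What follows is a sketch of the standard power relation for the overall F test in multiple regression, with symbols that are assumptions of this note rather than of the reference list: u is the number of predictors, v = N - u - 1 is the error d.f., f^2 is the effect size, lambda is the noncentrality parameter, and F' denotes a noncentral F variate.

```latex
f^2 = \frac{R^2}{1 - R^2}, \qquad
\lambda = f^2\,(u + v + 1) = f^2 N, \qquad
\text{power} = \Pr\!\left[\, F'(u, v, \lambda) > F_{1-\alpha}(u, v) \,\right]
```

Because v itself depends on N, the required N is found by increasing N until the power on the right reaches the target (for example 0.80). Fixed formulas such as n = 50 + 8*P shortcut this search by assuming a particular effect size, alpha, and power.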
I hope this helps.
Other references would be appreciated. Thanks in advance.
Barker@Grove.IUP.Edu
*--------
Micronumerosity
=======================T Scott Thompson, 14 Jul 1993==========sms
From: thompson@atlas.socsci.umn.edu (T. Scott Thompson)
Subject: Re: Statistical Referee's Canon
Message-ID:
saswss@hotellng.unx.sas.com (Warren Sarle) writes:
>In article <21tde2$t23@news.u.washington.edu>, rons@hardy.u.washington.edu (Ronald Schoenberg) writes:
>|> ...
>|> It appears to me that you should be spending more of your time with
>|> econometrics texts and less time with statistics texts - there isn't
>|> an econometrics text in my library that doesn't talk about T > k and why.
>I tried Theil, _Principles of Econometrics_, who says, "It will be
>assumed here that n > K ...", and Judge, Griffiths, Hill, and Lee,
>_The Theory and Practice of Econometrics_, who don't even say as much
>as Theil on this subject as far as I could tell.
>I have highly technical books on least-squares computation that
>explain why least-squares estimates are not unique when the sample
>size is less than the number of variables, but that doesn't really
>address the statistical issue of why one wants (considerably) more
>observations than variables in a multiple regression.
Try looking at the section on "micronumerosity" (i.e. small samples)
in Art Goldberger's newer econometrics textbook (published maybe two
years ago). He discusses the difficulties that small samples produce.
The section is actually a parody on the endless discussions of
heteroskedasticity that appear in many stat/econometrics books. Art's
position is that heteroskedasticity and "micronumerosity" cause
roughly the same difficulties (e.g. imprecise estimates and low power
for statistical tests), are detected by similar means (look at the
data), and are curable by the same means (get better data). He
concludes that heteroskedasticity gets all of the attention only
because it has a fancy name. Hence his new name for the problem of
not enough data.
The parody is really quite amusing. (At least for an econometrician.)
For example, he discusses how it is easy to detect "extreme
micronumerosity" (i.e. no data points) but somewhat harder to detect
"near micronumerosity", since the latter requires judicious use of
fingers and toes.
Seriously though, what would you say about this issue? The T>k
requirement is obvious. It is implicit in the usual formulas for
least squares, for example, which assume invertibility of the
regressor cross-products matrix. I suspect that this is why no
explicit mention appears in many books.
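The invertibility point can be checked directly: with T observations and k regressors, the k-by-k cross-products matrix X'X has rank at most T, so it is singular whenever T < k. The following is a minimal sketch in plain Python using exact rational arithmetic; the helper names and the small data matrices are hypothetical, made up only for illustration.

```python
from fractions import Fraction


def cross_products(x):
    """Return X'X for a T-by-k matrix given as a list of rows."""
    k = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(k)]
            for i in range(k)]


def rank(matrix):
    """Matrix rank by Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r


# Hypothetical data: k = 3 regressors (a constant plus two variables).
x_small = [[1, 2, 3], [1, 5, 7]]             # T = 2 < k
x_ok = x_small + [[1, 3, 8], [1, 4, 2]]      # T = 4 > k

if __name__ == "__main__":
    print(rank(cross_products(x_small)))  # less than k: X'X is singular
    print(rank(cross_products(x_ok)))     # equal to k: X'X is invertible
```

Full rank only guarantees that the least-squares estimates exist and are unique. It says nothing about their precision, which is the statistical question raised in the surrounding discussion.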
We all have an intuitive feel for why it is usually good to have lots
of data, but I have never seen any definite results that go beyond the
basic T>k requirement. Good inferences can be drawn from datasets
with T only slightly larger than k if the errors are small and the
regressors have a "good" configuration. Bad inferences are likely
even with T>>k if the data are collected from a poorly designed
experiment and/or if the errors are large. Can any more be said?
What recommendations would you put in your ideal textbook and why?
--
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html