<- file stat 97regrn.html -> Regression: N vs. p vars. Three postings and references on the question of the sample size N needed for regression, compared with the number of predictors p, and on "micronumerosity".
  • N for regression (1)
  • =======================Klajmang, 12 May 1997==========sse
    Message-ID: <199705130301.XAA20889@jse.stat.ncsu.edu>
    From: klajmang@alpha.montclair.edu
    Subject: Re: Cases-to-predictors ratio in multiple regression?

    Mike Palij, responding to a query from Ronan Conroy about a minimum ratio of cases to variables, quoted Tabachnick and Fidell, and went on to say

    >I do not have T&F's latest, 3rd edition text handy, so I
    >don't know if they continue to give this advice. Note,
    >they do not provide a reference for the rule but I am
    >pretty certain that I had read the rule elsewhere as
    >well.
    >
    >-Mike Palij/Psychology Dept/New York University

    Mike Palij cited the first edition of Tabachnick & Fidell (_Using Multivariate Statistics_). I have the third edition (1996). They write:

        Required sample size depends on a number of issues, including the desired power, alpha level, number of predictors, and expected effect size. Green (1991) provides a thorough discussion of these issues and some procedures to help decide how many cases are necessary. The simplest rules of thumb are N >= 50 + 8m (m is the number of IV's) for testing the multiple correlation and N >= 104 + m for testing individual predictors. (P. 132)

    ...

    They go on to discuss the increased number of cases demanded by a skewed DV, small effect size, and so on. Equally valuable, in my view, is their terse warning (p. 133):

        It is possible to have too many cases. As the number of cases becomes quite large, almost any multiple correlation will depart significantly from zero, even one that predicts negligible variance in the DV. For both statistical and practical reasons, then, one wants to measure the smallest number of cases that has a decent chance of revealing a relationship of a specified size.

    They refer to:

    Green, S. B. (1991). How many subjects does it take to do a regression analysis? _Multivariate Behavioral Research_, 26, 499-510.
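The two rules of thumb quoted above are simple enough to tabulate. The following is a minimal Python sketch, not anything from Tabachnick & Fidell or Green; the function name `green_rules` and the "both" entry (taking the larger of the two values when both kinds of test are wanted) are assumptions made for illustration.

```python
def green_rules(m: int) -> dict:
    """Rule-of-thumb minimum N for m predictors (IVs), as quoted in the posting."""
    if m < 1:
        raise ValueError("m must be at least 1")
    n_multiple_r = 50 + 8 * m   # N >= 50 + 8m: testing the multiple correlation
    n_individual = 104 + m      # N >= 104 + m: testing individual predictors
    return {
        "multiple_correlation": n_multiple_r,
        "individual_predictors": n_individual,
        # Assumption for this sketch: the larger value covers both kinds of test.
        "both": max(n_multiple_r, n_individual),
    }


if __name__ == "__main__":
    for m in (2, 5, 10):
        print(m, green_rules(m))
```

These numbers are only starting points: as the quoted passage says, the required N also depends on power, alpha, and the expected effect size.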
  • N for regression (2)
  • =======================Rich Ulrich, 12 May 1997==========ssc
    Subject: Re: Rule of thumb in regression
    Message-ID: <5l7vs5$5r6@usenet.srv.cis.pitt.edu>

    Michael Fahey (michael@*junk*east.ncc.go.jp) wrote:
    : Ziegler Andreas Dr. wrote:
    :
    : > in the last weeks I have heard the following rule of thumb several
    : > times at different universities: In a regression you need
    : > approximately 100 observations for every variable to be estimated.

    : An alternative is that the number of predictor degrees of freedom should
    : not exceed m/10, where m is the number of uncensored events.

    : > Is there a good reason for this rule of thumb? Can anyone point me
    : > to a reference?

    : See Harrell FE et al, Stat Med, 15: 361-87, 1996.

    -- I never heard of the 100 obs.-per-var. rule.

    -- Making the distinction on the number of "events" is the proper starting point for considering logistic regression, or any other use of dichotomous variables.

    -- 10-20 continuous cases per d.f. is enough to get you fairly good *testing* for large effects. But what are you interested in? And how good is your prediction? A minimal test of whether everything is zero takes far less data than trying to find an accurate set of regression coefficients. And it takes far less data if your (adjusted) R-squared is .95 or .99 than if you have softer data that have 10 or 50 times the error.

    *--------
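The m/10 rule quoted above can be written as a one-line calculation. This is a hedged Python sketch only: the function name `max_predictor_df` and the adjustable `events_per_df` parameter are hypothetical, introduced here so the same helper can also express a stricter ratio such as 20 per d.f.

```python
def max_predictor_df(events: int, events_per_df: int = 10) -> int:
    """Largest number of predictor d.f. allowed by an events-per-d.f. rule.

    events        -- m, the number of uncensored events (not the total N)
    events_per_df -- events required per predictor d.f. (10 in the quoted rule)
    """
    if events < 0:
        raise ValueError("events must be non-negative")
    if events_per_df < 1:
        raise ValueError("events_per_df must be at least 1")
    # Integer division rounds down, so the rule is never exceeded.
    return events // events_per_df


if __name__ == "__main__":
    # With 87 events, the m/10 rule allows at most 8 predictor d.f.
    print(max_predictor_df(87))
```

Note that the count that matters is events, so a study with thousands of cases but few events still supports only a small model under this rule.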
  • N for regression (3)
  • =======================William F Barker, 12 May 1997==========sse
    Message-ID: <01IIS9R2SX4Y8WW1D5@grove.iup.edu>
    From: BARKER@grove.iup.edu
    Subject: Re: Regression minimum sample size references

    >On 12-MAY-1997 13:12:54.62 Ronan Conroy <rconroy@rcsi.ie> wrote
    >
    > "Subject: Cases-to-predictors ratio in multiple regression?
    >
    > ...
    >
    > Are any psychology people out there (or anyone else) familiar with
    > 'rules' about the ratio of cases-to-variables for multiple regression,
    > and can anyone provide a reference and/or rationale?
    >
    > ..."

    Some references I have collected:

    GENERAL (including Regression):
    Cohen, J. (1988). "Statistical power analysis for the behavioral sciences", 2nd Edition. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
    Summary: n = Fn(alpha, # predictors, effect size, power), pp. 444-465

    REGRESSION SPECIFIC:
    Green, S. B. (1991). How many subjects does it take to do a regression analysis? "Multivariate Behavioral Research", 26, 499-510.
    Summary: n = 50 + 8*P, where P = # of predictor variables.

    CORRELATION:
    Borg, W. R. & Gall, M. D. (1989). "Educational research: An introduction", 5th Edition. White Plains, NY: Longman, Inc.
    Summary: Minimum sample size rule of thumb: n = 30.

    PARAMETER ESTIMATES:
    McMillan, J. H., & Schumacher, S. (1984). "Research in education: A conceptual introduction". Boston, MA: Little, Brown and Company. (pp. 120-123)
    Summary: Minimum sample size for a correlation study should be 30 (lots of ifs).

    Raykov, T., & Widaman, K. F. (1995). Issues in applied structural equation modeling research. "Structural Equation Modeling", 2, 289-318. (p. 296)
    Summary: Minimum sample size 5 subjects per free parameter (lots of ifs).

    I hope this helps. Other references would be appreciated. Thanks in advance.

    Barker@Grove.IUP.Edu

    *--------
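The summary "n = Fn(alpha, # predictors, effect size, power)" can be unpacked a little. What follows is the usual noncentral-F formulation of power for the overall test in multiple regression, given as a sketch and not quoted from the posting; the symbols u (number of predictors), v (error d.f.), f^2 and lambda are notation assumed here.

```latex
% u = number of predictors, N = sample size, v = N - u - 1 (error d.f.)
\[
f^2 = \frac{R^2}{1 - R^2}, \qquad
\lambda = f^2\,(u + v + 1) = f^2 N
\]
\[
\text{power}(N) = \Pr\!\left[\, F'(u, v;\ \lambda) > F_{1-\alpha}(u, v) \,\right]
\]
```

Here F' is a noncentral F variate and F_{1-alpha}(u, v) is the central-F critical value. The required n is then the smallest N for which power(N) reaches the target, which is why it depends jointly on alpha, the number of predictors, the effect size, and the desired power.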
  • Micronumerosity
  • =======================T Scott Thompson, 14 Jul 1993==========sms
    From: thompson@atlas.socsci.umn.edu (T. Scott Thompson)
    Subject: Re: Statistical Referee's Canon
    Message-ID: <thompson.742662110@daphne.socsci.umn.edu>

    saswss@hotellng.unx.sas.com (Warren Sarle) writes:

    >In article <21tde2$t23@news.u.washington.edu>, rons@hardy.u.washington.edu (Ronald Schoenberg) writes:
    >|> ...
    >|> It appears to me that you should be spending more of your time with
    >|> econometrics texts and less time with statistics texts - there isn't
    >|> an econometrics text in my library that doesn't talk about T > k and why.

    >I tried Theil, _Principles of Econometrics_, who says, "It will be
    >assumed here that n > K ...", and Judge, Griffiths, Hill, and Lee,
    >_The Theory and Practice of Econometrics_, who don't even say as much
    >as Theil on this subject as far as I could tell.

    >I have highly technical books on least-squares computation that
    >explain why least-squares estimates are not unique when the sample
    >size is less than the number of variables, but that doesn't really
    >address the statistical issue of why one wants (considerably) more
    >observations than variables in a multiple regression.

    Try looking at the section on "micronumerosity" (i.e. small samples) in Art Goldberger's newer econometrics textbook (published maybe two years ago). He discusses the difficulties that small samples produce. The section is actually a parody of the endless discussions of heteroskedasticity that appear in many stat/econometrics books.

    Art's position is that heteroskedasticity and "micronumerosity" cause roughly the same difficulties (e.g. imprecise estimates and low power for statistical tests), are detected by similar means (look at the data), and are curable by the same means (get better data). He concludes that heteroskedasticity gets all of the attention only because it has a fancy name. Hence his new name for the problem of not enough data.

    The parody is really quite amusing. (At least for an econometrician.) For example, he discusses how it is easy to detect "extreme micronumerosity" (i.e. no data points) but somewhat harder to detect "near micronumerosity", since the latter requires judicious use of fingers and toes.

    Seriously though, what would you say about this issue? The T>k requirement is obvious. It is implicit in the usual formulas for least squares, for example, which assume invertibility of the regressor cross-products matrix. I suspect that this is why no explicit mention appears in many books.

    We all have an intuitive feel for why it is usually good to have lots of data, but I have never seen any definite results that go beyond the basic T>k requirement. Good inferences can be drawn from datasets with T only slightly larger than k if the errors are small and the regressors have a "good" configuration. Bad inferences are likely even with T>>k if the data are collected from a poorly designed experiment and/or if the errors are large.

    Can any more be said? What recommendations would you put in your ideal textbook and why?

    -- * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
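The remark that T>k is "implicit in the usual formulas" can be made explicit with the textbook least-squares expressions. This is a sketch under the classical assumptions (X is a T x k regressor matrix, errors uncorrelated with common variance sigma^2); the notation is assumed here and does not come from the posting.

```latex
\[
\hat\beta = (X'X)^{-1} X'y, \qquad
\operatorname{rank}(X'X) = \operatorname{rank}(X) \le \min(T, k)
\]
\[
\operatorname{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}, \qquad
s^2 = \frac{e'e}{T - k}
\]
```

If T < k, the k x k matrix X'X has rank below k and cannot be inverted, so the estimates are not unique. If T = k (with X of full rank), the fit is exact, the residual vector e is zero, and s^2 is undefined because no residual degrees of freedom remain. Beyond that, the variance formula shows precision depending only on sigma^2 and on X'X, which matches the point that small errors and a "good" regressor configuration can compensate for T being only slightly larger than k.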
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html