<- file stat 97logis.html -> Logistic Regr - 5 comments

In this file about logistic regression, there are notes on:
    Tests - of FIT                 Nichols
    Tests - betas                  Ulrich
    Classification cutoffs         Conroy
    REFs Logistic/Disc function    Helberg
    Normality assumption           Ulrich
    Extreme splits in group        Ulrich
*----------------
  • Tests in logistic (FIT)
  • =======================David Nichols, 01 jun 1993==========spss
    Subject: Re: logistic regression goodness of fit indicators.
    Message-ID: <C7ytMz.C5z@spss.com>

    In article <SHAWN.93May25164147@shadowfax.ori.org> shawn@shadowfax.ori.org (-Shawn Boles-) writes:
    >
    >In the SPSS Advanced statistics User's Guide (1990) the Logistic Regression
    >Section (2.11 pp.52-53) discusses the use of the following table for
    >assessing the overall goodness of fit for a model (starting against a null
    >model):
    >
    >                      Chi-Square    df   Significance
    > -2 Log Likelihood       48.126    47      .4274   <- good model
    > Model Chi-Square        22.126     5      .0005
    > Improvement             22.126     5      .0005
    > Goodness of Fit         46.790    47      .4812   <- good model
    >
    >The point made is that the first & last rows of the table are used to
    >decide that the model does not differ from a perfect (i.e., saturated)
    >model. The middle two rows are then used to test the significance of the
    >coefficients themselves given an adequate model.
    >
    >My question is what interpretation does one make as to the adequacy of a model
    >if the following table is obtained? Does the -2LL or the Goodness of Fit
    >test control the decision? Or does the fact that they contradict one another
    >point to some flaw in the model construction itself?
    >
    >                      Chi-Square    df   Significance
    > -2 Log Likelihood      255.720     1      .0005   <- model no good
    > Model Chi-Square         3.177     1      .0747   <- coef. <~> 0 (p<.10) if good model
    > Improvement              3.177     1      .0747      "    "    "
    > Goodness of Fit        188.000   186      .4452   <- good model
    >
    >direct replies appreciated,
    >
    >thanks in advance,
    >
    >
    >Shawn Boles
    >
    >Oregon Research Institute        Internet: shawn@ori.org
    >1899 Willamette St.              Voice: (503) 484-2123 Ext. 172
    >Eugene, Oregon /97401 USA        Fax: (503) 484-1108
    >
    >....................... non nova, sed novae .........................

    Two things here, one a user misreading, the other a since-corrected mistake by SPSS (and some of the research literature).

    First, only the first row of the table is represented as comparing the current model with a saturated model in the User's Guide. It says that the last line leads to the same conclusion as the first. You can't compare Pearson chi-squares for any kind of nested models of any type and get a chi-square distributed variable, which is the idea behind the statement about the -2LL statistic.

    Second, the -2LL idea is wrong. It is mistakenly stated in a number of places in the literature (I know we relied heavily on John Fox's Wiley series book _Linear Statistical Models and Related Methods_) that you can compare a particular model with a saturated model via subtraction of one -2LL from the other, with the difference being a chi-square variable under the null hypothesis. This is not true for the situation represented in the SPSS LOGISTIC REGRESSION procedure, in which you have generally as many cells in your design as cases, because the approximation there is based on asymptotics as the number of cases in a cell becomes large. The difference between two nested models involving some parameters is indeed chi-square distributed on df equal to the number of parameters by which the two models differ. You can check McCullagh and Nelder's 2nd edition of _Generalized Linear Models_ for a more precise discussion of the issue. Shelby Haberman proved the nested result originally, I believe. I posted a long discussion of this some time ago.

    What we have done for version 5 is to print just the measures themselves (they are functions of deviance and Pearson residuals, respectively), without any df or significance approximation, since we don't know what the distributions are under the null hypothesis. These should not be used as test statistics.

    In more general interpretational terms, the two kinds of residuals are sometimes sensitive to different things. I'm more accustomed to seeing situations in which the Pearson residuals are large while the deviance ones are not. This is often caused by one or two very badly misfitting cases, which are in effect weighted more heavily by the Pearson than by the deviance.

    If anyone needs any further explanation, let me know.
    --
*--------
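The closing remark in the Nichols note above, that a badly misfitting case is weighted more heavily by the Pearson measure than by the deviance, can be checked with a little arithmetic. The sketch below is only an illustration: the three cases and their fitted probabilities are made up, and the function names are not SPSS names. It uses the standard per-case definitions for a binary outcome y with fitted probability p.

```python
import math

def pearson_residual(y, p):
    """Pearson residual for one binary case: (observed - fitted) / binomial SD."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y, p):
    """Deviance residual: signed square root of -2 * log-likelihood of the case."""
    likelihood = p if y == 1 else 1.0 - p
    sign = 1.0 if y == 1 else -1.0
    return sign * math.sqrt(-2.0 * math.log(likelihood))

# Hypothetical cases (observed outcome, fitted probability).
# The last one is an event that the model called nearly impossible.
cases = [(1, 0.60), (0, 0.30), (1, 0.01)]

for y, p in cases:
    pearson_sq = pearson_residual(y, p) ** 2
    deviance_sq = deviance_residual(y, p) ** 2
    print(f"y={y} p={p:.2f}  Pearson^2={pearson_sq:8.2f}  deviance^2={deviance_sq:6.2f}")
```

For the two ordinary cases the squared residuals are of similar size, but the misfitting case contributes about 99 to the Pearson sum and only about 9.2 to the deviance, so one or two such cases can inflate the Pearson-based measure alone.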
  • Testing - betas.
  • =======================Rich Ulrich, 28 Jul 1997==========ssc
    Subject: Logistic regression
    Message-ID: <5rirki$4f6@usenet.srv.cis.pitt.edu>

    Marc BUSSON (busson@NEPTUNE.CHU-STLOUIS.FR) wrote:
    : Hello,
    : Question about logistic regression.
    : Suppose, you have a model with 3 independent covariates A,B,C.
    : The Beta coefficients for A and B are significantly different from zero.
    : You remove C from the model, and you observe that the likelihood of the
    : model decreased significantly, so you decide to keep the first model with
    : A,B AND C. but what do you do with the beta of C. In particular, if its

    - I think your question should be answered by noting that

    (a) the change in likelihood is, in general, a better test than the test based on the asymptotic variance. Thus, your premise is wrong, that C is "not significant".

    And, (b) as someone has noted, when two tests come out different, your problem is probably screwed up in some way, so you should be slow about drawing ANY conclusions.

    Finally, (c), if the coefficient for C is large, sig. small, then C is greatly confounded with some other variable or two. In that case, whether it is in the equation should depend on knowledge of the variables and what can make up an intelligible model - and NOT much on the p-level.
*--------
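The "change in likelihood" test in point (a) of the note above can be sketched in a few lines for the case where one coefficient is dropped. This is a minimal illustration, not any package's routine: the function name and the two -2LL values are assumptions chosen for the example. With one degree of freedom the chi-square tail probability has a closed form through the complementary error function, so only the standard library is needed.

```python
import math

def lr_test_one_df(neg2ll_reduced, neg2ll_full):
    """Likelihood-ratio test for two nested models that differ by one parameter.

    Both arguments are -2 * log-likelihood values; the reduced model
    (without the coefficient) can never fit better than the full one.
    """
    stat = neg2ll_reduced - neg2ll_full
    if stat < 0:
        raise ValueError("reduced model cannot have a smaller -2LL than the full model")
    # Chi-square upper tail for 1 df: P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Illustrative values only: -2LL without C, then -2LL with C in the model.
stat, p_value = lr_test_one_df(258.897, 255.720)
print(f"LR chi-square = {stat:.3f}, p = {p_value:.4f}")
```

The p-level from this test is the one to set beside the test based on the asymptotic variance; when the two disagree badly, point (b) applies.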
  • Classification (cutoffs)
  • =======================Ronan Conroy, 11 Apr 1997==========ssc
    Message-ID: <199704111423.PAA25267@gate.rcsi.ie>
    From: Ronan Conroy <rconroy@RCSI.IE>
    Subject: Re: Choice of cutoff points in logistic regression

    Anyone interested in the thorny issues involved in using logistic regression for classification should have a look at the paper by Frank Harrell and his colleagues in Stats in Medicine:

    Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996:15:361-87

    Title says it all. Delightfully clear paper.
*--------
  • REFs logistic, discriminant f.
  • =======================Clay Helberg, 28 Jan 1997==========ssc
    From: Clay Helberg <chelberg@spss.com>
    Subject: Re: Alternatives to Logistic Regression
    Message-ID: <32EE30A3.6F2A@spss.com>

    Richard F Ulrich wrote:
    > In theory, you may feel comfortable looking at TESTS on Logistic
    > when your data don't satisfy you for Discriminant function. But it
    > still seems to me that there should be little difference in how well they
    > work, in the ordinary case where the fit is far less than
    > perfect. I would be interested in hearing of simulation results, or
    > examples and counter-examples.

    Here are a few things I happened to have handy:

    * Srinivasan & Kim (1987) compared several classification procedures in the context of credit granting decisions. Their results based on resampling from an actual dataset indicate that logistic regression provides better classification than linear discriminant analysis, mainly due to inequality of VCV matrices across groups.

    * Wiginton (1980) also reports that LR outperforms DA in terms of classification accuracy in credit related problems, although neither procedure performed particularly well with this data.

    Unfortunately, I'm afraid I don't have the library resources at hand that I did when I worked for U of Wisconsin, so I can't really look up more examples. I have a few generic citations which probably contain useful info, but I don't have copies of them, so I can't comment on them directly. Perhaps someone who is familiar with the papers/books in question will comment....

    --Clay

    Generic citations:

    Altman, et al. (1981) Application of Classification Techniques in Business, Banking and Finance. CT: JAI Press

    Eisenbeis (1977) Problems in applying discriminant analysis in credit scoring models. J of Banking and Finance, 2, 205-219

    Eisenbeis & Avery (1972) Discriminant Analysis and Classification Procedures. Lexington, MA: Lexington Books

    Efron (1975) The efficiency of logistic regression compared to normal discriminant analysis. JASA, 70, 113-121

    Press & Wilson (1978) Choosing between logistic regression and discriminant analysis. JASA, 73, 699-705

    References:

    Srinivasan & Kim (1987) Credit granting: a comparative analysis of classification procedures. J of Finance, 42, 665-683

    Wiginton (1980) A note on the comparison of logit and discriminant models of consumer credit behavior. J. Finance and Quant. Analysis, 15, 757-768
    --
*--------
  • Normality assumption
  • =======================Rich Ulrich, 28 Jan 1997==========ssc
    From: wpilib+@pitt.edu (Richard F Ulrich)
    Subject: Re: Alternatives to Logistic Regression
    Message-ID: <5cl6e6$s4g@usenet.srv.cis.pitt.edu>

    Helberg, Clay (chelberg@SPSS.COM) wrote:
    : Diana Kornbrot wrote:
    : >what do you want the alternative to do for you?
    : >
    : >note that logistic regression is not distribution free
    : >it assumes a LOGISTIC DISTRIBUTION underlying the relation between
    : >probability of the dependent variable and parameters in the explanatory
    : >variable(s)

    : Well, yes, but this is a *much* less restrictive assumption than the
    : distributional assumptions required for discriminant analysis, e.g.,
    : which requires MV normality among the predictors and homogeneity of VCV
    : matrices *as well as* a linear link function (or functions, for
    : polychotomous DA).

    -- Clay: The text you recommended (I think) by Tabachnick and Fidell points out that, since Logistic regression constructs a linear equation in the predictors, you are better off with multivariate normality, or something close to it, even in the Logistic case.

    I can understand how a single variable can be modeled usefully, logistically, taking advantage of non-linearity to incorporate simple skewness... for one variable. But you still have to make a linear combination and create a score, and then (perhaps) draw a line to see how well you have done.

    In theory, you may feel comfortable looking at TESTS on Logistic when your data don't satisfy you for Discriminant function. But it still seems to me that there should be little difference in how well they work, in the ordinary case where the fit is far less than perfect. I would be interested in hearing of simulation results, or examples and counter-examples.
*--------
  • Extreme splits in criterion.
  • =======================Rich Ulrich, 29 Jun 1997==========spss
    Subject: Logistic Regression -Reply
    Message-ID: <5rlkk2$3gq@usenet.srv.cis.pitt.edu>

    Dale Glaser (dale.glaser@SHARP.COM) wrote:
    : >>> Gordon Behie <gbehie@UVIC.CA> 07/23/97 03:02pm >>>
    : When using logistic regression, is it problematic to have a dichotomous
    : dependent variable with the following split: Category 1 = 10 (10%) and
    : Category 2 =90 (90%). If not, why? Is there a point where the split must
    : be of a certain distinction (i.e. a 75-25 split).<<<<<

    : Gordon...........
    : I am also currently encountering this problem with a binary outcome
    : variable: Respiratory Distress Syndrome: 0-no (95%); 1-yes
    : (5%).......even though markedly unequal sample sizes for the outcome
    : variable is not as problematic for logistic regression as it would be for
    : discriminant analyses Fisher and Belle (1993) Biostatistics: A
    : methodology for the Health Sciences; John Wiley & Sons state that "if the
    : event of interest is rare it may be difficult to generate enough information
    : to make the prediction of its occurrence reasonably high. Particularly in
    : epidemiological screening procedures, if the prevalence (prior probability)
    : of the disease is rare the predictive value of a positive test (posterior
    : probability) may not be high" (p. 659).

    - The quote given ABOVE from Fisher and Belle does not back up the assertion about markedly unequal sample sizes. Maybe they do say something like that somewhere, but I just browsed in their text, and I am not impressed with their own insights. This was my first look at Fisher & Belle - they give some nice references and mention several facets; but ultimately I would rather read their references, than accept an assertion from them about, say, unequal sizes. [I judge this partly by the fact that I REALLY don't like what they say about step-wise analyses in an earlier chapter.]
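
The arithmetic behind the quoted statement, that a rare event gives a low predictive value of a positive test, is Bayes' rule and is separate from the question of unequal sample sizes. The sketch below works it through. The 5% prevalence echoes the split mentioned in the question; the sensitivity and specificity of 0.90 are assumed values for illustration and do not come from the post or from the cited text.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Posterior probability of the event given a positive classification."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

# Same (assumed) classifier quality, two different prior probabilities.
for prevalence in (0.05, 0.50):
    ppv = positive_predictive_value(prevalence, sensitivity=0.90, specificity=0.90)
    print(f"prevalence = {prevalence:.2f}  ->  predictive value of a positive = {ppv:.3f}")
```

Under these assumptions a positive result is right only about a third of the time at 5% prevalence, against 90% of the time with an even split, although the classifier itself is unchanged.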

    : One advantage of discriminant analyses (DA), in the case of continuous
    : level predictors, is being able to set the prior probabilities, an option
    : that is not provided for the SPSS version of logistic regression (equal
    : probabilities is the default).....

    - "prior probabilities" is, in particular, an advantage of "SPSS-DA" over "SPSS-logistic" (and, I think, SAS and BMDP, as well... ) rather than a necessary difference between DA and logistic - which is not always kept clearly in mind, since the computer packages do tend to get identified with the Procedures. But it *should* be kept in mind.

    : ...of course, with DA more rigorous
    : assumptions (i.e., multivariate normality, homogeneity of
    : variance-covariance matrices) must be met........<<snip, to end>>

    - You have to write a formula, in either case, with scores that distinguish one end from the other, so (in my opinion) the assumptions are not MUCH more rigorous for one than for the other.

    Here is something to think about: If the R-squared or degree of prediction is not rather HIGH, then there will be very, very little difference between the results of logistic and DA. But if the prediction is nearly perfect, then the ML-logistic has the hazard of hitting on a pathological solution - which is a risk you do not take with Least Squares.

    - That is the way I characterized the example that W. Sarle provided in another Usenet group, a few months ago, and neither he nor anyone else came back with any response. That is, the number of 'correctly classified cases' happens to be a useful piece of information which can be derived and described, for either DA or logistic. Most of the time, it is only *indirectly* related to the numeric criteria for *either* analysis. However, if there is SOME formula from the variables which gives 100% correct, at some split, then the Maximum Likelihood solution used in SPSS logistic will insist on that formula. I noted for the example provided, that any LEAST SQUARES solution to the logistic, which could be constructed conventionally, did not match the Logistic equation.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
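The "pathological solution" mentioned in the last note above can be seen on a toy data set in which one predictor classifies every case correctly. This is a sketch under stated assumptions: four made-up cases, a no-intercept model, and a hand-written log-likelihood rather than any package routine. It shows only that the likelihood keeps improving as the slope grows, so the Maximum Likelihood criterion has no finite best value to settle on.

```python
import math

def log_likelihood(slope, xs, ys):
    """Log-likelihood of a no-intercept logistic model with P(y=1) = sigmoid(slope * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        # z is the linear score, signed so that large z means "observed outcome fits".
        z = slope * x if y == 1 else -slope * x
        # log(sigmoid(z)) written as -log(1 + exp(-z)).
        total += -math.log1p(math.exp(-z))
    return total

# Hypothetical, perfectly separated data: every negative x is a 0, every positive x is a 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

for slope in (1.0, 10.0, 100.0):
    print(f"slope = {slope:6.1f}   log-likelihood = {log_likelihood(slope, xs, ys):.3e}")
```

Each larger slope moves the log-likelihood closer to zero, its upper bound, which is why an iterative ML fit on such data keeps inflating the coefficient until it stops on a tolerance or an iteration limit.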
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html