FAQ - Chap 5, transformations
****************** ranks, skewness, transformations ******************
  • ... compute K-S test of normality?
  • << note: D'Agostino calls this test obsolete >>
=======================Ismail Parsa, 2 Jun 1995========ssc
Message-ID: <9506021613.AA04568@dragon.epsilon.com>
From: Ismail Parsa <sip@EPSILON.COM>
Subject: Re: Wanted: Source Code or Algorithm for test of normal test

Dirk Melcher <dmelcher@ECHNATON.USF.UNI-OSNABRUECK.DE> writes:
|> I would like to implement a Kolmogorov-Smirnov Test for testing of normal
|> distribution in ANSI C. The K-S Test is quite often described in literature
|> and actually I have already implemented a version of the test.
|> Unfortunately I have nowhere found an algorithm to calculate the number of
|> classes one has to construct to calculate the test variable D. Does anybody
|> have a source code (in C, Fortran, Pascal etc.) for a test of normal
|> distribution or does anybody know how the classes are calculated
|> (literature?) How do commercial programs like SPSS and SAS do this?

This is from SPSS statistical algorithms (2nd edition):

    D = ( -b - sqrt( b^2 - 4ac ) ) / ( 2a )

where, if sample size (n) <= 100,

    a = -7.01256 * ( n + 2.78019 )
    b = 2.99587 * sqrt( n + 2.78019 )
    c = 2.1804661 + ( 0.974598 / sqrt( n ) ) + ( 1.67997 / n )

If sample size (n) > 100,

    a = -7.9028898 * n ^ 0.98
    b = 3.1803686 * n ^ 0.49
    c = 2.2947256

The references are: Lilliefors (1967), JASA 62:399-402 and
Dallal & Wilkinson (1986), American Statistician 40, No. 4.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
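The quadratic quoted from the SPSS algorithms manual can be evaluated directly. The sketch below is only a transcription of the posted coefficients into Python; the function name is hypothetical. The constant 2.1804661 is consistent with solving the Dallal & Wilkinson p-value approximation at p = 0.10, so the D returned is presumably the 10% critical value, but the post does not say so and that reading is an assumption.

```python
import math


def lilliefors_critical_d(n: int) -> float:
    """Evaluate D = (-b - sqrt(b^2 - 4ac)) / (2a) with the posted coefficients.

    Assumption: this D is the critical value at p = 0.10 (not stated in the post).
    """
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    if n <= 100:
        a = -7.01256 * (n + 2.78019)
        b = 2.99587 * math.sqrt(n + 2.78019)
        c = 2.1804661 + 0.974598 / math.sqrt(n) + 1.67997 / n
    else:
        a = -7.9028898 * n ** 0.98
        b = 3.1803686 * n ** 0.49
        c = 2.2947256
    # a is negative, so the "minus" root is the positive solution.
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


if __name__ == "__main__":
    for size in (10, 20, 50, 400):
        print(size, round(lilliefors_critical_d(size), 4))
```

The value shrinks as n grows, as a critical distance should.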
  • (Q about the KS test for normality)
  • ============================Rich Ulrich, 06 Sep 1996=====(spss)
From: wpilib+@pitt.edu (Richard F Ulrich)
Subject: Re: Normality tests
Message-ID: <50pmk0$ekl@usenet.srv.cis.pitt.edu>

Shlomo-Zalman Jessel (mss@pluto.mscc.huji.ac.il) wrote:
: I ran a test of normality with plots and statistics, for my small
: samples (n=11) of interval and ordinal data. SPSS carried out two
: tests. The Kolmogorov-Smirnov statistic with Lilliefors significance
: level typically gave significance values of <0.01 . The Shapiro-Wilk
: statistic typically gave values > 0.2 . My questions: Is my data
: normal or non-normal based on this? I mean, one test indicates

-- Ideally, the two tests would differ only because they are sensitive
to somewhat different aspects. Consider, for instance, looking at
normality in terms of MOMENTS: a test of skewness is possible, or a
test of kurtosis, which are potentially independent phenomena.

-- In practice, the S-W test, if I remember right, was devised with
tables provided for evaluation for Normality for small samples; whereas,
the K-S/L test is a more general test whose properties I would trust
with large samples. But I haven't heard about its robustness for small
samples, or how well its p-levels are evaluated for small samples.

-- For a sample so small as 11 to show up as non-normal by *any* test is
a pretty good indicator that it is odd. IF your data set includes a
bunch of ties, THAT is *non-normal* in a way that could offend both
tests named, without bad consequences for (say) ANOVA, if you just
ignored it. D'Agostino, _Goodness of Fit_, recommends against K-S as an
obsolete test. << paragraph revised >>

I'm curious to see what data illustrates the difference between those
two tests, especially if it is something more subtle than ties.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
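The remark about ties can be illustrated with a small sketch. It computes the K-S distance D between a sample and a normal distribution that uses the sample's own mean and S.D. (the Lilliefors setting), once for 11 smoothly spaced values and once for the same values rounded to whole numbers. The data and the function name are made up for illustration; nothing here reproduces the poster's data or the SPSS procedure.

```python
from statistics import NormalDist, mean, stdev


def ks_normal_d(data):
    """Max distance between the empirical CDF and a normal CDF
    fitted with the sample mean and S.D. (estimated parameters)."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # Compare the fitted CDF with the step function just after and just before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical sample: 11 evenly spaced normal quantiles (about as normal as 11 points get).
smooth = [NormalDist().inv_cdf((i - 0.5) / 11) for i in range(1, 12)]
# The same values recorded on a coarse scale, which creates many ties.
tied = [float(round(x)) for x in smooth]

if __name__ == "__main__":
    print("D, smooth sample:", round(ks_normal_d(smooth), 3))
    print("D, tied sample:  ", round(ks_normal_d(tied), 3))
```

Coarse recording alone pushes D up several-fold here, even though the underlying values are the same.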
  • ... compare skew between samples?
  • =====================Diana Kornbrot, 18 Oct 1995========ssc
Message-ID: <Pine.SUN.3.91.951018115230.19754C-100000@altair.herts.ac.uk>
From: Diana Kornbrot <D.E.Kornbrot@HERTS.AC.UK>
Subject: Re: 2-sample tests of skewness?

<< context: several recent posts which were nonplussed by a question of
'how do I compare skew in two samples...' >>

I am clearly a freak in that I routinely examine skew and kurtosis.
This is because I work on human reaction times, where skew and kurtosis
can give info on accumulation of information over and above that given
by mean and sd.

The formulae for standard errors of skew and kurtosis may be found in:
(1) Stuart, A. & Ord, J. K. Kendall's Advanced Theory of Statistics.
London: Charles Griffin and Co., 1987.

However, the s.e. depends on the form of the skewed distribution. I have
formulae for gamma, inverse Gaussian, and Poisson-Erlang as appendices
to a theoretical paper (also brief tables). If anyone is interested
please let me know.
--
Dr. Diana Kornbrot
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
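Since the standard error of skewness depends on the form of the distribution, one distribution-free stand-in is a bootstrap. The sketch below is not Kornbrot's formulae and not the Stuart & Ord results; it is a swapped-in resampling approach. The function names, the example data, and the number of resamples are all assumptions made for illustration.

```python
import math
import random


def skewness(data):
    """Moment coefficient of skewness, m3 / m2^1.5 (returns 0.0 if all values are equal)."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    if m2 == 0.0:
        return 0.0
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


def bootstrap_skew_difference(a, b, n_boot=1000, seed=0):
    """Observed skew(a) - skew(b) and a bootstrap standard error of that difference."""
    rng = random.Random(seed)
    observed = skewness(a) - skewness(b)
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(a, k=len(a))  # resample each group with replacement
        resample_b = rng.choices(b, k=len(b))
        diffs.append(skewness(resample_a) - skewness(resample_b))
    centre = sum(diffs) / n_boot
    se = math.sqrt(sum((d - centre) ** 2 for d in diffs) / (n_boot - 1))
    return observed, se


if __name__ == "__main__":
    # Hypothetical data: one right-skewed group, one symmetric group.
    skewed = [-math.log(1 - (i - 0.5) / 40) for i in range(1, 41)]
    symmetric = [float(i) for i in range(40)]
    diff, se = bootstrap_skew_difference(skewed, symmetric)
    print("difference in skew:", round(diff, 2), "bootstrap s.e.:", round(se, 2))
```

With reaction-time-sized samples the bootstrap s.e. can be compared with the observed difference as a rough guide; for very small samples it should be read with caution.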
  • ... do T-scoring, z scoring?
  • =====================Rich Ulrich, 11 Dec 1995========ssc
From: wpilib+@pitt.edu (Richard F Ulrich)
Subject: Re: T scores vs Z scores?
Message-ID: <4ahvd1$47v@usenet.srv.cis.pitt.edu>

Steve Marson, marson@pembvax1.pembroke.edu wrote:
: Howdy,
: Got a little problem in remembering the difference between T and Z
: distributions. When transforming raw scores to Z scores one assumes
: that one has the true population mean and standard deviation. When
: transforming raw scores to T scores, one need only have the sample
: mean and standard deviation. Is that correct?

The answer you have already received makes me wonder how many different
conventions for T/Z or t/z there may be....

- In my own reading, z scores (small-letter z) are what people have when
they standardize to Normal(0,1) by using the available means and S.D.
[what is a "true population mean", anyway?].

T-scores (capital-letter T) are what people have when they present data
as Normal(50,10) - which is used to document "standard outcomes" from
rating scales by making use of the inverse Normal: That is, if a raw
scale-score corresponds to the median, it is a T of 50; if it
corresponds to about the 97.7th centile, then it is at plus two standard
deviations, or a T of 70. Such documentation is USUALLY presented to
illustrate the centiles of a large normative sample; otherwise ("the
cheap version") it is computed directly from the mean and S.D.

I might use T-scores to show how data compare with previous work; or to
transform scores, before any other analyses, if the raw scores are
really badly skewed. If the only need is to equalize the ranges for the
data in hand, I would prefer z scores.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
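The two routes to a T-score described above can be sketched as follows, assuming the Normal(50,10) convention from the post. The function names are hypothetical. The centile route takes a proportion from a normative sample and passes it through the inverse Normal; the "cheap version" rescales z scores computed from the sample mean and S.D.

```python
from statistics import NormalDist, mean, stdev


def z_scores(raw):
    """Standardize with the available (sample) mean and S.D."""
    m, s = mean(raw), stdev(raw)
    return [(x - m) / s for x in raw]


def t_scores_cheap(raw):
    """'Cheap version': T = 50 + 10 z, computed directly from mean and S.D."""
    return [50.0 + 10.0 * z for z in z_scores(raw)]


def t_score_from_centile(p):
    """Normative version: map a centile (as a proportion, 0 < p < 1)
    through the inverse Normal, then rescale to mean 50, S.D. 10."""
    return 50.0 + 10.0 * NormalDist().inv_cdf(p)


if __name__ == "__main__":
    for p in (0.50, 0.84, 0.95, 0.977):
        print(p, round(t_score_from_centile(p), 1))
```

The centile route reshapes a badly skewed raw scale toward normality, while the cheap version only shifts and rescales it, which is why the two can disagree for skewed scores.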
  • Why does the Mann-Whitney require "identically shaped" distributions?
  • =======================Rich Ulrich, 25 May 1995==========ssc
Subject: Re: Mann-Whitney Test Question
Message-ID: <3q2rri$21e@usenet.srv.cis.pitt.edu>

Mr. N.W.A. Marsh (jw34@LIVERPOOL.AC.UK) wrote, and used the useful
phrase, `ordinal precedence':
: Also, I wish to deny Mike Lacy's assertion that the M-W-U test is a test
: of shift between two identically-shaped distributions. The M-W
: statistic will do this job with some degree of validity, but its manifest
: character (look at how the M-W-U is computed) is that it is a test of the
: degree of ordinal precedence of scores in one group over those in another -
: which is a broader definition.

So, with two distributions of opposite skew, when you look at the rank
order of the cases, you can go from smallest to largest, looking at
  a) the long tail of group B
  b) the bulk of cases at the PEAK of group A
  c) <some mixture of both A, B, with both medians>
  d) the bulk of cases at the PEAK of group B
  e) the long tail of group A
So, within each HALF of the distribution, group B cases occur first,
even though the medians are the same. So, the M-W test could reject,
illustrating why the statement about `identically shaped distributions'
is essential to the validity of the test, if it is to be a test,
effectively, of a difference in `medians'.

: The statement about identically-shaped
: distributions, and its denial, are perennials on stat-l and elsewhere where
: statistical procedures are debated. I wonder where the idea about the M-W-U
: being only for comparing identically-shaped distributions comes from?

< see above. for robust interpretation of a difference... >
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
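The opposite-skew argument can be checked numerically. The sketch below builds a right-skewed group A from exponential quantiles and a group B that is its mirror image about A's median, so the two medians coincide. It then computes the Mann-Whitney U for A and the proportion of (A, B) pairs in which the A score is larger. The construction is an illustration chosen here, not data from the thread.

```python
import math
from statistics import median


def mann_whitney_u(a, b):
    """U for group a: count of (a, b) pairs with a > b, ties counted as 1/2."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


n = 200
# Group A: right-skewed (long upper tail), evenly spaced exponential quantiles.
group_a = [-math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]
shared_median = median(group_a)
# Group B: mirror image of A about the shared median (long lower tail).
group_b = [2 * shared_median - x for x in group_a]

u_a = mann_whitney_u(group_a, group_b)
prob_a_exceeds_b = u_a / (len(group_a) * len(group_b))

if __name__ == "__main__":
    print("median A:", round(median(group_a), 4), "median B:", round(median(group_b), 4))
    print("P(A > B) estimated from U:", round(prob_a_exceeds_b, 3))
```

The proportion comes out clearly above one half although the medians are identical, so with enough cases the test would reject. That is the sense in which it measures `ordinal precedence' and not a difference in medians unless the shapes match.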
  • ANOVA and Unequal variances (1 of 2).
  • =====================Rich Ulrich, 09 Aug 1995========ssc
From: wpilib+@pitt.edu (Richard F Ulrich)
Subject: Re: Response to Reviewer's Critique
Message-ID: <40b6mt$717@usenet.srv.cis.pitt.edu>

Joseph V. Martin (jomartin@crab.rutgers.edu) wrote:
: I recently submitted an article for publication in a journal. The
: work concerned measurements of binding parameters in two brain areas
: from mice killed at different stages of the estrous cycle. One binding
: parameter was found to vary significantly with the estrous cycle by an
: ANOVA. The other parameter (Kd) did not. A reviewer had the
: following criticism "ANOVA is only a valid statistical approach when
: the groups have equal variance, and judging from the standard error
: values, this is not the case (1.56 vs 0.24). Thus the conclusion that
: the Kd did not change with the stage of estrous is not yet justified."
: How can I address this criticism? . . .

SE's of 1.56 and 0.24 are an ENORMOUS difference in variability,
assuming that those `standard errors' are based on equal Ns. ( -- And
if they are NOT, then neither you nor your reviewer should be allowed to
publish without important statistical advice.)

The reviewer, I think, may not have expressed the point as sharply as
this, but, if I may try to rephrase: `What are you DOING, trying to
compare numbers that have such very different Variability? which looks
potentially more important than the variability that may exist between
the means... It is simply poor model building, to work from those
numbers'.

Address that variability: Outliers? Does it vary with the mean? Very
small N? It looks like it could be an important difference, even if the
MEANS are NOT different.

By the way, it is not totally clear to me what was being compared, but I
think it was: brain chemistry from a specific location, as measured in
animals that were sacrificed at different points in their estrous
cycles.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
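To put a number on "ENORMOUS": a standard error is S.D. / sqrt(N), so with equal Ns the ratio of variances is the squared ratio of the SEs. The sketch below does that arithmetic; the group size of 8 is a made-up placeholder (it cancels when the Ns are equal), and the function name is hypothetical.

```python
def variance_ratio_from_se(se_1, se_2, n_1, n_2):
    """Ratio of group variances implied by two standard errors of the mean.

    Uses variance = SE^2 * N for each group.
    """
    variance_1 = se_1 ** 2 * n_1
    variance_2 = se_2 ** 2 * n_2
    return variance_1 / variance_2


if __name__ == "__main__":
    # N = 8 per group is an assumption; any equal N gives the same ratio.
    ratio = variance_ratio_from_se(1.56, 0.24, 8, 8)
    print("S.D. ratio:", round(ratio ** 0.5, 2), "variance ratio:", round(ratio, 2))
```

A 6.5-fold ratio of S.D.s is a variance ratio of about 42. If the Ns are not equal, the ratio changes in proportion to n_1 / n_2, which is why the equal-N caveat matters.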
  • ANOVA and Unequal variances (2 of 2).
  • =======================Phil Gallagher, 2 Nov 1994==========ssc
Message-ID: <STAT-L%94110220252648@VM1.MCGILL.CA>
From: "Philip Gallagher,(919)929-6010" <UPHILG@UNCMVS.OIT.UNC.EDU>
Subject: Re: variance - a different view

> > What can you do if you want to do an Analysis of Variance,
> > but not all variances are equal?

When I have a problem like this I get peace of mind by stepping back
from the textbook questions such as " ... are the variances equal ..."
and try to think again about what I am really asking/wanting to find
out.

We memorized in STAT 1 that an ANOVA tests the equality of the cell
means but in order to have a valid test the cell variances must be
(nearly) equal. It was years later that I realized that what that is
saying (in the most fortunate happenstance) is that if you were to
overlay the distributions (standardized to equal Ns) of the k cells in a
single plot, if they all have (nearly) the same location and the same
dispersion, the k distributions will lie one upon another. Or, that the
distributions of the variable in each of the k cells are (nearly) equal.
So ANOVA is really a test not just of means equality, but of equality of
distributions.

(My soul is absolutely convinced that some black-hearted genius could
readily conjure up data with equal means and variances but quite
different shapes - (j) skewed left and (k-j) skewed right, for example,
but, to my mind that is a pathological situation for ANOVA; one might be
able to go away proclaiming equality of means but I would hate to have
people thinking that the k cells were alike/similar, wouldn't you?)

Anyway, now that I have the mental picture of k distributions piled up
on each other, to me the question is no longer whether I am justified in
doing an ANOVA, but, rather, are the k distributions (nearly) the same
in both location and shape?

It only takes one of those k distributions to have a different location,
dispersion, or shape for me to want to say that I find it hard to
believe they were all drawn from the same parent distribution. So I'm
driven to abandoning ANOVA as soon as I suspect that the distributions
are not equal.

Typically, the first thing I do is overlay all the empirical
distributions and perform the IOT test (IOT - Inter Ocular Trauma - it
hits you right between the eyes). If it happens that at least one of the
empirical distributions is grossly different from at least some of the
others, then I am well along the path to being finished. Who cares about
ANOVA? One (or more) of these cells is (are) different from the rest. Of
course I may have a frightful time with p-values if I have to compute
the D-statistic for every possible cell-pair, etc., but that's tough.

It's also why I say that a scientist-statistician is almost never in a
hypothesis testing situation - once you know enough so that you may
legitimately design an experiment for an honest single a priori
hypothesis test - the shapes of the distributions are equal, etc. - you
know so much that few are willing to invest the resources just to get a
single 100% defensible p-value.

And, to try to answer the original question: If you are convinced that
at least one of the k cell variances is different from the rest, STOP!
No need to do the ANOVA - you are already convinced that the cells were
not drawn from the same population.

Now, if it were true that the ONLY thing you were interested in was
whether the means differ and you really do not care about the shapes of
the distributions, well, that's a different game, and I'd go haul out my
copy of Bradley's book on nonparametric statistics to find the locations
test that best suited my data. If you can find one that really is
strictly a test of location.

I'm always fearful that if I study a supposed test of location alone,
that I will find that there's still a component of shape/dispersion in
it, and that if I haven't found that component, it's because I didn't
work hard enough.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
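The "D-statistic for every possible cell-pair" can be sketched as below, assuming the D meant is the two-sample Kolmogorov-Smirnov distance (the largest vertical gap between two empirical distribution functions). The post does not name the statistic, so that is an assumption, and the cell labels and data are hypothetical. No p-values are attempted, in keeping with the warning that they are troublesome across many pairs.

```python
from itertools import combinations


def ks_two_sample_d(x, y):
    """Largest absolute gap between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Step both empirical CDFs past every observation equal to v (handles ties).
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def pairwise_d(cells):
    """D for every pair of cells; `cells` maps a label to a list of values."""
    return {
        (a, b): ks_two_sample_d(cells[a], cells[b])
        for a, b in combinations(sorted(cells), 2)
    }


if __name__ == "__main__":
    # Hypothetical cells: two alike, one with a much wider spread.
    cells = {
        "cell1": [4.8, 5.1, 5.0, 4.9, 5.2, 5.3],
        "cell2": [5.0, 4.7, 5.1, 5.2, 4.9, 5.4],
        "cell3": [1.0, 9.5, 3.2, 7.8, 0.4, 8.9],
    }
    for pair, d in pairwise_d(cells).items():
        print(pair, round(d, 2))
```

A table like this is a numeric companion to the overlay plot: the pairs involving the odd cell stand out, which is the IOT test in figures.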
  • Transformations
  • NB: What seemingly fouls up the application of simple things like
taking the log or square root is often the presence of an odd score or
two, i.e., an outlier that will only become *further* outlying after
what should have been a suitable transformation. If the rest of the data
"look good" after some transform, using the criterion of being
symmetrical around the median at most of the percentiles, then the
outlier should PROBABLY be dealt with as an "outlier" (see previous
section).

=========================Ian Spence, 10 Jun 1996======ssm
From: Ian Spence <spence@psych.utoronto.ca>
Subject: Re: ANOVA Problem
Message-ID: <31BC32E9.747E@psych.utoronto.ca>

<< question - what to do when the data are probably distributed as
Poisson, rather than normal >>

Square root is indicated. See, for example,
Rao, C.R. (1965) Linear statistical inference and its applications.
New York: Wiley, p.357
There is a second edition (I think) but I cannot give you the page
number.

Kempthorne also discusses transformations in his classic text.

Kruskal's article on transformations in the International Encyclopedia
for the Social Sciences (and the abstracted International Encyclopedia
of Statistics) is still probably the single best short treatment of data
transformations.

=====================Richard Goldstein, 11 Mar 1996========sse
From: richgold@netcom.com (Richard Goldstein)
Subject: Re: References for Transformations in Data Analysis
Message-ID: <richgoldDo3tso.H0q@netcom.com>

Barry DeCicco (bdecicco@sunm4048as.sph.umich.edu) wrote:
: In article <4hukmv$ho3@earth.njcc.com>, bgunter@pluto.njcc.com (Bert Gunter) writes:
<<snip>>
: |> Adding to Paul's list, a book by Tony Atkinson whose title is
: |> approximately "Transformations and ... in Regression" (correction
: |> welcome). Box, Hunter, and Hunter's "Statistics for Experimenters" and
: |> Box and Draper's "Response Surface Methods" also have good discussions
: |> on transformations in them.
: |>
: |> Bert Gunter
: |> bgunter@pluto.njcc.com
: |> (Statistical Consultant)
: |>
: I think that it is 'Transformations and Weighting in Regression'.
: If it is the same book I'm thinking of, half is on the use of weighted
: regression, and the other half is on the use of transformations.

No, this is a different book, by Carroll and Ruppert; the Atkinson book
is _Plots, Transformations and Regression_.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
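Ian Spence's suggestion of a square root for Poisson-like counts can be checked by simulation: the variance of a Poisson count equals its mean, while the variance of its square root stays near 1/4 whatever the mean. The sketch below is only an illustration; the means, sample size, and seed are arbitrary choices, and the sampler is Knuth's multiplication method written out so the example needs nothing beyond the standard library.

```python
import math
import random
from statistics import variance


def poisson_sample(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method (fine for modest lam)."""
    limit = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def variance_raw_and_root(lam, n=5000, seed=1996):
    """Sample variance of the counts and of their square roots."""
    rng = random.Random(seed)
    counts = [poisson_sample(lam, rng) for _ in range(n)]
    roots = [math.sqrt(c) for c in counts]
    return variance(counts), variance(roots)


if __name__ == "__main__":
    for lam in (4, 16, 64):
        raw, root = variance_raw_and_root(lam)
        print("mean", lam, "var(count)", round(raw, 2), "var(sqrt)", round(root, 3))
```

For ANOVA this means that cells with very different mean counts end up with roughly equal variances on the square-root scale. The stabilization is less exact for small means.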
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html