FAQ - Chap 5, transformations
****************** ranks, skewness, transformations *********
... compute K-S test of normality?
<< note: D'Agostino calls this test obsolete >>
=======================Ismail Parsa, 2 Jun 1995========ssc
Message-ID: <9506021613.AA04568@dragon.epsilon.com>
From: Ismail Parsa
Subject: Re: Wanted: Source Code or Algorithm for test of normal test
Dirk Melcher writes:
|> I would like to implement a Kolmogorov-Smirnov Test for testing of normal
|> distribution in ANSI C. The K-S Test is quite often described in literature
|> and actually I have already implemented a version of the test.
|> Unfortunately I have nowhere found an algorithm to calculate the number of
|> classes one has to construct to calculate the test variable D. Does anybody
|> have a source code
|> (in C, Fortran, Pascal etc.) for a test of normal distribution or does
|> anybody know how the classes are calculated (literature?) How do
|> commercial programs like SPSS and SAS do this?
This is from SPSS statistical algorithms (2nd edition):
D = ( -b - sqrt ( b^2 - 4ac ) ) / ( 2a )
where if sample size (n) <= 100,
a = -7.01256 * ( n + 2.78019 )
b = 2.99587 * sqrt ( ( n + 2.78019 ) )
c = 2.1804661 + ( 0.974598 / sqrt ( n ) ) + ( 1.67997 / n )
If sample size (n) > 100,
a = -7.9028898 * n ^ 0.98
b = 3.1803686 * n ^ 0.49
c = 2.2947256
The references are: Lilliefors (1967), JASA 62:399-402 and
Dallal & Wilkinson (1986), American Statistician 40, No. 4.
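
A minimal Python sketch of the quoted formula, offered as an illustration rather than as SPSS's actual code. It also shows that the K-S statistic itself needs no classes: D is the largest gap between the empirical CDF of the sorted data and the normal CDF fitted with the sample mean and SD. Reading the result of the quadratic as the Lilliefors critical value of D at the 0.10 level is my assumption (the constant 2.1804661 matches ln(10) minus the intercept of the Dallal & Wilkinson approximation); the post does not say so. The function names are made up for this example.

```python
import math

def lilliefors_d_threshold(n):
    """Evaluate the quoted formula D = (-b - sqrt(b^2 - 4ac)) / (2a).

    Assumption: this D is the approximate 0.10-level critical value.
    """
    if n <= 100:
        a = -7.01256 * (n + 2.78019)
        b = 2.99587 * math.sqrt(n + 2.78019)
        c = 2.1804661 + 0.974598 / math.sqrt(n) + 1.67997 / n
    else:
        a = -7.9028898 * n ** 0.98
        b = 3.1803686 * n ** 0.49
        c = 2.2947256
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def ks_normal_statistic(xs):
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    d = 0.0
    for i, x in enumerate(sorted(xs), start=1):
        cdf = 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d
```

Under that reading, an observed `ks_normal_statistic(data)` larger than `lilliefors_d_threshold(len(data))` would correspond to p < 0.10; for n = 11 the threshold comes out at roughly 0.23.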
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
(Q about the KS test for normality)
============================Rich Ulrich, 06 Sep 1996=====(spss)
From: wpilib+@pitt.edu (Richard F Ulrich)
Subject: Re: Normality tests
Message-ID: <50pmk0$ekl@usenet.srv.cis.pitt.edu>
Shlomo-Zalman Jessel (mss@pluto.mscc.huji.ac.il) wrote:
: I ran a test of normality with plots and statistics, for my small
: samples (n=11) of interval and ordinal data. SPSS carried out two
: tests. The Kolmogorov-Smirnov statistic with Lilliefors significance
: level typically gave significance values of <0.01 . The Shapiro-Wilk
: statistic typically gave values > 0.2 . My questions: Is my data
: normal or non-normal based on this? I mean, one test indicates
-- Ideally, the two tests would differ only because they are
sensitive to somewhat different aspects. Consider, for instance,
looking at normality in terms of MOMENTS: a test of skewness is
possible, or a test of kurtosis, which are potentially independent
phenomena.
-- In practice, the S-W test, if I remember right, was devised
with tables provided for evaluation for Normality for small samples;
whereas, the K-S/L test is a more general test whose properties
I would trust with large samples. But I haven't heard about its
robustness for small samples, or how well its p-levels are evaluated
for small samples.
-- For a sample so small as 11 to show up as non-normal by *any*
test is a pretty good indicator that it is odd. IF your data set
includes a bunch of ties, THAT is *non-normal* in a way that
could offend both tests named, without bad consequences for (say)
ANOVA, if you just ignored it. D'Agostino _Goodness of Fit_
recommends against K-S as an obsolete test. << paragraph revised >>
I'm curious to see what data illustrates the difference between
those two tests, especially if it is something more subtle than
ties.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
... compare skew between samples?
=====================Diana Kornbrot, 18 Oct 1995========ssc
Message-ID:
From: Diana Kornbrot
Subject: Re: 2-sample tests of skewness?
<< context: several recent posts which were nonplussed by a
question of 'how do I compare skew in two samples...' >>
i am clearly a freak in that I routinely examine skew and kurtosis.
this is because i work on human reaction times, where skew and kurtosis can
give info on accumulation of information over and above that given by mean
and sd.
the formula for standard errors of skew and kurtosis may be found in:
(1) Stuart, A. & Ord, J. K. Kendall's Advanced Theory of Statistics.
London: Charles Griffin and Co., 1987.
however s.e. depends on form of skewed distribution. i have formulae for
gamma, inverse gauss, and poisson-erlang as appendices to a theoretical
paper. (also brief tables). if anyone is interested please let me know.
--
Dr. Diana Kornbrot
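
As a rough illustration of what a two-sample comparison of skew could look like, here is a Python sketch that uses the standard error of skewness derived for NORMAL data. That choice is an assumption made only to keep the example short; as the post says, the s.e. depends on the form of the distribution, so for reaction times this normal-theory value would be a crude first look at best. The function names are hypothetical.

```python
import math

def sample_skewness(xs):
    """Adjusted Fisher-Pearson skewness coefficient (often written G1)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    m3 = sum((x - mean) ** 3 for x in xs)
    return n * m3 / ((n - 1) * (n - 2) * s ** 3)

def se_skewness_normal(n):
    """Standard error of G1 when the data really are normal (assumption)."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def skew_difference_z(xs, ys):
    """Difference in skewness over its normal-theory standard error."""
    diff = sample_skewness(xs) - sample_skewness(ys)
    se = math.sqrt(se_skewness_normal(len(xs)) ** 2
                   + se_skewness_normal(len(ys)) ** 2)
    return diff / se
```

A distribution-specific s.e. (gamma, inverse Gaussian, and so on) would replace `se_skewness_normal` in the last function.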
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
... do T-scoring, z scoring?
=====================Rich Ulrich, 11 Dec 1995========ssc
From: wpilib+@pitt.edu (Richard F Ulrich)
Subject: Re: T scores vs Z scores?
Message-ID: <4ahvd1$47v@usenet.srv.cis.pitt.edu>
Steve Marson, marson@pembvax1.pembroke.edu wrote:
: Howdy,
: Got a little problem in remembering the difference between T and Z distributions.
: When transforming raw scores to Z scores one assumes that one has the
: true population mean and standard deviation. When transforming raw
: scores to T scores, one need only have the sample mean and standard
: deviation. Is that correct?
The answer you have already received makes me wonder how many
different conventions for T/Z or t/z there may be....
- In my own reading, z scores (small-letter z) are what people
have when they standardize to Normal(0,1) by using the available means
and S.D. [what is a "true population mean", anyway?].
T-scores (capital-letter T) are what people have when they present data
as Normal(50,10) - which is used to document "standard outcomes"
from rating scales by making use of the inverse Normal: That is, if a
raw scale-score corresponds to the median, it is a T of 50; if it
corresponds to about the 98th centile, then it is at plus two standard
deviations, or a T of 70. Such documentation is USUALLY presented to illustrate the
centiles of a large normative sample; otherwise ("the cheap version")
it is computed directly from the mean and S.D.
I might use T-scores to show how data compare with previous work;
or to transform scores, before any other analyses, if the raw scores
are really badly skewed. If the only need is to equalize the ranges
for the data in hand, I would prefer z scores.
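
The two conventions described above can be sketched in Python as follows. This is only an illustration: the function names are invented, and the mid-rank definition of a centile within the normative sample is my assumption, since the post does not say how centiles are computed.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Standardize to mean 0, SD 1 using the available mean and SD."""
    return (x - mean) / sd

def t_score_cheap(x, mean, sd):
    """'Cheap version': rescale the z score to mean 50, SD 10."""
    return 50.0 + 10.0 * z_score(x, mean, sd)

def centile_in_norms(x, norms):
    """Proportion of the normative sample below x (ties count one half)."""
    below = sum(1 for v in norms if v < x)
    equal = sum(1 for v in norms if v == x)
    return (below + 0.5 * equal) / len(norms)

def t_score_from_centile(p):
    """Normative version: centile -> inverse Normal -> T scale."""
    return 50.0 + 10.0 * NormalDist().inv_cdf(p)
```

For badly skewed raw scores the two versions disagree, because the centile route forces the T-scores toward a normal shape while the cheap route keeps the original shape.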
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Why does the Mann-Whitney require
"identically shaped" distributions?
=======================Rich Ulrich, 25 May 1995==========ssc
Subject: Re: Mann-Whitney Test Question
Message-ID: <3q2rri$21e@usenet.srv.cis.pitt.edu>
Mr. N.W.A. Marsh (jw34@LIVERPOOL.AC.UK) wrote, and used the useful
phrase, `ordinal precedence':
: Also, I wish to deny Mike Lacy's assertion that the M-W-U test is a test
: of shift between two identically-shaped distributions. The M-W
: statistic will do this job with some degree of validity, but its manifest
: character (look at how the M-W-U is computed) is that it is a test of the
: degree of ordinal precedence of scores in one group over those in another -
: which is a broader definition.
So, with two distributions of opposite skew, when you look at the rank order
of the cases, you can go from smallest to largest, looking at
a) the long tail of group B
b) the bulk of cases at the PEAK of group A
c)
d) the bulk of cases at the PEAK of group B
e) the long tail of group A
So, within each HALF of the distribution, group B cases occur first, even
though the medians are the same. So, the M-W test could reject,
illustrating why the statement about `identically shaped distributions'
is essential to the validity of the test, if it is to be a test,
effectively, of a difference in `medians'.
: The statement about identically-shaped
: distributions, and its denial, are perennials on stat-l and elsewhere where
: statistical procedures are debated. I wonder where the idea about the M-W-U
: being only for comparing identically-shaped distributions comes from?
< see above. for robust interpretation of a difference... >
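
The opposite-skew situation described above can be checked numerically. The Python sketch below is an illustration under my own assumptions, not anything from the thread: group A is built from exponential quantiles (bulk below the median, long tail above), group B is its mirror image around the same median, and the z value uses the usual large-sample normal approximation for U without a tie correction.

```python
import math
from statistics import median

def mann_whitney(a, b):
    """Return (U, P-hat, z), where U counts pairs with a > b (ties = 1/2)."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    n1, n2 = len(a), len(b)
    z = (u - n1 * n2 / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return u, u / (n1 * n2), z

n = 200
# Group A: evenly spaced quantiles of an exponential distribution.
group_a = [-math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]
m = median(group_a)
# Group B: group A reflected around its own median, so the medians agree.
group_b = [2.0 * m - x for x in group_a]

u, p_hat, z = mann_whitney(group_a, group_b)
```

Here `p_hat`, the estimated chance that an A score exceeds a B score, lands noticeably above 0.5 even though the two medians coincide, which is the `ordinal precedence' the test responds to.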
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
ANOVA and Unequal variances (1 of 2).
=====================Rich Ulrich, 09 Aug 1995========ssc
From: wpilib+@pitt.edu (Richard F Ulrich)
Subject: Re: Response to Reviewer's Critique
Message-ID: <40b6mt$717@usenet.srv.cis.pitt.edu>
Joseph V. Martin (jomartin@crab.rutgers.edu) wrote:
: I recently submitted an article for publication in a journal. The
: work concerned measurements of binding parameters in two brain areas
: from mice killed at different stages of the estrous cycle. One binding
: parameter was found to vary significantly with the estrous cycle by an
: ANOVA. The other parameter (Kd) did not. A reviewer had the
: following criticism "ANOVA is only a valid statistical approach when
: the groups have equal variance, and judging from the standard error
: values, this is not the case (1.56 vs 0.24). Thus the conclusion that
: the Kd did not change with the stage of estrous is not yet justified."
: How can I address this criticism? . . .
SE's of 1.56 and 0.24 are an ENORMOUS difference in variability,
assuming that those `standard errors' are based on equal Ns.
( -- And if they are NOT, then neither you nor your reviewer should be
allowed to publish without important statistical advice.) The reviewer,
I think, may not have expressed the point as sharply as this, but,
if I may try to rephrase:
`What are you DOING, trying to compare numbers that have such very
different Variability? which looks potentially more important than
the variability that may exist between the means... It is simply
poor model building, to work from those numbers'.
Address that variability: Outliers? Does it vary with the mean? Very
small N? It looks like it could be an important difference, even if
the MEANS are NOT different.
By the way, it is not totally clear to me what was being compared,
but I think it was: brain chemistry from a specific location, as
measured in animals that were sacrificed at different points in
their estrous cycles.
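
To see why those two standard errors imply such different variability, recall that SE = SD / sqrt(N), so the variance of a group is SE^2 * N. A small Python sketch of the arithmetic follows; the group size of 8 is purely hypothetical, since the post gives no Ns, and the function name is made up.

```python
def variance_ratio_from_se(se_a, se_b, n_a, n_b):
    """Ratio of group variances recovered from standard errors of the mean."""
    var_a = se_a ** 2 * n_a   # SE = SD / sqrt(N)  =>  variance = SE^2 * N
    var_b = se_b ** 2 * n_b
    return var_a / var_b

# Hypothetical equal group sizes; with equal Ns the N cancels out.
ratio = variance_ratio_from_se(1.56, 0.24, 8, 8)
```

With equal Ns the ratio is (1.56 / 0.24)^2, a bit over 42 to 1 in variance (6.5 to 1 in SD), whatever the common N is.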
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
ANOVA and Unequal variances (2 of 2).
=======================Phil Gallagher, 2 Nov 1994==========ssc
Message-ID:
From: "Philip Gallagher,(919)929-6010"
Subject: Re: variance - a different view
> >
> > What can you do if you want to do an Analysis of Variance,
> > but not all variances are equal?
When I have a problem like this I get peace of mind by
stepping back from the textbook questions such as " ... are
the variances equal ..." and try to think again about what
I am really asking/wanting to find out. We memorized in
STAT 1 that an ANOVA tests the equality of the cell means
but in order to have a valid test the cell variances must
be (nearly) equal. It was years later that I realized that
what that is saying (in the most fortunate happenstance) is
that if you were to overlay the distributions (standardized to
equal Ns) of the k cells in a single plot, if they all have
(nearly) the same location and the same dispersion, the
k distributions will lie one upon another. Or, that the
distns of the variable in each of the k cells are (nearly)
equal. So ANOVA is really a test not just of means equality,
but of equality of distributions. (My soul is absolutely
convinced that some black-hearted genius could readily conjure
up data with equal means and variances but quite different
shapes - (j) skewed left and (k-j) skewed right, for example,
but, to my mind that is a pathological situation for ANOVA;
one might be able to go away proclaiming equality of means
but I would hate to have people thinking that the k cells were
alike/similar, wouldn't you?)
Anyway, now that I have the mental picture of k distributions
piled up on each other, to me the question is no longer whether
I am justified in doing an ANOVA, but, rather, are the k
distributions (nearly) the same in both location and shape?
It only takes one of those k distributions to have a different
location, dispersion, or shape for me to want to say that I find
it hard to believe they were all drawn from the same parent
distribution. So I'm driven to abandoning ANOVA as soon as I
suspect that the distributions are not equal. Typically, the first
thing I do is overlay all the empirical distributions and
perform the IOT test (IOT - Inter Ocular Trauma - it hits you
right between the eyes). If it happens that at least one of
the empirical distributions is grossly different from at least
some of the others, then I am well along the path to being
finished. Who cares about ANOVA? One (or more) of these cells
is (are) different from the rest. Of course I may have a frightful
time with p-values if I have to compute the D-statistic for
every possible cell-pair, etc., but that's tough. It's also
why I say that a scientist-statistician is almost never in a
hypothesis testing situation - once you know enough so that
you may legitimately design an experiment for an honest
single a priori hypothesis test - the shapes of the distributions
are equal, etc. - you know so much that few are willing to invest
the resources just to get a single 100% defensible p-value.
And, to try to answer the original question: If you are convinced
that at least one of the k cell variances is different from the
rest, STOP! No need to do the ANOVA - you are already convinced
that the cells were not drawn from the same population.
Now, if it were true that the ONLY thing you were interested
in was whether the means differ and you really do not care
about the shapes of the distributions, well, that's a different
game, and I'd go haul out my copy of Bradley's book on
nonparametric statistics to find the location test that best
suited my data. If you can find one that really is strictly
a test of location. I'm always fearful that if I study a
supposed test of location alone, that I will find that there's
still a component of shape/dispersion in it, and that if I
haven't found that component, it's because I didn't work
hard enough.
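
The "D-statistic for every possible cell-pair" step mentioned above might be sketched like this in Python. I am assuming that D means the two-sample Kolmogorov-Smirnov distance (the largest vertical gap between two empirical distribution functions); the post does not spell that out, and the function names and the dict-of-cells layout are invented for the example. No p-values are computed, which is where the "frightful time" would begin.

```python
from bisect import bisect_right
from itertools import combinations

def ks_two_sample_d(x, y):
    """Largest gap between the empirical CDFs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)   # share of x at or below v
        fy = bisect_right(ys, v) / len(ys)   # share of y at or below v
        d = max(d, abs(fx - fy))
    return d

def pairwise_d(cells):
    """D for every pair of cells; `cells` maps a cell name to its data."""
    return {(i, j): ks_two_sample_d(cells[i], cells[j])
            for i, j in combinations(sorted(cells), 2)}
```

Since each empirical CDF is rescaled to run from 0 to 1, the comparison is already "standardized to equal Ns" in the sense used above.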
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Transformations
NB: What seemingly fouls up the application of simple things
like taking the log or square root is often the presence of an odd
score or two, i.e., an outlier that will only become *further*
outlying after what should have been a suitable transformation.
If the rest of the data "look good" after some transform, using
the criterion of being symmetrical around the median at most
of the percentiles, then the outlier should PROBABLY be dealt
with as an "outlier" (see previous section).
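
One way to make the "symmetrical around the median at most of the percentiles" criterion concrete is to compare, percentile by percentile, the distance from the median down to the p-th centile with the distance up to the (100-p)-th centile. The Python sketch below is only one possible reading of that criterion; the choice of 5% steps and the function name are my assumptions.

```python
import math
from statistics import quantiles

def symmetry_ratios(data):
    """Upper gap / lower gap around the median at 5%, 10%, ..., 45%.

    Ratios near 1 at most percentiles suggest symmetry; ratios well
    above 1 suggest a long upper tail, well below 1 a long lower tail.
    """
    q = quantiles(data, n=20)        # 19 cut points: 5%, 10%, ..., 95%
    med = q[9]                       # the 50% cut point
    ratios = {}
    for i in range(9):
        lower = med - q[i]           # median minus the p-th centile
        upper = q[18 - i] - med      # (100-p)-th centile minus median
        ratios[5 * (i + 1)] = upper / lower if lower > 0 else math.inf
    return ratios
```

Running it on the raw scores and again on the logged or square-rooted scores shows whether the transform has evened out the two sides; a single outlier affects only the most extreme percentiles, if any.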
=========================Ian Spence, 10 Jun 1996======ssm
From: Ian Spence
Subject: Re: ANOVA Problem
Message-ID: <31BC32E9.747E@psych.utoronto.ca>
<< question - what to do when the data are probably distributed
as Poisson, rather than normal >>
Square root is indicated. See, for example,
Rao, C.R. (1965) Linear statistical inference and its applications.
New York: Wiley, p.357
There is a second edition (I think) but I cannot give you the page
number. Kempthorne also discusses transformations in his classic text.
Kruskal's article on transformations in the International Encyclopedia
for the Social Sciences (and the abstracted International Encyclopedia
of Statistics) is still probably the single best short treatment of
data transformations.
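
Why the square root suits Poisson data can be seen numerically: the variance of a Poisson count equals its mean, while the variance of its square root stays close to 1/4 whatever the mean. The Python sketch below computes both variances directly from the Poisson probabilities instead of simulating; the function name and the truncation point of the sum are my own choices.

```python
import math

def poisson_variances(lam):
    """Return (variance of X, variance of sqrt(X)) for X ~ Poisson(lam)."""
    upper = int(lam + 12.0 * math.sqrt(lam) + 20)   # far into the upper tail
    mean_x = mean_x2 = mean_root = 0.0
    for k in range(upper + 1):
        # Poisson probability of k, computed on the log scale to avoid overflow.
        p = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        mean_x += p * k
        mean_x2 += p * k * k
        mean_root += p * math.sqrt(k)
    var_x = mean_x2 - mean_x ** 2
    var_root = mean_x - mean_root ** 2   # E[(sqrt X)^2] is just E[X]
    return var_x, var_root

for lam in (5.0, 20.0, 80.0):
    print(lam, poisson_variances(lam))
```

The first variance grows with the mean, which is what upsets the equal-variance assumption in ANOVA; the second barely moves, and gets closer to 1/4 as the mean increases.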
=====================Richard Goldstein, 11 Mar 1996========sse
From: richgold@netcom.com (Richard Goldstein)
Subject: Re: References for Transformations in Data Analysis
Message-ID:
Barry DeCicco (bdecicco@sunm4048as.sph.umich.edu) wrote:
: In article <4hukmv$ho3@earth.njcc.com>,
bgunter@pluto.njcc.com (Bert Gunter) writes:
<>
: |> Adding to Paul's list, a book by Tony Atkinson whose title is
: |> approximately "Transformations and ... in Regression" (correction
: |> welcome). Box, Hunter, and Hunter's "Statistics for Experimenters" and
: |> Box and Draper's "Response Surface Methods" also have good discussions
: |> on transformations in them.
: |>
: |> Bert Gunter
: |> bgunter@pluto.njcc.com
: |> (Statistical Consultant)
: |>
: I think that it is 'Transformations and Weighting in Regression'.
: If it is the same book I'm thinking of, half is on the use of weighted
: regression, and the other half is on the use of transformations.
No, this is a different book, by Carroll and Ruppert; the Atkinson book
is _Plots, Transformations and Regression_.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html