Post hoc testing for ANOVA
There are some excellent references
provided down below, but I include
the first couple of Notes, here, in
order to emphasize that "posthoc"
testing is not a solution for every
problem.
Many groups: formal test?
=======================Rich Ulrich, 07 Mar 1997==========ssc
Subject: Re: SS-STP-like multiple test correction?
Message-ID: <5fpr12$k0l@usenet.srv.cis.pitt.edu>
Spencer Muse (muse@biosci.mbp.missouri.edu) wrote:
: I have estimates of a parameter, call it d, from 14 separate sets of
: discrete data, along with estimates of the standard deviations of these
: 14 estimates (all via likelihood). The standard deviations are very
: different for each data set (as I expected), and certainly can not be
: assumed to be homogeneous.
-- What are you TRYING to do? If the standard deviations are "very
different", then you definitely have *two* parameters per set of
data, and the sets "differ".
For a whole lot of purposes, you might let your curiosity end right
there. For the odd purposes where you still do have curiosity,
despite knowing that there is not an underlying distribution in
common, I think you have to say more about what *is* in common,
or why you are interested.
I am assuming, too, that there is not a simple transformation that
will produce homogeneity of variance, since you would have mentioned
that as something obvious....
...If I plot the 14 confidence intervals for d
: one above the other, the 14 data sets seem to split nicely into 2
: non-overlapping groups. To summarize, I have d-hat_1,
: d-hat_2,...,d-hat_14, and s_1,s_2,...s_14. How might I go about
: evaluating the significance of this grouping?
You might look at literature on evaluating "number of clusters"
achieved in cluster-analysis. However, it might be true that your
data sets, with a lot of heterogeneity, won't satisfy the usual
assumptions made in clustering.
Also: Barnett and Lewis's book on outliers gives numerous separate algorithms,
to detect 1, 2, ... K outliers; at one extreme or both; using
assumptions of different underlying distributions (Normal, gamma, etc.).
: On a related note, given that the correction for this many implicit
: multiple tests will push us FAR into the tail of the distribution, can I
: expect to obtain any reliable inference?
-- You are speaking to the reason that tests may not seem to have
great POWER when they are used. The reason for not considering an
inference to be reliable will be a) not a small p-level, or
b) not robust against other assumptions (especially note, well or
badly behaved variances).
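To put a number on "FAR into the tail": a minimal sketch, assuming the 14 estimates are independent and approximately normal, that all 91 pairwise differences are tested, and that a Bonferroni correction is used (the original post names no specific correction). The function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def bonferroni_z(n_estimates, alpha=0.05):
    """Return (number of pairwise comparisons, two-sided Bonferroni critical z)."""
    n_pairs = comb(n_estimates, 2)
    per_test_alpha = alpha / n_pairs
    z_crit = NormalDist().inv_cdf(1 - per_test_alpha / 2)
    return n_pairs, z_crit


def pairwise_z(d_i, s_i, d_j, s_j):
    """z statistic for the difference of two independent estimates
    with unequal standard errors (assumption: approximate normality)."""
    return (d_i - d_j) / sqrt(s_i ** 2 + s_j ** 2)


n_pairs, z_crit = bonferroni_z(14)
z_single = NormalDist().inv_cdf(0.975)  # critical z for one uncorrected test
print(f"{n_pairs} comparisons: critical z rises from {z_single:.2f} to {z_crit:.2f}")
```

Under these assumptions a pair has to differ by roughly three and a half standard errors of the difference, rather than about two, before it counts as significant.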
Which range test
=======================Rich Ulrich, 09 May 1997==========ssm
Subject: Re: Which Range Test
Message-ID: <5kvdea$fvp@usenet.srv.cis.pitt.edu>
Lalit Kumar (p2114659@saturn.geog.unsw.edu.au) wrote:
: Hi all
: I have 11 groups and want to compare all group means to see which are
: significantly different. I have used ANOVA and the F value tells me that
: there are significant differences. Now I want to find which pairs are
: different. I have used Duncan's multiple range test. Am I right in doing
: this? Someone else told me I should use Tukey's test and still others
: tell me to use Scheffe's test. What do I do?
-- Duncan's test has not been popular with statisticians for ages,
since Scheffe complained that he could not understand the basis for it,
and no one else defended it.
Scheffe's test never requires any prior test - if a comparison passes
Scheffe's, then it is such a big difference that the overall F *has*
to be significant.
There are a couple of tests with Tukey's name, which are popular.
There is also the argument that the overall test allows you to report
differences that are manifest with just the Least Significant Difference
(LSD). I rather like the LSD followup because it does not seem to
pretend, so much, that there is magic in passing somebody's tricky
criterion. - If you want to use stiffer testing, use a p-level that
is EXPLICITLY smaller, e.g., 0.01 instead of 0.05 - but that is
my personal opinion that your editor or advisor may not share.
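The remark above about Scheffe's test and the overall F can be written out for the one-way layout. This is a sketch in standard textbook notation, not taken from the post: k groups of sizes n_i, N observations in total, MSE the within-group mean square, and a contrast with coefficients c_i summing to zero.

```latex
% Scheffe criterion for a contrast \hat\psi = \sum_i c_i \bar{y}_i, \ \sum_i c_i = 0:
\frac{\hat\psi^{2}}{\mathrm{MSE}\,\sum_{i} c_i^{2}/n_i}
  \;>\; (k-1)\,F_{\alpha;\,k-1,\,N-k}

% By the Cauchy-Schwarz inequality, the largest value over all contrasts is
\max_{c}\;\frac{\hat\psi^{2}}{\sum_{i} c_i^{2}/n_i}
  \;=\; \sum_{i} n_i\,(\bar{y}_i-\bar{y})^{2}
  \;=\; SS_{\mathrm{between}}

% so the largest Scheffe statistic equals (k-1) times the omnibus F:
\max_{c}\;\frac{\hat\psi^{2}}{\mathrm{MSE}\,\sum_{i} c_i^{2}/n_i}
  \;=\; \frac{SS_{\mathrm{between}}}{\mathrm{MSE}}
  \;=\; (k-1)\,F_{\mathrm{omnibus}}
```

Hence some contrast passes Scheffe's criterion exactly when the omnibus F exceeds its own critical value, which is why no prior test is needed.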
"posthoc" tests and F
=======================Greg Hancock, 21 Feb 1997==========ssc
Message-ID:
From: Greg Hancock
Subject: Re: Post-hoc tests and significant F
John Reece wrote...
>I would like to garner some opinion on the relationship between post-hoc
>means comparison procedures and the omnibus F test. The traditional view in
>teaching psychology students (and I suspect students from many other
>disciplines) is that one should not carry out exploratory pairwise means
>comparisons unless an omnibus F test indicates significance at some
>arbitrary value, usually .05. However, several sources (Howell & Wilcox to
>name two) indicate that, far from being a requirement, exploratory means
>comparisons should be carried out regardless of the significance of the
>overall F, or even in lieu of an overall F. This makes some sense to me,
>mainly because it is my understanding that exploratory means comparison
>procedures were developed independent of the notion of an overall omnibus
>test.
I have many thoughts on this subject, few if any of which come down in
favor of an omnibus F. This does not mean, however, that I don't
think it should be taught. It is still a useful frame of reference
for understanding main effects in factorial designs (although these
could also be couched in a different "complex contrast" context). Anyway,
the first question I would ask is what do you mean by "exploratory?"
If by "exploratory" you mean that this is pilot work, meant to be a
precursor to more rigorous statistical analyses on other samples, I guess
I'd say spare the F and go nuts with your t-tests. But don't make any
grandiose proclamations based on your findings. Save those to follow your
more formal cross-validation work.
If by "exploratory" you mean that you're looking at some data
(probably sample means, specifically), and those data suggest some
particular comparisons that were not planned a priori, then I'm all in
favor of exacting some kind of familywise Type I error control.
However, the F-test is not the beast for doing this. It only works as a
control mechanism if the complete null hypothesis (that all population
means are equal) is true. When one population mean differs, and you have
sufficient power, you can get past the omnibus F pretty easily and then
make Type I errors on other null comparisons if you don't have another
control mechanism. Scheffe provides such a mechanism. The only value of
the omnibus F here is that it tells you if you should even bother with
Scheffe; if the omnibus F is not significant, then no Scheffe contrast or
comparison will be either. If the omnibus F is significant, then at least
one exploratory contrast or comparison will be as well (although it may
not be one you're interested in or one that makes sense).
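The point about the omnibus F controlling Type I error only under the complete null can be checked with a small simulation. This is a minimal sketch under simplifying assumptions that are not in the post: four groups, known variance (so the omnibus statistic is chi-square with 3 df and each pairwise test is a z-test at the unadjusted 0.05 level), and group means simulated directly with standard error 1. The function name and the mean of 10 for the one differing group are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt

Z_CRIT = 1.959964       # two-sided 0.05 critical value of the standard normal
CHI2_CRIT_DF3 = 7.815   # 0.95 quantile of chi-square with 3 df (standard tables)


def protected_lsd_fwer(true_means, null_pairs, n_sims=20000, seed=1):
    """Estimate the familywise Type I error of omnibus-protected pairwise tests.

    A familywise error is counted when the omnibus test rejects and at least
    one pair of groups with EQUAL true means is declared different.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_sims):
        means = [rng.gauss(mu, 1.0) for mu in true_means]
        grand = sum(means) / len(means)
        omnibus = sum((m - grand) ** 2 for m in means)
        if omnibus <= CHI2_CRIT_DF3:
            continue  # protection: no pairwise tests after a non-significant omnibus test
        if any(abs(means[i] - means[j]) / sqrt(2) > Z_CRIT for i, j in null_pairs):
            errors += 1
    return errors / n_sims


# Complete null: all four means equal, so every pair is a true null.
fwer_complete_null = protected_lsd_fwer([0, 0, 0, 0], list(combinations(range(4), 2)))
# Partial null: one mean differs; only pairs among the first three groups are true nulls.
fwer_partial_null = protected_lsd_fwer([0, 0, 0, 10], list(combinations(range(3), 2)))
print(f"complete null: {fwer_complete_null:.3f}   partial null: {fwer_partial_null:.3f}")
```

In this setup the error rate stays near 0.05 under the complete null but is clearly higher under the partial null, because the one large mean gets the omnibus test out of the way almost every time.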
By the way, Scheffe's test is actually unnecessarily conservative. A
paper presented last year at the American Educational Research Association
(Klockars & Hancock) dealt with this issue, and has been addressed in
independent work by Ottaway & Harris (personal communication).
An excerpt from a recent paper which addresses the omnibus F, among many
other topics, is presented below:
"There are a number of problems associated with the requirement of
an omnibus test rejection prior to conducting multiple comparisons; we
will present four. First, and most simply, few research questions are
directly addressed by an omnibus test. In a well planned study, the
researcher's questions involve specific contrasts of group means; the
omnibus test addresses each question only tangentially. Some might argue
that the omnibus test is not present to answer questions; rather, it is
there to facilitate control over the rate of Type I error. This issue of
control, however, brings us to our second point: the belief that an
omnibus test offers protection is not completely accurate. When the
complete null hypothesis is true, weak familywise Type I error control is
facilitated by the omnibus test; but, when the complete null is false and
partial nulls exist, the F-test does not maintain strong control over the
familywise error rate.
"A third point, which Games (1971) so elegantly demonstrated in
his figures, is that the F-test may not be completely consistent with the
results of a pairwise comparison approach. Consider, for example, a
researcher who is instructed to conduct Tukey's test only if an
alpha-level F-test rejects the complete null. It is possible for the
complete null to be rejected but for the widest ranging means not to
differ significantly. This is an example of what has been referred to as
incoherence (Gabriel, 1969) or incompatibility (Lehmann, 1957). On the
other hand, the complete null may be retained while the null associated
with the widest ranging means would have been rejected had the decision
structure allowed it to be tested. This has been referred to by Gabriel
(1969) as nonconsonance. One wonders if, in fact, a practitioner in this
situation would simply conduct the MCP contrary to the omnibus test's
recommendation. Strangely enough, such a seeming breach of multiple
comparison ethics would have largely positive statistical ramifications as
we discuss in our next and final point.
"The fourth argument against the traditional implementation of an
initial omnibus F-test stems from the fact that its well-intentioned but
unnecessary protection contributes to a decrease in power. The first test
in a pairwise MCP, such as that of the most disparate means in Tukey's
test, is a form of omnibus test all by itself, controlling the familywise
error rate at the alpha-level in the weak sense. Requiring a preliminary
omnibus F-test amounts to forcing a researcher to negotiate two hurdles to
proclaim the most disparate means significantly different, a task that the
range test accomplished at an acceptable alpha-level all by itself. If
these two tests were perfectly redundant, the results of both would be
identical and the omnibus test would represent neither friend nor foe;
probabilistically speaking, the joint probability of rejecting both would
be alpha when the complete null hypothesis was true. However, the two
tests are not completely redundant; as a result the joint probability of
their rejection is less than alpha. The F-protection therefore imposes
unnecessary conservatism (see Bernhardson, 1975, for a simulation of this
conservatism). For this reason, and those listed before, we agree with
Games' (1971) statement regarding the traditional implementation of a
preliminary omnibus F-test:
'There seems to be little point in applying the overall F test
prior to running c contrasts by procedures that set [the
familywise error rate] <= alpha.... If the c contrasts express
the experimental interest directly, they are justified whether the
overall F is significant or not and [familywise error rate] is
still controlled. (Games, 1971, p. 560)'"
Whole passage from:
Hancock, G. R., & Klockars, A. J. (1996). The quest for alpha:
Developments in multiple comparison procedures in the quarter century
since Games (1971). Review of Educational Research, 66(3), 269-306.
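The fourth argument in the excerpt (two hurdles make the joint rejection probability smaller than alpha) can also be illustrated by simulation. A minimal sketch, assuming three groups under the complete null with known variance, so that the range test uses the range of three standard normal means and the omnibus test is chi-square with 2 df; this is a simplification of the Tukey and F tests discussed in the paper, and the function name is hypothetical.

```python
import random
from math import log

# 0.95 quantile of the range of 3 standard normals
# (studentized range with infinite df, from standard tables)
Q_CRIT_K3 = 3.314
# Exact 0.95 quantile of chi-square with 2 df: solves 1 - exp(-x/2) = 0.95
CHI2_CRIT_DF2 = -2.0 * log(0.05)


def rejection_rates(n_sims=40000, seed=2):
    """Return (rate at which the range test rejects,
               rate at which range test AND omnibus test both reject)
    when all three true means are equal."""
    rng = random.Random(seed)
    range_only = 0
    both = 0
    for _ in range(n_sims):
        means = [rng.gauss(0.0, 1.0) for _ in range(3)]
        grand = sum(means) / 3
        omnibus_rejects = sum((m - grand) ** 2 for m in means) > CHI2_CRIT_DF2
        range_rejects = max(means) - min(means) > Q_CRIT_K3
        range_only += range_rejects
        both += range_rejects and omnibus_rejects
    return range_only / n_sims, both / n_sims


range_alone, range_after_omnibus = rejection_rates()
print(f"range test alone: {range_alone:.3f}   after omnibus protection: {range_after_omnibus:.3f}")
```

The range test alone rejects at about the nominal 0.05 rate, while requiring the omnibus test first gives a lower rate, which is the loss of power the excerpt describes.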
...
"post-hoc", logic in brief
=======================Hans-Peter Piepho, 21 Feb 1997==========ssc
Message-ID: <9702210852.AA14777@fserv.wiz.uni-kassel.de>
From: Hans-Peter Piepho
Subject: Re: Post-hoc tests and significant F
>I would like to garner some opinion on the relationship between post-hoc
>means comparison procedures and the omnibus F test. The traditional view in
>teaching psychology students (and I suspect students from many other
>disciplines) is that one should not carry out exploratory pairwise means
>comparisons unless an omnibus F test indicates significance at some
>arbitrary value, usually .05. However, several sources (Howell & Wilcox to
>name two) indicate that, far from being a requirement, exploratory means
>comparisons should be carried out regardless of the significance of the
>overall F, or even in lieu of an overall F. This makes some sense to me,
>mainly because it is my understanding that exploratory means comparison
>procedures were developed independent of the notion of an overall omnibus
>test.
>
>I am keen to hear people's opinions on this topic, either privately or to
>the list (along with any informative references that people might like to
>recommend), because it has direct implications for how I will teach this
>material.
>
>Thanks in anticipation.
>
>
The suggestion to pursue multiple comparisons only when an overall F-test
rejects, is connected with Fisher's protected LSD test. This guarantees the
experiment-wise Type I error to be controlled only in the weak sense, i.e.
only if the global null is true, but not otherwise (there is no protection
when the global null is false).
To control the experiment-wise error rate in the strong sense, i.e. also
when the global null is false (which I think is the most common situation),
a host of other procedures have been suggested, the most prominent of them
being Tukey's test, which uses studentized ranges. These tests do NOT require
a preliminary F-test.
_______________________________________________________________________
Hans-Peter Piepho
F distribution: issues, REFs
=======================Greg Hancock, 08 Apr 1997==========ssc
Message-ID:
From: Greg Hancock
Subject: Re: F distribution
On Sun, 6 Apr 1997, Ddusick wrote:
> A few weeks ago there was a discussion about the omnibus F test, and
> whether or not you could/should do post hoc testing if the omnibus F is
> not significant. My professor claims you should NEVER do a post hoc if
> the omnibus F is not significant, and challenged us to find ANY statistics
> book that says otherwise. Since several people in this newsgroup defended
> post hocs for omnibus Fs that are not significant, I thought you might be
> able to point me to a text which supports that? How about research
> articles?
"...you may want to apply multiple comparison procedures regardless of
whether you have already obtained a significant F test, and in fact there
may be situations where multiple comparison procedures are applied without
applying the F test at all..." (p.173, Wilcox, 1987).
Wilcox goes on to explain why on pages 187-188 in a section titled, "The
effect of using multiple comparison procedures only after a significant F
test."
Wilcox, R. R. (1987). New statistical procedures for the social
sciences: Modern solutions to basic problems. Hillsdale, NJ: Lawrence
Erlbaum Associates, Publishers.
More recently, this issue was discussed at length on pages 295-296 in a
recent article by Hancock & Klockars (1996).
Hancock, G. R., & Klockars, A. J. (1996). The quest for alpha:
Developments in multiple comparison procedures in the quarter century
since Games (1971). Review of Educational Research, 66(3), 269-306.
I happen to be on pretty good terms with the first author, in case you'd
like a copy.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
http://www.pitt.edu/~wpilib/stats99.html