Post hoc testing for ANOVA

There are some excellent references provided below, but I include the first couple of Notes here in order to emphasize that "posthoc" testing is not a solution for every problem.
  • Many groups: formal test?
  • =======================Rich Ulrich, 07 Mar 1997==========ssc
    Subject: Re: SS-STP-like multiple test correction?
    Message-ID: <5fpr12$k0l@usenet.srv.cis.pitt.edu>

    Spencer Muse (muse@biosci.mbp.missouri.edu) wrote:
    : I have estimates of a parameter, call it d, from 14 separate sets of
    : discrete data, along with estimates of the standard deviations of these
    : 14 estimates (all via likelihood). The standard deviations are very
    : different for each data set (as I expected), and certainly can not be
    : assumed to be homogeneous.

    -- What are you TRYING to do? If the standard deviations are "very different", then you definitely have *two* parameters per set of data, and the sets "differ". For a whole lot of purposes, you might let your curiosity end right there. For the odd purposes where you still do have curiosity, despite knowing that there is not an underlying distribution in common, I think you have to say more about what *is* in common, or why you are interested.

    I am assuming, too, that there is not a simple transformation that will produce homogeneity of variance, since you would have mentioned that as something obvious....

    : ...If I plot the 14 confidence intervals for d
    : one above the other, the 14 data sets seem to split nicely into 2
    : non-overlapping groups. To summarize, I have d-hat_1,
    : d-hat_2,...,d-hat_14, and s_1,s_2,...s_14. How might I go about
    : evaluating the significance of this grouping?

    You might look at literature on evaluating "number of clusters" achieved in cluster-analysis. However, it might be true that your data sets, with a lot of heterogeneity, won't satisfy the usual assumptions made in clustering.

    Also: Bartlett's book on Outliers gives numerous separate algorithms, to detect 1, 2, ... K outliers; at one extreme or both; using assumptions of different underlying distributions (Normal, gamma, etc.).

    : On a related note, given that the correction for this many implicit
    : multiple tests will push us FAR into the tail of the distribution, can I
    : expect to obtain any reliable inference?

    -- You are speaking to the reason that tests may not seem to have great POWER when they are used. The reason for not considering an inference to be reliable will be a) not a small p-level, or b) not robust against other assumptions (especially note, well or badly behaved variances).
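As a rough illustration of how far into the tail a correction for many implicit tests pushes the criterion, here is a minimal Python sketch. It rests on assumptions that are not in the original posts: each estimate is treated as approximately normal, all 14*13/2 pairwise comparisons are taken to be of interest, and a Bonferroni correction is used. The estimates and standard deviations are made-up numbers, not Muse's data.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

# Hypothetical estimates d-hat_i and their standard deviations s_i (illustrative only).
d_hat = [0.10, 0.12, 0.15, 0.11, 0.09, 0.14, 0.13,
         0.45, 0.50, 0.48, 0.52, 0.47, 0.55, 0.49]
s = [0.02, 0.05, 0.03, 0.08, 0.04, 0.06, 0.02,
     0.10, 0.07, 0.03, 0.12, 0.05, 0.09, 0.04]

alpha = 0.05
pairs = list(combinations(range(len(d_hat)), 2))
n_pairs = len(pairs)  # 14 * 13 / 2 = 91 comparisons

# Two-sided critical values: unadjusted, and Bonferroni-adjusted for all pairs.
z_plain = NormalDist().inv_cdf(1 - alpha / 2)
z_bonf = NormalDist().inv_cdf(1 - alpha / (2 * n_pairs))


def z_stat(i, j):
    """z for the difference of two independent estimates with unequal SDs."""
    return (d_hat[i] - d_hat[j]) / sqrt(s[i] ** 2 + s[j] ** 2)


n_sig_plain = sum(abs(z_stat(i, j)) > z_plain for i, j in pairs)
n_sig_bonf = sum(abs(z_stat(i, j)) > z_bonf for i, j in pairs)

print(f"{n_pairs} pairs; critical |z| goes from {z_plain:.2f} to {z_bonf:.2f}")
print(f"pairs declared different: {n_sig_plain} unadjusted, {n_sig_bonf} Bonferroni")
```

The stiffer criterion is where the loss of power comes from: differences that clear the unadjusted cutoff may no longer clear the adjusted one.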
  • Which range test
  • =======================Rich Ulrich, 09 May 1997==========ssm
    Subject: Re: Which Range Test
    Message-ID: <5kvdea$fvp@usenet.srv.cis.pitt.edu>

    Lalit Kumar (p2114659@saturn.geog.unsw.edu.au) wrote:
    : Hi all
    : I have 11 groups and want to compare all group means to see which are
    : significantly different. I have used ANOVA and the F value tells me that
    : there are significant differences. Now I want to find which pairs are
    : different. I have used Duncan's multiple range test. Am I right in doing
    : this? Someone else told me I should use Tukey's test and still others
    : tell me to use Scheffe's test. What do I do?

    -- Duncan's test has not been popular with statisticians for ages, since Scheffe complained that he could not understand the basis for it, and no one else defended it.

    Scheffe's test never requires any prior test - if a comparison passes Scheffe's, then it is such a big difference that the overall F *has* to be significant.

    There are a couple of tests with Tukey's name, which are popular.

    There is also the argument that the overall test allows you to report differences that are manifest with just the Least Significant Difference (LSD). I rather like the LSD followup because it does not seem to pretend, so much, that there is magic in passing somebody's tricky criterion.

    - If you want to use stiffer testing, use a p-level that is EXPLICITLY smaller, e.g., 0.01 instead of 0.05 - but that is my personal opinion that your editor or advisor may not share.
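The remark that a comparison passing Scheffe's test forces a significant overall F can be checked numerically. The sketch below assumes equal group sizes and uses made-up group means (the group size and all names are hypothetical). It verifies that the sum of squares of any contrast never exceeds the between-groups sum of squares.

```python
import random

rng = random.Random(1997)
k, n = 11, 10  # 11 groups as in the question; n per group is hypothetical
means = [rng.gauss(0.0, 1.0) for _ in range(k)]  # made-up group means
grand = sum(means) / k

# Between-groups sum of squares for equal group sizes.
ss_between = n * sum((m - grand) ** 2 for m in means)


def ss_contrast(c):
    """Sum of squares for contrast c (coefficients sum to zero), equal n."""
    return n * sum(ci * m for ci, m in zip(c, means)) ** 2 / sum(ci ** 2 for ci in c)


def random_contrast():
    raw = [rng.gauss(0.0, 1.0) for _ in range(k)]
    centre = sum(raw) / k
    return [r - centre for r in raw]  # centre so the coefficients sum to zero


largest_random = max(ss_contrast(random_contrast()) for _ in range(10_000))

# The contrast with coefficients (mean_i - grand mean) attains the bound.
best = ss_contrast([m - grand for m in means])

print(f"SS between          : {ss_between:.4f}")
print(f"largest random SS_c : {largest_random:.4f}")
print(f"maximising contrast : {best:.4f}")
```

Scheffe's statistic for a contrast is SS_c / ((k-1) MS_error), and the overall F is SS_between / ((k-1) MS_error). Since SS_c cannot exceed SS_between, any contrast that exceeds the critical F carries the overall F above it as well.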
  • "posthoc" tests and F
  • =======================Greg Hancock, 21 Feb 1997==========ssc
    Message-ID: <Pine.SOL.3.95.970221071300.1842A-100000@rac3.wam.umd.edu>
    From: Greg Hancock <ghancock@WAM.UMD.EDU>
    Subject: Re: Post-hoc tests and significant F

    John Reece wrote...
    >I would like to garner some opinion on the relationship between post-hoc
    >means comparison procedures and the omnibus F test. The traditional view in
    >teaching psychology students (and I suspect students from many other
    >disciplines) is that one should not carry out exploratory pairwise means
    >comparisons unless an omnibus F test indicates significance at some
    >arbitrary value, usually .05. However, several sources (Howell & Wilcox to
    >name two) indicate that, far from being a requirement, exploratory means
    >comparisons should be carried out regardless of the significance of the
    >overall F, or even in lieu of an overall F. This makes some sense to me,
    >mainly because it is my understanding that exploratory means comparison
    >procedures were developed independent of the notion of an overall omnibus
    >test.

    I have many thoughts on this subject, few if any of which come down in favor of an omnibus F. This does not mean, however, that I don't think it should be taught. It is still a useful frame of reference for understanding main effects in factorial designs (although these could also be couched in a different "complex contrast" context). Anyway, the first question I would ask is what do you mean by "exploratory?"

    If by "exploratory" you mean that this is pilot work, meant to be a precursor to more rigorous statistical analyses on other samples, I guess I'd say spare the F and go nuts with your t-tests. But don't make any grandiose proclamations based on your findings. Save those to follow your more formal cross-validation work.
    If by "exploratory" you mean that you're looking at some data (probably sample means, specifically), and those data suggest some particular comparisons that were not planned a priori, then I'm all in favor of exacting some kind of familywise Type I error control. However, the F-test is not the beast for doing this. It only works as a control mechanism if the complete null hypothesis (that all population means are equal) is true. When one population mean differs, and you have sufficient power, you can get past the omnibus F pretty easily and then make Type I errors on other null comparisons if you don't have another control mechanism.

    Scheffe provides such a mechanism. The only value of the omnibus F here is that it tells you if you should even bother with Scheffe; if the omnibus F is not significant, then no Scheffe contrast or comparison will be either. If the omnibus F is significant, then at least one exploratory contrast or comparison will be as well (although it may not be one you're interested in or one that makes sense).

    By the way, Scheffe's test is actually unnecessarily conservative. A paper presented last year at the American Educational Research Association (Klockars & Hancock) dealt with this issue, and has been addressed in independent work by Ottaway & Harris (personal communication).

    An excerpt from a recent paper which addresses the omnibus F, among many other topics, is presented below:

    "There are a number of problems associated with the requirement of an omnibus test rejection prior to conducting multiple comparisons; we will present four. First, and most simply, few research questions are directly addressed by an omnibus test. In a well planned study, the researcher's questions involve specific contrasts of group means; the omnibus test addresses each question only tangentially. Some might argue that the omnibus test is not present to answer questions; rather, it is there to facilitate control over the rate of Type I error.
    This issue of control, however, brings us to our second point: the belief that an omnibus test offers protection is not completely accurate. When the complete null hypothesis is true, weak familywise Type I error control is facilitated by the omnibus test; but, when the complete null is false and partial nulls exist, the F-test does not maintain strong control over the familywise error rate.

    "A third point, which Games (1971) so elegantly demonstrated in his figures, is that the F-test may not be completely consistent with the results of a pairwise comparison approach. Consider, for example, a researcher who is instructed to conduct Tukey's test only if an alpha-level F-test rejects the complete null. It is possible for the complete null to be rejected but for the widest ranging means not to differ significantly. This is an example of what has been referred to as incoherence (Gabriel, 1969) or incompatibility (Lehmann, 1957). On the other hand, the complete null may be retained while the null associated with the widest ranging means would have been rejected had the decision structure allowed it to be tested. This has been referred to by Gabriel (1969) as nonconsonance. One wonders if, in fact, a practitioner in this situation would simply conduct the MCP contrary to the omnibus test's recommendation. Strangely enough, such a seeming breach of multiple comparison ethics would have largely positive statistical ramifications as we discuss in our next and final point.

    "The fourth argument against the traditional implementation of an initial omnibus F-test stems from the fact that its well-intentioned but unnecessary protection contributes to a decrease in power. The first test in a pairwise MCP, such as that of the most disparate means in Tukey's test, is a form of omnibus test all by itself, controlling the familywise error rate at the alpha-level in the weak sense.
    Requiring a preliminary omnibus F-test amounts to forcing a researcher to negotiate two hurdles to proclaim the most disparate means significantly different, a task that the range test accomplished at an acceptable alpha-level all by itself. If these two tests were perfectly redundant, the results of both would be identical and the omnibus test would represent neither friend nor foe; probabilistically speaking, the joint probability of rejecting both would be alpha when the complete null hypothesis was true. However, the two tests are not completely redundant; as a result the joint probability of their rejection is less than alpha. The F-protection therefore imposes unnecessary conservatism (see Bernhardson, 1975, for a simulation of this conservatism). For this reason, and those listed before, we agree with Games' (1971) statement regarding the traditional implementation of a preliminary omnibus F-test: 'There seems to be little point in applying the overall F test prior to running c contrasts by procedures that set [the familywise error rate] <= alpha.... If the c contrasts express the experimental interest directly, they are justified whether the overall F is significant or not and [familywise error rate] is still controlled. (Games, 1971, p. 560)'"

    Whole passage from:

    Hancock, G. R., & Klockars, A. J. (1996). The quest for alpha: Developments in multiple comparison procedures in the quarter century since Games (1971). Review of Educational Research, 66(3), 269-306.

    ...
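The distinction between weak and strong familywise control described in this post can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are mine, not the authors': four groups, equal n, and a known error variance of 1, so that z and chi-square criteria stand in for t and F. The "protected LSD" rule runs unadjusted pairwise tests only after the omnibus test rejects. All function and variable names are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import NormalDist

K, N, ALPHA, REPS = 4, 10, 0.05, 20_000
SE_MEAN = 1 / sqrt(N)  # known sigma = 1 (simplifying assumption)
LSD = NormalDist().inv_cdf(1 - ALPHA / 2) * sqrt(2) * SE_MEAN


def sample_means(mu, rng):
    """Draw one set of group means around the population means mu."""
    return [rng.gauss(m, SE_MEAN) for m in mu]


def omnibus_stat(xbar):
    """Between-groups statistic; chi-square with K-1 df under the complete null."""
    grand = sum(xbar) / len(xbar)
    return N * sum((x - grand) ** 2 for x in xbar)


def calibrate_omnibus(rng):
    """Monte Carlo estimate of the omnibus critical value under the complete null."""
    stats = sorted(omnibus_stat(sample_means([0.0] * K, rng)) for _ in range(REPS))
    return stats[int((1 - ALPHA) * REPS)]


def familywise_error(mu, crit, rng):
    """Share of experiments with a false rejection among truly equal pairs."""
    null_pairs = [(i, j) for i, j in combinations(range(K), 2) if mu[i] == mu[j]]
    errors = 0
    for _ in range(REPS):
        xbar = sample_means(mu, rng)
        if omnibus_stat(xbar) <= crit:
            continue  # omnibus not significant: no pairwise tests are run
        if any(abs(xbar[i] - xbar[j]) > LSD for i, j in null_pairs):
            errors += 1
    return errors / REPS


rng = random.Random(12345)
crit = calibrate_omnibus(rng)
fwe_complete_null = familywise_error([0.0, 0.0, 0.0, 0.0], crit, rng)
fwe_partial_null = familywise_error([0.0, 0.0, 0.0, 5.0], crit, rng)

print(f"complete null: familywise error ~ {fwe_complete_null:.3f}")
print(f"partial null : familywise error ~ {fwe_partial_null:.3f}")
```

With one mean far from the others, the omnibus test rejects essentially every time, so the three remaining null comparisons are tested with no protection at all. In this setup the familywise rate in the partial-null case should come out well above the nominal 0.05.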
  • "post-hoc", logic in brief
  • =======================Hans-Peter Piepho, 21 Feb 1997==========ssc
    Message-ID: <9702210852.AA14777@fserv.wiz.uni-kassel.de>
    From: Hans-Peter Piepho <piepho@WIZ.UNI-KASSEL.DE>
    Subject: Re: Post-hoc tests and significant F

    >I would like to garner some opinion on the relationship between post-hoc
    >means comparison procedures and the omnibus F test. The traditional view in
    >teaching psychology students (and I suspect students from many other
    >disciplines) is that one should not carry out exploratory pairwise means
    >comparisons unless an omnibus F test indicates significance at some
    >arbitrary value, usually .05. However, several sources (Howell & Wilcox to
    >name two) indicate that, far from being a requirement, exploratory means
    >comparisons should be carried out regardless of the significance of the
    >overall F, or even in lieu of an overall F. This makes some sense to me,
    >mainly because it is my understanding that exploratory means comparison
    >procedures were developed independent of the notion of an overall omnibus
    >test.
    >
    >I am keen to hear people's opinions on this topic, either privately or to
    >the list (along with any informative references that people might like to
    >recommend), because it has direct implications for how I will teach this
    >material.
    >
    >Thanks in anticipation.
    >
    >

    The suggestion to pursue multiple comparisons only when an overall F-test rejects is connected with Fisher's protected LSD test. This guarantees the experiment-wise Type I error to be controlled only in the weak sense, i.e. only if the global null is true, but not otherwise (there is no protection when the global null is false). To control the experiment-wise error rate in the strong sense, i.e. also when the global null is false (which I think is the most common situation), a host of other procedures have been suggested, the most prominent of them being Tukey's test, which uses studentized ranges. These tests do NOT require a preliminary F-test.

    _______________________________________________________________________
    Hans-Peter Piepho
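To see why a range-based criterion is stiffer than an unadjusted pairwise one, the following sketch estimates the critical range by Monte Carlo. It assumes a known error variance and equal group sizes, so it approximates the infinite-degrees-of-freedom case rather than Tukey's test as it would be run on real data. The choice of 4 groups is arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

K, ALPHA, REPS = 4, 0.05, 20_000
rng = random.Random(2024)

# Range of K standardized group means under the complete null, expressed in
# units of the standard error of a single mean.
ranges = []
for _ in range(REPS):
    z = [rng.gauss(0.0, 1.0) for _ in range(K)]
    ranges.append(max(z) - min(z))
ranges.sort()
range_crit = ranges[int((1 - ALPHA) * REPS)]  # about the 95th percentile

# Unadjusted (LSD-style) critical difference in the same units.
lsd_crit = NormalDist().inv_cdf(1 - ALPHA / 2) * sqrt(2)

print(f"range-based critical difference ~ {range_crit:.2f} standard errors")
print(f"unadjusted critical difference  = {lsd_crit:.2f} standard errors")
```

Under this rule a pair is declared different when its means differ by more than the range-based value. Because the largest difference is itself held to the alpha level, the criterion does not depend on a preliminary F-test.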
  • F distribution: issues, REFs
  • =======================Greg Hancock, 08 Apr 1997==========ssc
    Message-ID: <Pine.SOL.3.95.970408065055.1783A-100000@rac9.wam.umd.edu>
    From: Greg Hancock <ghancock@wam.umd.edu>
    Subject: Re: F distribution

    On Sun, 6 Apr 1997, Ddusick wrote:
    > A few weeks ago there was a discussion about the omnibus F test, and
    > whether or not you could/should do post hoc testing if the omnibus F is
    > not significant. My professor claims you should NEVER do a post hoc if
    > the omnibus F is not significant, and challenged us to find ANY statistics
    > book that says otherwise. Since several people in this newsgroup defended
    > post hocs for omnibus Fs that are not significant, I thought you might be
    > able to point me to a text which supports that? How about research
    > articles?

    "...you may want to apply multiple comparison procedures regardless of whether you have already obtained a significant F test, and in fact there may be situations where multiple comparison procedures are applied without applying the F test at all..." (p.173, Wilcox, 1987).

    Wilcox goes on to explain why on pages 187-188 in a section titled, "The effect of using multiple comparison procedures only after a significant F test."

    Wilcox, R. R. (1987). New statistical procedures for the social sciences: Modern solutions to basic problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

    More recently, this issue was discussed at length on pages 295-296 in a recent article by Hancock & Klockars (1996).

    Hancock, G. R., & Klockars, A. J. (1996). The quest for alpha: Developments in multiple comparison procedures in the quarter century since Games (1971). Review of Educational Research, 66(3), 269-306.

    I happen to be on pretty good terms with the first author, in case you'd like a copy.

    * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html