<- file stat p05.html -> FAQ - discussing (5%) test levels
  • ... Here was one discussion on the perennial topic of hypothesis testing.
  • =======================Dave Krantz, 14 Feb 1997==========sse
Message-ID: <199702141320.IAA13975@paradox.psych.columbia.edu>
Subject: replacing the current mythologies

There were two replies to my recent question:

> then just what is the student to do with two conditions
> in which p-values are .03 and .06?

from Samuel Scheiner (sam.scheiner@asu.edu):

>>What I teach my students is that, in the realm of scientific theory
>>generation and testing, the above means "maybe". Go back and do more
>>and/or different experiments. On the other hand, I do teach them about
>>rejecting and not rejecting the null hypothesis. But, I make it clear
>>that they should not confuse what is done in a single instance with the
>>much larger process that any one test is part of.

from Jerry Dallal (jerry@mint.hnrc.tufts.edu):

>>Make it 0.08 or 0.10 and life is more interesting and realistic. There
>>is no easy answer for the same reason that there is no easy answer to
>>the question of multiple comparisons. There are many ways to proceed.
>>One statistician's preferred method may be anathema to another whose
>>own method is shunned by the first. I am not prepared to offer anything
>>other than broad generalizations (which resolve nothing for the novice)
>>outside of a specific research protocol and formal research hypothesis.
>>Asking what to do about a P value of 0.06-0.10 seems similar to asking
>>what to do about a slightly but distinctly elevated white blood count.

I agree entirely with Scheiner's first two sentences. But I don't get the rest of it. What is actually done in each instance is necessarily part of the larger process. If rejecting or not rejecting the null hypothesis is not part of the larger process of science--then why should Scheiner (and I) teach our students about it? Or, WHAT should we teach them about it? It doesn't seem sensible to teach students a clearcut method and then tell them that in practice it is not ever usable.
I have trouble understanding why Dallal thinks that .08 and .10 are more interesting and realistic than .03 and .06, or why he restricts his analogy with slightly but distinctly elevated white blood count (which I find entirely apt) to the range .06-.10, rather than, say, something like .02-.20.

I completely agree with Dallal's assertion that the lack of easy answer is FOR THE SAME REASON (my emphasis) as the lack of easy answer to the question of multiple comparisons. In both cases, there is never any real reason to pick one, rather than another "allowable" type I error rate; the standard Neyman/Pearson account of how science is done, while very clever, is not a good description of the reality of (successful) scientific practice.

Nonetheless, hypothesis testing and confidence intervals based on it do seem very useful in science. I'm not questioning their usefulness, I am questioning our understanding of how in fact they are used. A better intellectual understanding of actual practice, leading to an accurate abstract description (in place of the Neyman/Pearson and Bayesian mythologies) might make teaching easier--which is why it may be useful to discuss questions about foundations in this forum.

Nobody reacted, either critically or favorably, to my concrete suggestion, so let me offer it again, more concretely and fully, though I intend it still only as a tentative suggestion:

When comparing a single parameter estimate to a theoretical value (which may be 0 in some situations), plot two sets of error bars, one at +- 1.5 estimated standard errors (perhaps with thick lines) and one at +- 2.5 estimated standard errors (with the extra +- 1 shown by a thinner line). Explain that the values covered by the thick lines are those that are not contravened to any appreciable degree by the CURRENT focal data (what may be known from past evidence or other concurrent studies is another story).
Those not covered even by the thin lines are those that are fairly strongly contravened by the CURRENT focal data. Emphasize that all this is conditional on a sensible estimate of "random error", which must be based on a reasonable overall model (without which, the parameter in question has little meaning in any case) and on substantial degrees of freedom for error (if the latter are few, then the rejection of values outside the thin lines must be taken as quite tentative).

The logic behind 1.5 and 2.5 can be explained using tail probabilities of approximately normal sampling distributions. The reason for using the tail is that data in the tails represent results that are BOTH DISTANT and UNLIKELY: distant from the parameter value being tested (if this is not true, one may be using a silly test statistic) and unlikely to arise purely by chance, if that value were the correct one.

This attempts to abstract how I think about the standard error of an estimate, in my own intuitive scientific inference; and so it is what I would like to be able to teach to students, instead of p-values, type I and II errors, coverage probabilities, or the strained decision-making analogy.

I could add more, chiefly about parameter vectors, for example, how the idea extends to comparisons of several group means, but this is already too much for one posting.

Dave Krantz

=======================Robert Frick, 15 Feb 1997==========sse
Message-ID: <EA564C1263F@psych1.psy.sunysb.edu>
Subject: Re: replacing the current mythologies

David Krantz asked:
> > then just what is the student to do with two conditions
> > in which p-values are .03 and .06?

If it is a question of what to believe, .03 is better evidence for rejecting the null hypothesis than .06. As hopefully most people would do in real life, and as Bayesians would argue, the other evidence on the issue also needs to be considered. Is there a theory to support the finding? What were previous findings? And so on.
If it is an issue of whether or not to publish, .03 allows publication (if other criteria are met) and .06 does not -- don't bother trying to publish a finding supported by p = .06, unless it is a very important issue and you can't get more subjects. Or, if it is secondary finding, you can report marginal significance and speculate on causes, but don't put any weight on it.

It seems perfectly rational to me that science require an experimenter to provide a criterion amount of evidence to support a claim, and the current standard is essentially p < .05.

Advertisement: This point of view is presented in Frick, R. W. "The Appropriate Use of Null Hypothesis Testing", Psychological Methods, 1996, 379-390.

Bob Frick

=======================Paul Bernhardt, 15 Feb 1997==========sse
Message-ID: <199702152104.OAA27198@gos.oz.cc.utah.edu>
Subject: Re: replacing the current mythologies

>If it is an issue of whether or not to publish, .03 allows
>publication (if other criteria are met) and .06 does not -- don't
>bother trying to publish a finding supported by p = .06, unless it
>is a very important issue and you can't get more subjects. Or, if
>it is secondary finding, you can report marginal significance and
>speculate on causes, but don't put any weight on it.

I've heard recently of an article on reforming the peer review process so that this problem is eliminated. Instead of presenting a full paper for review, the author presents the literature review, hypotheses, and methods (including statistical methods), but no results or discussion. The paper is reviewed based on only those things, and not the fact of finding probabilities less than .05. After acceptance a review of results and discussion is done, but only to make it fit the rest of the paper well and to ensure it is well written. Null results may not be the reason for rejection after acceptance of the lit review and method. This would get around these issues.
If you haven't read his newest revision, Pedhazur (1997) starts in Chapter 1 (beginning on page 10) with a near rant about the current state of the peer review process. He provides several examples of failures of the process and cites others' investigations into peer review failures as further demonstration of his views. Illustrative quotes:

"Unfortunately, problems with the review process are exacerbated by the appointment of editors unsuited to the task because of disposition and/or lack of knowledge to understand, let alone evaluate, the reviews they receive." (pg. 12)

"As I amply show in my commentaries on research studies, their very publication lead to the inescapable conclusion that editors and referees have either not carefully read the manuscripts or have no knowledge of the analytic methods used. I will let you decide which is the worse offense." (pg. 13)

Pedhazur, E.J. (1997) Multiple Regression in Behavioral Research: Explanation and Prediction. Third Edition. Harcourt Brace: Fort Worth.

=======================Robert Frick, 18 Feb 1995==========sse
Message-ID: <EEA99156FDA@psych1.psy.sunysb.edu>
Subject: Re: replacing the current mythologies

Paul Bernhardt wrote:
>
> I've heard recently of an article on reforming the peer review process so
> that this problem is eliminated. Instead of presenting a full paper for
> review, the author presents the literature review, hypotheses, and
> methods (including statistical methods), but no results or discussion.
> The paper is reviewed based on only those things, and not the fact of
> finding probabilities less than .05. After acceptance a review of results
> and discussion is done, but only to make it fit the rest of the paper
> well and to ensure it is well written. Null results may not be the reason
> for rejection after acceptance of the lit review and method. This would
> get around these issues.

This would be a disaster for me and I suspect would not work well for the field.
Problem #1: When one outcome is of interest because it challenges current theories, and another is not because it agrees with current theories. For example, what interest would there have been in Garcia's finding of taste aversion, prior to discovering that the effect existed?

Problem #2: An experiment devoted to accepting the null hypothesis needs slightly different evidential support than one claiming an effect (in addition to the obvious difference in the value of p). Showing that the experiment had sufficient power to find an effect is important for accepting the null hypothesis and not relevant once an effect is found.

Bob Frick

=======================Samuel Scheiner, 19 Feb 1997==========sse
Subject: Re: replacing the current mythologies
Message-ID: <330B2E3B.9DA@asu.edu>

Dave Krantz wrote:
>
> There were two replies to my recent question:
>
> > then just what is the student to do with two conditions
> > in which p-values are .03 and .06?
>
> from Samuel Scheiner (sam.scheiner@asu.edu):
>
> >>What I teach my students is that, in the realm of scientific theory
> >>generation and testing, the above means "maybe". Go back and do more
> >>and/or different experiments. On the other hand, I do teach them about
> >>rejecting and not rejecting the null hypothesis. But, I make it clear
> >>that they should not confuse what is done in a single instance with the
> >>much larger process that any one test is part of.
>
> I agree entirely with Scheiner's first two sentences. But I don't get
> the rest of it. What is actually done in each instance is necessarily
> part of the larger process. If rejecting or not rejecting the null
> hypothesis is not part of the larger process of science--then why should
> Scheiner (and I) teach our students about it? Or, WHAT should we teach
> them about it? It doesn't seem sensible to teach students a clearcut
> method and then tell them that in practice it is not ever usable.
> <rest deleted>

First, let me note that I have no arguments with Dave's suggestions for teaching these concepts. But, I should clarify my comments. What I meant was that it is possible to reject or not reject the null in a given instance yet decide on the validity of a scientific theory based on the preponderance of the evidence that is contrary to that particular decision. Meta-analysis is a formalization of this, but even that is still a special case of what scientists do in making decisions about hypotheses.

Does this make the individual decisions a waste of time? Of course not. The preponderance of the evidence has to come from somewhere. Perhaps the best way to say this is that the Popperian hypothetico-deductive method works when applied to pieces of scientific theories, but is insufficient when dealing with whole theories.

=======================H Rubin, 19 Feb 1997==========sse
Subject: Re: replacing the current mythologies
Message-ID: <5egasj$2rlt@b.stat.purdue.edu>

In article <EA564C1263F@psych1.psy.sunysb.edu>, ROBERT FRICK <RFRICK@psych1.psy.sunysb.edu> wrote:
>David Krantz asked:
>> > then just what is the student to do with two conditions
>> > in which p-values are .03 and .06?

>If it is a question of what to believe, .03 is better evidence for
>rejecting the null hypothesis than .06.

With a given probability model and sample size, this is correct. Otherwise, any remotely sound analysis of the decision problem shows that the relations between the amounts of evidence can be anything.

>As hopefully most people
>would do in real life, and as Bayesians would argue, the other
>evidence on the issue also needs to be considered. Is there a theory
>to support the finding? What were previous findings? And so on.
>If it is an issue of whether or not to publish, .03 allows
>publication (if other criteria are met) and .06 does not -- don't
>bother trying to publish a finding supported by p = .06, unless it
>is a very important issue and you can't get more subjects. Or, if
>it is secondary finding, you can report marginal significance and
>speculate on causes, but don't put any weight on it.

This is the utter stupidity which has been promulgated by those who are unwilling or unable to weigh the consequences. If one is willing to consider that the point null can be true, the above statement can be shown to be nonsense. Look at the loss-prior combination merely as weights to be used in combining what happens in different states of nature.

--
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
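
Editorial illustration (not part of the original thread): two numerical points raised above can be checked with a short sketch. The first part computes the two-sided normal tail probabilities behind Krantz's +- 1.5 and +- 2.5 standard-error bars. The second part is one possible reading of Rubin's remark that, once the sample size differs, the evidence carried by p = .03 and p = .06 can stand in any relation. It assumes a known error standard deviation `sigma`, a point null mu = 0, a normal prior with standard deviation `tau` on mu under the alternative, and the Bayes factor as the measure of evidence. None of these choices comes from the posts, and the function names are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_tail(z: float) -> float:
    """Probability that a standard normal deviate lies beyond +-z."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def z_from_p(p: float) -> float:
    """Absolute z-score whose two-sided tail probability equals p."""
    return STD_NORMAL.inv_cdf(1.0 - p / 2.0)


def bayes_factor_null(p: float, n: int, tau: float = 1.0, sigma: float = 1.0) -> float:
    """Bayes factor in favor of the point null mu = 0.

    Assumed setup (illustrative only): the sample mean of n observations is
    N(mu, sigma^2 / n); under the alternative, mu ~ N(0, tau^2).
    Values above 1 favor the null, values below 1 favor the alternative.
    """
    z = z_from_p(p)
    r = n * tau**2 / sigma**2  # prior variance relative to sampling variance
    return sqrt(1.0 + r) * exp(-0.5 * z * z * r / (1.0 + r))


if __name__ == "__main__":
    # Tail probabilities for the thick (1.5 SE) and thin (2.5 SE) error bars.
    for z in (1.5, 2.5):
        print(f"+-{z} SE: two-sided tail = {two_sided_tail(z):.4f}")

    # Same p-values, different sample sizes.
    for p, n in ((0.06, 10), (0.03, 10), (0.03, 10_000)):
        print(f"p = {p}, n = {n}: Bayes factor for null = {bayes_factor_null(p, n):.2f}")
```

The tail probabilities come out near 0.13 and 0.012. Under the assumed prior, p = .03 counts more strongly against the null than p = .06 when both come from n = 10, yet p = .03 from n = 10,000 yields a Bayes factor above 1, that is, it favors the null. A different prior or loss structure would change the numbers, which is the point of the remark.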
  • Document by Rich Ulrich. E-mail to wpilib+@pitt.edu
  • Ulrich FAQ. http://www.pitt.edu/~wpilib/stats99.html