The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask

Published in: D. Kaplan (Ed.). (2004). The Sage handbook of quantitative methodology for the social sciences. Thousand Oaks, CA: Sage Publications.
Gerd Gigerenzer, Stefan Krauss, and Oliver Vitouch

No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. (Ronald A. Fisher, 1956, p. 42)

It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. (A. H. Maslow, 1966)

One of us once had a student who ran an experiment for his thesis. Let us call him Pogo. Pogo had an experimental group and a control group and found that the means of both groups were exactly the same. He believed it would be unscientific to simply state this result; he was anxious to do a significance test. The result of the test was that the two means did not differ significantly, which Pogo reported in his thesis.

In 1962, Jacob Cohen reported that the experiments published in a major psychology journal had, on average, only a 50:50 chance of detecting a medium-sized effect if there was one. That is, the statistical power was as low as 50%. This result was widely cited, but did it change researchers' practice? Sedlmeier and Gigerenzer (1989) checked the studies in the same journal, 24 years later, a time period that should allow for change. Yet only 2 out of 64 researchers mentioned power, and it was never estimated. Unnoticed, the average power had decreased (researchers now used alpha adjustment, which shrinks power). Thus, if there had been an effect of a medium size, the researchers would have had a better chance of finding it by throwing a coin rather than conducting their experiments.
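To make the notion of 50% power concrete, the following is a minimal sketch, not a computation taken from the studies cited. It assumes a two-sided test of two equally large groups, uses the normal approximation to the t-test, and takes a standardized effect of d = 0.5 as a stand-in for a "medium-sized" effect; the function name `approx_power` and all sample sizes are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means.

    Assumption: the normal approximation to the t-test is good enough for
    illustration. d is the standardized effect size, n_per_group the size
    of each of the two equally large groups.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = d * sqrt(n_per_group / 2)      # expected value of the test statistic
    # Probability of landing in either rejection region when the effect is real
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


if __name__ == "__main__":
    # d = 0.5 is an assumed value for a "medium" effect, not a figure from the text
    for n in (16, 32, 64, 128):
        print(f"n per group = {n:3d}: power ~ {approx_power(0.5, n):.2f}")
```

Under these assumptions, power is close to one half at roughly 32 participants per group, and it falls further when alpha is lowered, which is the mechanism behind the remark that alpha adjustment shrinks power.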
When we checked the years 2000 to 2002, with some 220 empirical articles, we finally found 9 researchers who computed the power of their tests. Forty years after Cohen, there is a first sign of change.

Editors of major journals such as A. W. Melton (1962) made null hypothesis testing a necessary condition for the acceptance of papers and made small p-values the hallmark of excellent experimentation. The Skinnerians found themselves forced to start a new journal, the Journal of the Experimental Analysis of Behavior, to publish their kind of experiments (Skinner, 1984, p. 138). Similarly, one reason for launching the Journal of Mathematical Psychology was to escape the editors' pressure to routinely perform null hypothesis testing. One of its founders, R. D. Luce (1988), called this practice a "wrongheaded view about what constituted scientific progress" and "mindless hypothesis testing in lieu of doing good research: measuring effects, constructing substantive theories of some depth, and developing probability models and statistical procedures suited to these theories" (p. 582).

Author's note: We are grateful to David Kaplan and Stanley Mulaik for helpful comments and to Katharina Petrasch for her support with journal analyses.

The student, the researchers, and the editors had engaged in a statistical ritual rather than statistical thinking. Pogo believed that one always ought to perform a null hypothesis test, without exception. The researchers did not notice how small their statistical power was, nor did they seem to care: Power is not part of the null ritual that dominates experimental psychology. The essence of the ritual is the following:

(1) Set up a statistical null hypothesis of "no mean difference" or "zero correlation." Don't specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
(2) Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis.
(3) Always perform this procedure.

The null ritual has sophisticated aspects we will not cover here, such as alpha adjustment and ANOVA procedures, but these do not change its essence. Typically, it is presented without naming its originators, as statistics per se. Some suggest that it was authorized by the eminent statistician Sir Ronald A. Fisher, owing to the emphasis on null hypothesis testing (not to be confused with the null ritual) in his 1935 book. However, Fisher would have rejected all three ingredients of this procedure. First, "null" does not refer to a zero mean difference or zero correlation but to the hypothesis to be "nullified," which could postulate a correlation of .3, for instance. Second, as the epigraph illustrates, by 1956, Fisher thought that using a routine 5% level of significance indicated a lack of statistical thinking. Third, for Fisher, null hypothesis testing was the most primitive type in a hierarchy of statistical analyses and should be used only for problems about which we have very little knowledge or none at all (Gigerenzer et al., 1989, chap. 3). Statistics offers a toolbox of methods, not just a single hammer. In many (if not most) cases, descriptive statistics and exploratory data analysis are all one needs.

As we will see soon, the null ritual originated neither from Fisher nor from any other renowned statistician and does not exist in statistics proper. It was instead fabricated in the minds of statistical textbook writers in psychology and education.

Rituals seem to be indispensable for the self-definition of social groups and for transitions in life, and there is nothing wrong with them. However, they should be the subject rather than the procedure of the social sciences.
Elements of social rituals include (a) the repetition of the same action, (b) a focus on special numbers or colors, (c) fears about serious sanctions for rule violations, and (d) wishful thinking and delusions that virtually eliminate critical thinking (Dulaney & Fiske, 1994). The null ritual has each of these four characteristics: a repetitive sequence, a fixation on the 5% level, fear of sanctions by editors or advisers, and wishful thinking about the outcome (the p-value) combined with a lack of courage to ask questions.

Pogo's counterpart in this chapter is a curious student who wants to understand the ritual rather than mindlessly perform it. She has the courage to raise questions that seem naive at first glance and that others do not care or dare to ask.

Question 1: What Does a Significant Result Mean?

What a simple question! Who would not know the answer? After all, psychology students spend months sitting through statistics courses, learning about null hypothesis tests (significance tests) and their featured product, the p-value. Just to be sure, consider the following problem (Haller & Krauss, 2002; Oakes, 1986):

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say, 20 subjects in each sample). Furthermore, suppose you use a simple independent means t-test and your result is significant (t = 2.7, df = 18, p = .01). Please mark each of the statements below as "true" or "false." "False" means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.

(1) You have absolutely disproved the null hypothesis (i.e., there is no difference between the population means). True / False
(2) You have found the probability of the null hypothesis being true. True / False
(3) You have absolutely proved your experimental hypothesis (that there is a difference between the population means). True / False
(4) You can deduce the probability of the experimental hypothesis being true. True / False
(5) You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. True / False
(6) You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. True / False

Which statements are true? If you want to avoid the "I-knew-it-all-along" feeling, please answer the six questions yourself before continuing to read. When you are done, consider what a p-value actually is: A p-value is the probability of the observed data (or of more extreme data points), given that the null hypothesis $H_0$ is true, defined in symbols as $p(D \mid H_0)$. This definition can be rephrased in a more technical form by introducing the statistical model underlying the analysis (Gigerenzer et al., 1989, chap. 3).

Let us now see which of the six answers are correct:

Statements 1 and 3: Statement 1 is easily detected as being false. A significance test can never disprove the null hypothesis. Significance tests provide probabilities, not definite proofs. For the same reason, Statement 3, which implies that a significant result could prove the experimental hypothesis, is false. Statements 1 and 3 are instances of the illusion of certainty (Gigerenzer, 2002).

Statements 2 and 4: Recall that a p-value is a probability of data, not of a hypothesis. Despite wishful thinking, $p(D \mid H_0)$ is not the same as $p(H_0 \mid D)$, and a significance test does not and cannot provide a probability for a hypothesis. One cannot conclude from a p-value that a hypothesis has a probability of 1 (Statements 1 and 3) or that it has any other probability (Statements 2 and 4). Therefore, Statements 2 and 4 are false.
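The gap between the probability of data given a hypothesis and the probability of a hypothesis given data can be illustrated with Bayes' rule. The sketch below is a toy calculation under stated assumptions, not an analysis from the chapter: it treats D as the event "a result at least as extreme as the one observed," fixes its probability under the null at .01, and then varies two quantities that a significance test never supplies, namely the probability of D under the alternative and the prior probability of the null. All numbers other than .01 are hypothetical.

```python
def posterior_h0(p_d_given_h0, p_d_given_h1, prior_h0):
    """Bayes' rule for two exhaustive hypotheses H0 and H1: returns p(H0 | D)."""
    joint_h0 = prior_h0 * p_d_given_h0          # p(D and H0)
    joint_h1 = (1 - prior_h0) * p_d_given_h1    # p(D and H1)
    return joint_h0 / (joint_h0 + joint_h1)


if __name__ == "__main__":
    p_d_given_h0 = 0.01  # plays the role of the p-value in the example problem
    # Hypothetical scenarios: (p(D | H1), prior p(H0)); none come from the chapter
    scenarios = [(0.50, 0.5), (0.10, 0.5), (0.10, 0.9), (0.01, 0.5)]
    for p_d_given_h1, prior_h0 in scenarios:
        post = posterior_h0(p_d_given_h0, p_d_given_h1, prior_h0)
        print(f"p(D|H1)={p_d_given_h1:.2f}, p(H0)={prior_h0:.1f} -> p(H0|D)={post:.3f}")
```

The same value of .01 for the data under the null is compatible with very different probabilities for the null itself, depending on inputs that lie outside the significance test.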
The statistical toolbox, of course, contains tools that allow estimating probabilities of hypotheses, such as Bayesian statistics (see below). However, null hypothesis testing does not.

Statement 5: The "probability that you are making the wrong decision" is again a probability of a hypothesis. This is because if one rejects the null hypothesis, the only possibility of making a wrong decision is if the null hypothesis is true. In other words, a closer look at Statement 5 reveals that it is about the probability that you will make the wrong decision, that is, that $H_0$ is true. Thus, it makes essentially the same claim as Statement 2 does, and both are incorrect.

Statement 6: Statement 6 amounts to the replication fallacy. Recall that a p-value is the probability of the observed data (or of more extreme data points), given that the null hypothesis is true. Statement 6, however, is about the probability of "significant" data per se, not about the probability of data if the null hypothesis were true. The error in Statement 6 is that p = 1% is taken to imply that such significant data would reappear in 99% of the repetitions. Statement 6 could be made only if one knew that the null hypothesis was true. In formal terms, $p(D \mid H_0)$ is confused with $1 - p(D)$. The replication fallacy is shared by many, including the editors of top journals. For instance, the former editor of the Journal of Experimental Psychology, A. W. Melton (1962), wrote in his editorial, "The level of significance measures the confidence that the results of the experiment would be repeatable under the conditions described" (p. 553). A nice fantasy, but false.

To sum up, all six statements are incorrect. Note that all six err in the same direction of wishful thinking: They overestimate what one can conclude from a p-value.
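The replication fallacy can also be checked by simulation. The following is a rough sketch under assumptions that are not part of the chapter: normally distributed scores with unit standard deviation, 10 participants per group (chosen here only so that df = 18, as in the example problem), a pooled-variance t-test at the two-sided 5% level, and three hypothetical true effect sizes. The names `significant` and `replication_rate` are illustrative.

```python
import random
from math import sqrt
from statistics import mean, variance

T_CRIT_DF18 = 2.101  # two-sided 5% critical value of t for df = 18


def significant(n, true_d, rng):
    """Run one two-group experiment and report whether |t| exceeds the cutoff."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n)]
    pooled_var = (variance(control) + variance(treated)) / 2  # equal group sizes
    t = (mean(treated) - mean(control)) / sqrt(pooled_var * 2 / n)
    return abs(t) > T_CRIT_DF18


def replication_rate(true_d, n=10, runs=4000, seed=1):
    """Share of repeated experiments that reach significance for a given true effect."""
    rng = random.Random(seed)
    return sum(significant(n, true_d, rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    for true_d in (0.0, 0.5, 1.2):  # hypothetical true effects in SD units
        print(f"true d = {true_d}: significant in {replication_rate(true_d):.0%} of runs")
```

How often a significant result reappears depends on the unknown true effect and the sample size, that is, on power. When the null hypothesis is true, the rate is about 5% by construction; nothing in the original p-value fixes it at 99%.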
Students' and Teachers' Delusions

We posed the question with the six multiple-choice answers to 44 students of psychology, 39 lecturers and professors of psychology, and 30 statistics teachers, who included professors of psychology, lecturers, and teaching assistants. All students had successfully passed one or more statistics courses in which significance testing was taught. Furthermore, each of the teachers confirmed that he or she taught null hypothesis testing. To get a quasi-representative sample, we drew the participants from six German universities (Haller & Krauss, 2002).

How many students and teachers noticed that all of the statements were wrong? As Figure 1 shows, none of the students did. Every student endorsed one or more of the illusions about the meaning of a p-value. One might think that these students lack the right genes for statistical thinking and are stubbornly resistant to education. A glance at the performance of their teachers, however, indicates that wishful thinking might not be entirely their fault. Ninety percent of the professors and lecturers also had illusions, a proportion almost as high as among their students. Most surprisingly, 80% of the statistics teachers shared illusions with their students. Thus, the students' errors might be a direct consequence of their teachers' wishful thinking. Note that one does not need to be a brilliant mathematician to answer the question, "What does a significant result mean?" One only needs to understand that a p-value is the probability of the data (or more extreme data), given that $H_0$ is true.

If students "inherited" the illusions from their teachers, where did the teachers acquire them? The illusions were right there in the first textbooks introducing psychologists to null hypothesis testing more than 60 years ago. Guilford's Fundamental Statistics in Psychology and Education, first published in 1942, was probably the most widely read textbook in the 1940s and 1950s.
Guilford suggested that hypothesis testing would reveal the probability that the null hypothesis is true:

"If the result comes out one way, the hypothesis is probably correct, if it comes out another way, the hypothesis is probably wrong." (p. 156)

Guilford's logic was not consistently misleading but wavered back and forth between correct and incorrect statements, as well as ambiguous ones that can be read like Rorschach inkblots. He used phrases such as "we obtained directly the probabilities that the null hypothesis was plausible" and "the probability of extreme deviations from chance" interchangeably for referring to the same thing: the level of significance. Guilford is no exception. He marked the beginning of a genre of statistical texts that vacillate between the researchers' desire for probabilities of hypotheses and what significance testing can actually provide. Early authors promoting the illusion that the level of significance would specify the probability of a hypothesis include Anastasi (1958, p. 11), Ferguson (1959, p. 133), and Lindquist (1940, p. 14). But the belief has persisted over decades: for instance, in Miller and Buckhout (1973; statistical appendix by Brown, p. 523), Nunally (1975), and in the examples collected by Bakan (1966), Pollard and Richardson (1987), Gigerenzer (1993), Nickerson (2000), and Mulaik, Raju, and Harshman (1997).

Which of the illusions were most often endorsed, and which relatively seldom? Table 1 shows that Statements 1 and 3 were most frequently detected as being false. These claim certainty rather than probability. Still, up to a third of the students and an embarrassing 10% to 15% of the group of teachers held this illusion of certainty. Statements 4, 5, and 6 lead the hit list of the most widespread illusions. These errors are about equally prominent in all groups, a collective fantasy that seems to travel by cultural transmission from teacher to student.
The last column shows that these three illusions were also prevalent among British academic psychologists who answered the same question (Oakes, 1986). Just as in the case of statistical power cited in the introduction, in which little learning was observed after 24 years, knowledge about what a significant result means does not seem to have improved since Oakes. Yet a persistent blind spot for power and a lack of comprehension of significance are consistent with the null ritual.

Figure 1. The Amount of Delusions About the Meaning of p = .01. Vertical axis: percent. Groups: psychology students (n = 44); professors and lecturers not teaching statistics (n = 39); professors and lecturers teaching statistics (n = 30). Note. The percentages refer to the participants in each group who endorsed one or more of the six false statements (based on Haller & Krauss, 2002).

Statements 2 and 4, which put forward the same type of error, were given different endorsements. When a statement concerns the probability of the experimental hypothesis, it is much more accepted by students and teachers as a valid conclusion than one that concerns the probability of the null hypothesis. The same pattern can be seen for British psychologists (see Table 1). Why are researchers and students more likely to believe that the level of significance determines the probability of $H_1$ rather than that of $H_0$? A possible reason is that the researchers' focus is on the experimental hypothesis $H_1$ and that the desire to find the probability of $H_1$ drives the phenomenon.

Did the students produce more illusions than their teachers? Surprisingly, the difference was only slight. On average, students endorsed 2.5 illusions, their professors and lecturers who did not teach statistics approved of 2.0 illusions, and those who taught significance testing endorsed 1.9 illusions.

Could it be that these collective illusions are specific to German psychologists and students?
No, the evidence points to a global phenomenon. As mentioned above, Oakes (1986) reported that 97% of British academic psychologists produced at least one illusion. Using a similar test question, Falk and Greenbaum (1995) found comparable results for Israeli students, despite having taken measures for debiasing students. Falk and Greenbaum had explicitly added the right alternative ("None of the statements is correct"), whereas we had merely pointed out that more than one or none of the statements might be correct. As a further measure, they had made their students read Bakan's (1966) classic article, which explicitly warns against wrong conclusions. Nevertheless, only 13% of their participants opted for the right alternative. Falk and Greenbaum concluded that "unless strong measures in teaching statistics are taken, the chances of overcoming this misconception appear low at present" (p. 93). Warning and reading by themselves do not seem to foster much insight. So what to do?

Table 1. Percentages of False Answers (i.e., Statements Marked as True) in the Three Groups of Figure 1
Columns, Germany 2000: psychology students; professors and lecturers not teaching statistics; professors and lecturers teaching statistics. Column, United Kingdom 1986: professors and lecturers.
Rows, statement (abbreviated):
1. $H_0$ is absolutely disproved
2. Probability of $H_0$ is found
3. $H_1$ is absolutely proved
4. Probability of $H_1$ is found
5. Probability of wrong decision
6. Probability of replication
Note. For comparison, the results of Oakes's (1986) study with academic psychologists in the United Kingdom are shown in the right column.

Question 2: How Can Students Get Rid of Illusions?

The collective illusions about the meaning of a significant result are embarrassing to our profession. This state of affairs is particularly painful because psychologists, unlike natural scientists, heavily use significance testing yet do not understand what its product, the p-value, means. Is there a cure? Yes.
The cure is to open the statistical toolbox. In statistical textbooks written by psychologists and educational researchers, significance testing is typically presented as if it were an all-purpose tool. In statistics proper, however, an entire toolbox exists, of which null hypothesis testing is only one tool among many. As a therapy, even a small glance into the contents of the toolbox can be sufficient. One quick way to overcome some of the illusions is to introduce students to Bayes' rule.

Bayes' rule deals with the probability of hypotheses, and by introducing it alongside null hypothesis testing, one can easily see what the strengths and limits of each tool are. Unfortunately, Bayes' rule is rarely mentioned in statistical textbooks for psychologists. Hays (1963) had a chapter on Bayesian statistics in the second edition of h