Shiken: JALT Testing & Evaluation SIG Newsletter
Vol. 12 No. 1. Jan. 2008. (p. 23 - 28) [ISSN 1881-5537]

Statistics Corner
Questions and answers about language testing statistics:

The Bonferroni adjustment

James Dean Brown
University of Hawai'i at Manoa


* QUESTION: An interesting paper discussing how science and non-science majors respond differently to a reading test is now online at www.jalt.org/pansig/2007/HTML/Weaver.htm [Weaver, 2007]. The author mentions using a "Bonferroni adjustment" to compare science majors with non-science majors and reduce the likelihood of a Type I error. What is a Bonferroni adjustment? When should it be used? Whereas some authors contend it is useful, at least one other author contends "… Bonferroni adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference" (Perneger, 1998).

* ANSWER: In order to answer your various questions, I will have to address a number of sub-questions of my own: (a) What's the problem with interpreting multiple statistical comparisons? (b) What's the probability of one or more t-tests being spuriously significant? (c) How can we solve this problem of spuriously significant differences? (d) How does the Bonferroni adjustment help with this problem? And, (e) does everybody agree that the Bonferroni adjustment should be used?

What's the Problem with Interpreting Multiple Statistical Comparisons?

Let me begin by stating that I am not, by any means, the first researcher to notice that the results of multiple t-tests present a set of problems (see Brown, 1988, p. 170; Dayton, 1970, pp. 37-49; Kirk, 1968, pp. 69-98; Shavelson, 1981, pp. 447-448). Essentially, the problem is that an inferential statistic like the two-sample t-test is designed to tell us whether the difference between the means of two samples is significant at a certain significance level. For example, if two samples of 17 men and 17 women had means of 18.5 and 14.5, respectively, and standard deviations of 2.5 and 3.5, respectively, we might want to find out if the observed difference is just a chance fluctuation or is a significant difference at, say, p < .05 (i.e., with a probability of less than 5%). In this case, we might use an independent samples t-test to investigate this issue, and it would turn out that t = 3.83, which is indeed significant at p < .05. Thus we can say there is less than a 5% probability that the difference between the means for men and women reported here occurred by chance (for more on this calculation, see Brown & Rodgers, 2002, pp. 205-210).
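As a rough check on the arithmetic in the men and women example, the t value can be recomputed from the summary statistics alone. The Python sketch below is not part of the original analysis; the function name is hypothetical, and it assumes the standard independent-samples formula, which with equal group sizes gives the same t whether or not the variances are pooled.

```python
import math


def t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t computed from means, standard deviations, and ns."""
    # Standard error of the difference between the two means.
    se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se_diff


# Men: M = 18.5, SD = 2.5, n = 17; women: M = 14.5, SD = 3.5, n = 17.
t_value = t_from_summary(18.5, 2.5, 17, 14.5, 3.5, 17)
print(round(t_value, 2))  # 3.83
```

With 32 degrees of freedom, the two-tailed critical value at .05 is roughly 2.04, so a t of 3.83 comfortably exceeds it.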
So far, so good. However, a problem arises if we decide we want to do another t-test in the same study (say between Japanese and Americans), or two t-tests, or ten t-tests, or fifty t-tests. For the sake of argument, let's consider an extreme example where we want to compare two groups of 100 students on each of 100 different tests (Test T001 to Test T100) using 100 t-tests. Table 1 shows the results for such a hypothetical study. Notice that Table 1 gives the means on each test for groups 1 and 2, as well as the absolute difference (i.e., regardless of sign) between those means and the absolute t-test (i.e., regardless of sign) for each comparison. Notice also that the asterisks next to some of the Abs. t values indicate that those comparisons are significant at p < .05. It turns out that six out of the 100 comparisons are significant at p < .05. So what is the problem?

[ p. 23 ]

The problem is that Table 1 shows that six out of 100 comparisons are significant at .05 even though these analyses are based on data that are 100% random. I produced the data for each pair of groups to be a set of random normal standardized T scores. Yet, six of the t-tests turned out to be significant at .05. This is approximately what we would expect from random numbers. If I were to repeat the procedures with new random normal data, there might be four or three or seven or five significant t-tests, but on average over many such pseudo-studies based on random numbers, I would expect about 5% to be significant at p < .05. These are called spuriously significant results (i.e., significant differences that occur by chance alone). Such findings are completely reasonable because p < .05 indicates that there is a five percent probability of such findings by chance alone and indeed we found that approximately five percent (actually 6% in this particular instance) were indeed significant when based on random normal data.
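Readers who want to see this behavior for themselves can run a small simulation along the following lines. This is only a sketch of the idea, not the procedure used to build Table 1: it assumes T scores with a mean of 50 and a standard deviation of 10, uses an approximate critical t of 1.972 (two-tailed, .05, df = 198), and the function names and the seed are arbitrary choices made for illustration.

```python
import math
import random

T_CRITICAL = 1.972  # assumed approximate two-tailed .05 critical t for df = 198


def mean_and_variance(scores):
    """Sample mean and unbiased sample variance of a list of scores."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
    return mean, variance


def abs_t(group1, group2):
    """Absolute independent-samples t for two equal-sized groups."""
    mean1, var1 = mean_and_variance(group1)
    mean2, var2 = mean_and_variance(group2)
    se_diff = math.sqrt((var1 + var2) / len(group1))
    return abs(mean1 - mean2) / se_diff


def count_spurious(rng, n_tests=100, n_per_group=100):
    """Count 'significant' t-tests when both groups are pure random noise."""
    count = 0
    for _ in range(n_tests):
        group1 = [rng.gauss(50, 10) for _ in range(n_per_group)]
        group2 = [rng.gauss(50, 10) for _ in range(n_per_group)]
        if abs_t(group1, group2) > T_CRITICAL:
            count += 1
    return count


rng = random.Random(2008)  # arbitrary seed so the run is repeatable
counts = [count_spurious(rng) for _ in range(20)]  # 20 pseudo-studies
print(counts)  # spurious results per pseudo-study of 100 t-tests
print(sum(counts) / (20 * 100))  # overall proportion, expected near .05
```

Each entry in the printed list plays the role of the "six out of 100" in Table 1; individual pseudo-studies vary, but the overall proportion should hover around 5%.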

Table 1. Study of 100 pairs of means based on 100 random normal data points for each test group
Test Group 1 Group 2 Abs. Diff. Abs. t Test Group 1 Group 2 Abs. Diff. Abs. t
T001 50.39 48.43 1.96 1.29 T051 49.95 50.78 0.83 0.62
T002 51.68 49.62 2.06 1.40 T052 50.53 49.60 0.93 0.64
T003 52.29 50.78 1.50 1.15 T053 49.69 49.64 0.05 0.04
T004 50.25 48.56 1.69 1.16 T054 50.45 51.63 1.18 0.85
T005 48.77 48.82 0.05 0.04 T055 49.53 49.48 0.06 0.04
T006 51.72 50.75 0.96 0.74 T056 51.35 50.00 1.35 0.96
T007 48.41 51.39 2.98 2.40 * T057 49.53 50.32 0.80 0.56
T008 50.65 49.90 0.75 0.52 T058 48.13 50.04 1.92 1.33
T009 50.71 49.99 0.72 0.54 T059 49.99 50.03 0.05 0.03
T010 50.60 49.31 1.29 1.01 T060 51.89 50.11 1.78 1.28
T011 50.34 49.31 0.86 0.64 T061 50.80 47.18 3.62 2.91 *
T012 50.78 48.98 1.79 1.20 T062 51.61 49.99 1.61 1.10
T013 49.54 49.87 0.33 0.24 T063 49.82 50.82 1.00 0.71
T014 50.37 51.07 0.70 0.52 T064 50.94 49.90 1.04 0.73
T015 50.52 48.80 1.72 1.30 T065 49.68 52.19 2.52 1.72
T016 52.13 49.68 2.46 1.82 T066 50.75 51.75 1.00 0.66
T017 49.63 49.27 0.36 0.28 T067 49.08 49.32 0.24 0.14
T018 48.95 51.45 2.50 1.62 T068 49.19 50.76 1.56 1.09
T019 51.48 48.89 2.59 1.78 T069 51.01 50.61 0.39 0.27
T020 50.80 48.82 1.97 1.34 T070 49.99 50.67 0.68 0.48
T021 48.93 48.50 0.43 0.43 T071 51.09 49.69 1.40 1.02
T022 51.26 49.79 1.47 1.01 T072 49.46 49.01 0.45 0.32
T023 50.55 49.83 0.72 0.48 T073 51.1 50.38 0.78 0.56
T024 49.65 51.33 1.68 1.19 T074 50.47 49.37 1.10 0.78
T025 51.60 48.80 2.79 2.07 * T075 50.17 51.33 1.16 0.82
T026 51.07 51.39 0.32 0.23 T076 49.72 49.88 0.16 0.11
T027 48.58 50.43 1.85 1.31 T077 50.57 50.27 0.30 0.21
T028 51.10 50.72 0.39 0.28 T078 48.72 46.85 1.87 1.41
T029 48.63 51.40 2.77 1.95 T079 50.62 49.90 0.72 0.49
T030 50.35 49.73 0.62 0.43 T080 50.98 49.15 1.84 1.42
T031 48.93 49.13 0.20 0.14 T081 49.45 48.93 0.52 0.39
T032 49.80 51.16 1.36 0.93 T082 51.10 51.10 0.00 0.00
T033 51.11 51.16 0.05 0.04 T083 49.52 50.44 0.92 0.64
T034 50.45 49.40 1.05 0.74 T084 49.87 50.18 0.30 0.22
T035 48.95 50.06 1.11 0.82 T085 51.89 50.87 1.01 0.71
T036 52.09 49.87 2.22 1.50 T086 49.31 49.72 0.41 0.32
T037 49.55 51.48 1.93 1.50 T087 52.24 50.28 1.96 1.27
T038 49.03 50.05 1.02 0.67 T088 49.92 51.34 1.42 1.05
T039 50.59 48.40 2.19 1.53 T089 49.66 49.44 0.22 0.15
T040 49.45 49.32 0.13 0.08 T090 49.76 49.28 0.49 0.33
T041 49.06 49.51 0.44 0.31 T091 50.69 51.51 0.83 0.57
T042 50.70 49.01 1.69 1.16 T092 52.07 48.40 3.67 2.77 *
T043 50.53 49.46 1.07 0.77 T093 48.42 50.04 1.62 1.17
T044 49.66 50.96 1.30 0.89 T094 47.99 50.81 2.83 2.01 *
T045 46.84 50.12 3.28 2.24 * T095 51.33 49.80 1.52 1.04
T046 49.02 49.78 0.77 0.55 T096 50.47 49.69 0.79 0.59
T047 49.88 50.04 0.16 0.11 T097 49.21 50.07 0.86 0.60
T048 49.82 49.17 0.66 0.44 T098 48.74 49.13 0.39 0.26
T049 49.32 49.02 0.30 0.20 T099 50.86 50.37 0.49 0.37
T050 49.58 50.28 0.71 0.52 T100 49.01 49.10 0.09 0.06

*p < .05, two-tailed

[ p. 24 ]


Could such patterns occur in real data? Consider a study by Politzer and McGroarty (1985) published in the TESOL Quarterly. These authors reported 228 t-tests in their study of which 11 were significant. In other words, their results were that 4.8% of their 228 t-tests were significant at p < .05 (11 / 228 = .048). Thus their results are no different from what we would expect if the numbers they used in their study had been randomly generated rather than based on real-life data. In short, despite the fact that they found "significant differences", because those differences are based on multiple comparisons (i.e., 228 t-tests), their results were no different from what we would expect by chance alone. Yet the authors of that article, who apparently did not notice this problem, went on to interpret their results and publish them in TESOL Quarterly, and other researchers continue to cite those results. That's the problem.
Is the Politzer and McGroarty study the only one with multiple statistical comparisons? No, I have noticed many such studies during my 30 years in applied linguistics. A few examples that jump immediately to mind are Anisfeld, Bogo, and Lambert (1962), who reported a total of 223 t-tests, Carrell, Pharis, and Liberto (1989), who conducted 12 t-tests, and Fotos and Ellis (1991), who included 21 t-tests.

What's the Probability of One or More t-tests Being Spuriously Significant?

The problem with multiple t-tests, then, is that a certain, but unknown, probability exists that one or more significant differences will be found by chance alone. As the number of t-tests increases in a given study, the probability that one or more spuriously significant differences will be detected increases. The gravity of this problem depends to some degree on whether the means are independent (i.e., the groups are made up of different participants, as in a comparison of males and females, where they cannot be the same people) or non-independent (i.e., the same participants are measured under different circumstances, as in a comparison of the pretest and posttest scores, where there are two scores for each person, one for the pretest and one for the posttest).
For independent means, the probability of one or more t-tests being spuriously significant (i.e., the probability of committing a Type I error, see Brown, 2007 for an explanation of Type I error) can be calculated using 1 - (1 - α)^c, where α is the predetermined acceptable significance level (e.g., .05) and c is the number of comparisons, in this case, the number of independent t-tests. For example, with α set at .05, the probability of a Type I error for one comparison is 5% (just as we would expect), for six comparisons it is 26% [1 - (1 - α)^c = 1 - (1 - .05)^6 = 1 - .7351 = .2649], for ten comparisons it is 40%, for fifteen comparisons it is 54%, for twenty comparisons it is 64%, and so forth.
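The percentages for independent comparisons can be reproduced with a few lines of Python. The sketch below treats the probability of at least one spurious result as the complement of getting no spurious results in any of c independent tests, that is, 1 - (1 - α)^c; the function name is hypothetical.

```python
def familywise_error(alpha, comparisons):
    """Probability of at least one spurious result across independent tests."""
    return 1 - (1 - alpha) ** comparisons


# Comparison counts used in the text, with alpha set at .05.
for c in (1, 6, 10, 15, 20):
    print(c, round(familywise_error(0.05, c), 4))
```

For six comparisons this gives .2649, matching the worked example, and the value keeps climbing toward 1 as the number of comparisons grows.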

[ p. 25 ]


For non-independent means, the probabilities are even higher: Cochran and Cox (1957) estimated that the probability of a spuriously significant difference occurring for six t-tests is approximately 40%, for 10 t-tests it's about 60%, for 20 t-tests it's 90%, and so forth.
Whether the means are independent or non-independent, then, the problem is that one (or more) of the observed significant differences may be spuriously significant. We can calculate the probability of this occurring, but because we cannot determine which of the "significant" differences are spurious, the interpretation of results for studies using multiple comparisons becomes very tricky, if not impossible. That's where the Bonferroni adjustment comes in.

How Can We Solve This Problem of Spuriously Significant Differences?


Well-trained researchers in our field design their studies to include one relatively complex analysis drawn from the analysis-of-variance family of inferential statistics (e.g., ANOVA, ANCOVA, MANOVA, MANCOVA, etc.) that maintains an experimentwise alpha level of .05 or .01 yet allows a posteriori multiple comparison tests like the Duncan, Newman-Keuls, Dunnett, Tukey HSD, or Scheffé test (see Kirk, 1968, pp. 87-97; Jaccard, Becker, & Wood, 1984 for more on these methods). The issue of multiple comparisons is one reason that such analyses are so useful. Indeed, we must ask ourselves why these more complex analyses would exist at all if we could simply go ahead and do as many t-tests as we want. Unfortunately, using these more complex ANOVA-family designs requires a fair amount of training (e.g., when I was at UCLA, ANOVA was a course in educational statistics requiring three prerequisite statistics courses). Also unfortunately, most of the people doing statistical research in our field do not have that much training.
What can researchers who are not so well trained do? One thing they can do is to go ahead and do multiple t-tests, but then use the Bonferroni adjustment to at least roughly compensate for the mess they are making with their p values.


How Does the Bonferroni Adjustment Help with this Problem?


One version of the Bonferroni adjustment that is commonly used is as follows: α_adjusted = α / c (where α is the overall experimentwise alpha; c is the number of comparisons made; and α_adjusted is the adjusted alpha level at which each of c comparisons must be tested for significance).
For example, we might want to apply the Bonferroni adjustment to the data in Table 1 in order to maintain an experimentwise alpha of .05. Given that we have 100 comparisons, the adjusted alpha would be α_adjusted = α / c = .05 / 100 = .0005. Checking to see if any of the t-tests in Table 1 reached p < .0005, it turns out that none of the results are significant. Thus, all six spuriously significant differences have been removed with the result that no significant differences have been found for the randomly generated data in Table 1, which is just as it should be. Another example of using the Bonferroni adjustment is provided in Sasaki (1996) who did 25 t-tests and found (after the adjustment) that 24 of those t-tests were significant. And, of course, the Weaver (2007) article that you mentioned in your question used the same strategy for the same purpose.
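The check against Table 1 described above can be sketched in Python as follows. The six t values are those marked with asterisks in Table 1. The p values here are only approximations: the sketch assumes a normal approximation to the t distribution, which is reasonable with 198 degrees of freedom but is not how the original results were obtained, and the function names are hypothetical.

```python
import math


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison alpha that keeps the experimentwise alpha at `alpha`."""
    return alpha / comparisons


def approx_two_tailed_p(t_value):
    """Normal approximation to the two-tailed p value (assumes large df)."""
    return math.erfc(abs(t_value) / math.sqrt(2))


# The six comparisons flagged as significant at p < .05 in Table 1.
significant_ts = {
    "T007": 2.40, "T025": 2.07, "T045": 2.24,
    "T061": 2.91, "T092": 2.77, "T094": 2.01,
}

adjusted = bonferroni_alpha(0.05, 100)
survivors = [
    test for test, t in significant_ts.items()
    if approx_two_tailed_p(t) < adjusted
]
print(f"{adjusted:.4f}")  # adjusted alpha for 100 comparisons
print(survivors)  # tests still significant after the adjustment
```

Even the largest t in the table (2.91) has an approximate p value of about .004, well above .0005, so the list of survivors comes back empty.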

[ p. 26 ]


Increasingly, researchers in applied linguistics are using this Bonferroni adjustment strategy as one of their many statistical tools. However, because the Bonferroni adjustment is a rough approximation, it is not always the best solution for the problem of multiple comparisons. Often the best way to proceed is to use the ANOVA family of statistical tools. Nevertheless, if there is no other solution, the Bonferroni adjustment is better than nothing.
I should also point out that the Bonferroni adjustment is not restricted to use with multiple t-tests. It is also used by researchers to account for spurious significances that might occur in multiple ANOVAs, multiple chi-squared tests, multiple correlation coefficients, and so forth.

Does Everybody Agree that the Bonferroni Adjustment Should be Used?


As Perneger (1998) points out, application of the Bonferroni adjustment is increasingly widespread in the social sciences. Does everybody agree that this adjustment should be used? No. But then, does everybody agree on anything in the social sciences? The answer to this question is also no. Maybe in the social sciences as in the sciences, we should take the safe route and follow whatever the consensus is. If so, the consensus seems to be that using the Bonferroni adjustment is all right if you use it cautiously. But then, following the herd is not always the correct way to go. So what are you to do?
Perhaps the clearest way to think about the differences in views on this issue is to recognize that they range from the relatively liberal positions of some people (like Perneger, 1998; Siegel, 1990), who seem to be saying that multiple t-tests are okay, to the very conservative views expressed by other researchers (like myself in Brown, 1990, 2001; Sasaki, 1996; Tabachnick & Fidell, 2001, 2007; and Weaver, 2007), who advocate that care be taken in interpreting multiple t-tests and that the Bonferroni adjustment is one way of being careful. One problem I see with the views of people on the liberal end of this issue is that they tend to assume that researchers (and readers of research) have more knowledge about statistics than they actually have. Certainly in applied linguistics, we should probably never assume much statistical sophistication on the part of readers or researchers.
As I have pointed out elsewhere, "The position taken by this author is based on the 'conservative' philosophy that, in applying statistics, great care and caution must always be practiced in order to minimize the possibility of publishing 'significant' results that may have occurred by chance alone" (Brown, 1990, p. 770). In short, I know enough about statistics and research to be very distrustful of the numbers and statistics involved. As a result, I take a conservative and careful position on the issue of multiple comparisons, one that sometimes involves the Bonferroni adjustment.

[ p. 27 ]

References

Anisfeld, E., Bogo, N., & Lambert, W. E. (1962). Evaluational reactions to accented English speech. Journal of Abnormal and Social Psychology, 65, 223-231.

Brown, J. D. (1988). Understanding research in second language learning: A teacher's guide to statistics and research design. Cambridge: Cambridge University Press.

Brown, J. D. (1990). The use of multiple t tests in language research. TESOL Quarterly, 24 (4), 770-773.

Brown, J. D. (2001). Using surveys in language programs. Cambridge: Cambridge University Press.

Brown, J. D. (2007). Statistics Corner. Questions and answers about language testing statistics: Sample size and power. Shiken: JALT Testing & Evaluation SIG Newsletter, 11 (1), 31-35. Also retrieved from the World Wide Web at http://jalt.org/test/bro_25.htm

Brown, J. D., & Rodgers, T. S. (2002). Doing second language research. Oxford: Oxford University Press.

Carrell, P. L., Pharis, B. C., & Liberto, J. C. (1989). Metacognitive strategy training for ESL reading. TESOL Quarterly, 23, 647-678.

Cochran, W. G., & Cox, G. M. (1957). Experimental designs. New York: John Wiley.

Dayton, C. M. (1970). The design of educational experiments. New York: McGraw-Hill.

Fotos, S., & Ellis, R. (1991). Communicating about grammar: A task-based approach. TESOL Quarterly, 25 (4), 605-628.

Jaccard, J., Becker, M. A., & Wood, G. (1984). Pairwise multiple comparison procedures: A review. Psychological Bulletin, 96, 589-596.

Kirk, R. E. (1968). Experimental design: Procedures for the behavioral sciences. Belmont, CA: Brooks/Cole.

Perneger, T. V. (1998, April 18). What's wrong with Bonferroni adjustments? British Medical Journal, 316 (7139), 1236-1238. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1112991

Politzer, R. L., & McGroarty, M. (1985). An exploratory study of learning behaviors and their relationship to gains in linguistic and communicative competence. TESOL Quarterly, 19 (1), 103-123.

Sasaki, C. L. (1996). Teacher preferences of student behavior in Japan. JALT Journal, 18, 229-239.

Shavelson, R. J. (1981). Statistical reasoning for the behavioral sciences. Boston, MA: Allyn & Bacon.

Siegel, A. F. (1990). Multiple t tests: Some practical considerations. TESOL Quarterly, 24 (4), 773-775.

Tabachnick, B., & Fidell, L. (2001). Using multivariate statistics (4th ed.). Needham, MA: Allyn & Bacon.

Tabachnick, B., & Fidell, L. (2007). Using multivariate statistics (5th ed.). Boston: Pearson.

Weaver, C. (2007). A Rasch-based evaluation of the presence of item bias in a placement examination designed for an EFL reading program. In T. Newfields, I. Gledall, P. Wanner, & M. Kawate-Mierzejewska (Eds.), Second Language Acquisition - Theory and Pedagogy: Proceedings of the 6th Annual JALT Pan-SIG Conference. Sendai, Japan: Tohoku Bunka Gakuen University. Retrieved from http://jalt.org/pansig/2007/HTML/Weaver.htm.



Where to Submit Questions:
Please submit questions for this column to the following address:
JD Brown
Department of Second Language Studies
University of Hawai'i at Manoa
1890 East-West Road
Honolulu, HI 96822 USA


http://jalt.org/test/bro_27.htm

[ p. 28 ]