Is it time to stop using statistical significance?
- Oliver Frank, CW Michael Tam, Joel Rhee
- Aust Prescr 2021;44:16-18
- 1 February 2021
- DOI: 10.18773/austprescr.2020.074
The important first step in the critical appraisal of a randomised trial is not an evaluation of the statistical analyses. The most important aspect to consider when reviewing a study of a new drug is the appropriateness and quality of the trial design and methods.
The next most important aspect is the effect size of different treatments and its clinical significance. Rather than reporting statistical significance, studies should report the difference between treatments and its precision.
Over-reliance on statistical significance and p values may lead to incorrect conclusions. Trial reports about drugs should therefore avoid the term statistical significance and quote p values with caution.
Criticisms of the misuse and misinterpretation of statistical significance testing (and of p values) were made throughout the last century.1 William Rozeboom, an eminent philosopher of science, once asserted that it was ‘surely the most bone-headedly misguided procedure ever institutionalised in the rote training of science students’.2 This criticism reached a zenith in 2019, when the American Statistical Association, an international peak body of professional statisticians, formally recommended against statistical significance testing – both in its use and in the reporting of results.3
Consider the following statement from an Australian Prescriber new drug comment on fremanezumab:
‘At the end of the trial, monthly injections had reduced the number of headache days by 4.6 days and the number of migraine days by 5.0 days. With quarterly injection the reductions were 4.3 days for headache and 4.9 days for migraine. Both regimens were significantly better than the reductions of 2.5 days and 3.2 days seen in the placebo group.’
For most readers of Australian Prescriber, that statement might seem eminently reasonable. However, the routine use of the word ‘significantly’ is misleading.3
To understand why the term statistical significance is problematic, it is necessary to consider the context in which statistical significance testing occurs. Empirical research is about discovering and constructing knowledge about the world, for instance, whether a new drug works from the perspective of causation and predicting patient outcomes. This research often involves describing the empirical world using numbers (quantitative methods). Statistical inferential testing can be a useful tool whose results can inform us about the real world. However, discomfort with uncertainty promotes overconfidence in statistical rituals,5 and contributes to the belief that statistical testing is always necessary.
Clinicians commonly misinterpret statistical significance and its conceptual twin, the p value.6 This potentially results in gross overestimation of the strength of evidence.7 Importantly, neither the validity of the study nor the truth of its findings can be inferred from p values and statistical significance alone.
Two simple heuristics to reduce misinterpretation of p values and statistical significance are:6
Statistical significance is fundamentally a mathematical concept that should be understood only in the context of null hypothesis statistical testing. This involves creating a statistical model, a simplified and artificial ‘mathematical world’ where the researcher can define all the rules. In this model, one of the rules is that drugs or procedures have zero effectiveness – hence the term null hypothesis.
Seen from within the mathematical world, using the assumptions of this ‘zero-effectiveness’ statistical model, the unusualness of the real-world data collected in the study can be calculated. The p value can be considered a measure of how compatible the data are with this statistical model. Larger p values are more compatible with the null hypothesis and small p values less so.
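The idea of measuring how unusual the data are under a ‘zero-effectiveness’ model can be sketched with a permutation test. This is only an illustration, not the analysis used in any trial discussed here: the function name, the shuffle count and the data are all hypothetical. If treatment does nothing, the group labels are interchangeable, so the sketch shuffles them repeatedly and counts how often a difference at least as large as the observed one arises.

```python
import random


def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p value for a difference in group means.

    The 'zero-effectiveness' model is imitated by shuffling the group
    labels: if treatment does nothing, any relabelling is equally likely.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_treated = len(treated)
    pooled = list(treated) + list(control)
    n_control = len(pooled) - n_treated

    observed = sum(treated) / n_treated - sum(control) / n_control

    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_diff = (sum(pooled[:n_treated]) / n_treated
                         - sum(pooled[n_treated:]) / n_control)
        if abs(shuffled_diff) >= abs(observed):
            at_least_as_extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# Hypothetical reductions in headache days (invented, not trial data).
treated_reductions = [6, 5, 7, 4, 6, 5, 8, 5]
placebo_reductions = [3, 2, 4, 3, 1, 4, 2, 3]

p_value = permutation_p_value(treated_reductions, placebo_reductions)
print(f"p = {p_value:.4f}")
```

A small p value here says only that the invented data sit poorly with the shuffled-label model. It says nothing about why, and nothing about whether the difference matters to patients.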
Statistical significance only means that the data reached an arbitrarily defined level of incompatibility with the statistical model. However, this zero-effectiveness statistical model might be incompatible with the data for many reasons. For instance, the data collected might have been biased, or one or more assumptions used in the statistical model were unsound or violated. Statistical significance does not indicate on its own that the result is true or that the null hypothesis is false. Moreover, statistical significance does not indicate or imply that a result is clinically important.
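The gap between statistical and clinical significance can be illustrated with a rough calculation. The sketch below assumes a normal approximation, equal group sizes and a known common standard deviation. All the numbers are hypothetical and chosen only to make the point: a very small difference in a very large trial can yield a tiny p value, while a large difference in a small trial may not.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(mean_difference, sd, n_per_group):
    """Normal-approximation p value for a difference between two group means.

    Assumes equal group sizes and a common, known standard deviation.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical scenarios (invented numbers, outcome measured in days).
# A 0.2-day difference is assumed here to be clinically trivial.
trivial_effect_huge_trial = two_sided_p_value(0.2, sd=5.0, n_per_group=20_000)
# A 3-day difference is assumed here to be clinically meaningful.
large_effect_small_trial = two_sided_p_value(3.0, sd=5.0, n_per_group=10)

print(f"0.2-day difference, 20,000 per group: p = {trivial_effect_huge_trial:.5f}")
print(f"3-day difference, 10 per group:       p = {large_effect_small_trial:.2f}")
```

Under these assumptions the trivial difference falls below the conventional 0.05 threshold and the larger one does not, which is why the p value alone cannot stand in for a judgement about clinical importance.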
Clinical significance pertains to patient care. Whether or not a study result is clinically significant cannot be determined by an algorithm. Rather, it requires judgement, clinical expertise and a respect for context.
The important first step in the critical appraisal of a clinical trial is not an evaluation of the statistical analyses. Analysing the patients, intervention, comparison and outcomes in the methods section of the report, and being satisfied with the reasonableness of the question asked by the researchers, is important in deciding whether or not to read more of the report.
Next is an appraisal of the internal validity of the trial, which can be framed as a series of questions. For a randomised trial:9
Threats to the internal validity of a study’s methodology reduce the confidence that the results usefully represent what the study sought to investigate. Simply, if the study has major methodological biases, the results will need to be taken with a grain of salt. The results might even be uninterpretable.
When looking at trial results, the focus should be on the primary outcome, its effect size, and the precision with which that effect has been estimated. This precision is often described as a confidence interval. If the differences in outcomes between groups are small, there is likely to be little clinical benefit from using a trial treatment instead of a comparator. However, it is important to remember that the reported effect size is the average for the sample of people in the study, and it is likely that many participants (half of the sample, assuming a normal distribution) benefited more while others (again half, assuming a normal distribution) benefited less. Whether an effect size is clinically significant depends on the nature of the condition, the effect and the context. Synthesising these requires clinical judgement. Fortunately, investigators often include a discussion of clinical significance when describing the power and sample size calculations in the methods section of their reports.
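As a small worked example of focusing on effect size, the between-group differences implied by the figures quoted from the new drug comment can be calculated directly. The reductions are those quoted in the comment. The dictionary layout and function name are illustrative only, and the sketch simply subtracts the placebo reduction from each treatment reduction. It gives point estimates without any measure of precision.

```python
# Mean reductions in days, as quoted in the new drug comment.
reductions = {
    "monthly": {"headache": 4.6, "migraine": 5.0},
    "quarterly": {"headache": 4.3, "migraine": 4.9},
    "placebo": {"headache": 2.5, "migraine": 3.2},
}


def difference_from_placebo(regimen, outcome):
    """Treatment reduction minus placebo reduction, rounded to one decimal."""
    treatment = reductions[regimen][outcome]
    placebo = reductions["placebo"][outcome]
    return round(treatment - placebo, 1)


for regimen in ("monthly", "quarterly"):
    for outcome in ("headache", "migraine"):
        diff = difference_from_placebo(regimen, outcome)
        print(f"{regimen} vs placebo, {outcome} days: {diff} fewer")
```

On these figures each regimen reduced headache and migraine days by roughly two more days than placebo, which is the quantity a reader needs in order to judge clinical significance.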
A useful concept to consider is the minimum clinically important difference, especially when there may not be a good intuitive grasp of the outcome measure. For example, the six-item headache impact test (HIT-6) has a range from 36 (no impact) to 78 (very severe). The minimum clinically important difference is considered to be 2.5 points.10 In the trial described in Australian Prescriber, fremanezumab reduced the HIT-6 score compared with placebo by 1.9 when given quarterly and by 2.4 when given monthly.11 Both changes are statistically significant, but are less than the minimum clinically important difference. It is important to note that only about 20% of participants in the trial were using any migraine-preventing medicine. When balancing the modest average therapeutic effect of fremanezumab with the need for it to be injected and its high cost compared to established drugs for migraine prophylaxis, it seems hard to justify it as a first-line treatment.
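The HIT-6 comparison above amounts to checking each effect estimate against the minimum clinically important difference. A minimal sketch follows, using the figures given in the text. The variable and function names are illustrative, and a simple threshold comparison is an assumption that cannot replace clinical judgement.

```python
# Figures from the text: HIT-6 reductions compared with placebo.
MCID_HIT6 = 2.5  # minimum clinically important difference, in points
hit6_reduction_vs_placebo = {"quarterly": 1.9, "monthly": 2.4}


def reaches_mcid(effect, mcid):
    """True if the size of the effect is at least the MCID."""
    return abs(effect) >= mcid


mcid_results = {
    regimen: reaches_mcid(effect, MCID_HIT6)
    for regimen, effect in hit6_reduction_vs_placebo.items()
}

for regimen, reached in mcid_results.items():
    verdict = "reaches" if reached else "falls short of"
    print(f"{regimen}: {hit6_reduction_vs_placebo[regimen]} points {verdict} the MCID")
```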
The confidence interval, typically reported at 95%, can be interpreted as the (im)precision of the effect-size estimate. This is the range of values that are mathematically compatible with the effect-size estimate.
If the confidence interval is wide, the lower and upper limits indicate very different clinical effects ranging from a tiny effect size to a substantial effect. The effect-size estimate is therefore imprecise and it would be misleading for it to be quoted without caution and appropriate context.
If the confidence interval is subjectively narrow, the lower and upper limits would give roughly the same clinical interpretation. It could then be claimed that the estimate of effect size is precise.
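The distinction between a wide and a narrow interval can be sketched as follows. This assumes a normal approximation for the interval, and it treats ‘roughly the same clinical interpretation’ as both limits lying on the same side of a single minimum clinically important difference. That is a deliberate simplification, and the estimates, standard errors and threshold are all hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate, standard_error, level=0.95):
    """Normal-approximation confidence interval for an effect-size estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return estimate - z * standard_error, estimate + z * standard_error


def same_clinical_reading(lower, upper, mcid):
    """True if both limits fall on the same side of the MCID (a simplification)."""
    return (lower >= mcid) == (upper >= mcid)


HYPOTHETICAL_MCID = 2.5  # invented threshold, in points

# Same hypothetical point estimate, estimated with different precision.
narrow = confidence_interval(4.0, standard_error=0.3)
wide = confidence_interval(4.0, standard_error=1.5)

for label, (lower, upper) in (("narrow", narrow), ("wide", wide)):
    consistent = same_clinical_reading(lower, upper, HYPOTHETICAL_MCID)
    print(f"{label}: {lower:.2f} to {upper:.2f}, same reading at both limits: {consistent}")
```

In this sketch the two intervals share a point estimate, yet only the narrow one supports a single clinical reading. The wide interval spans effects below and above the threshold.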
Judgement and care are required regardless of the confidence interval. A large drug trial undertaken in men could conceivably yield a very precise effect-size estimate that would be incorrect in women.
As an exercise to develop insight, try replacing instances of the term statistically significant with the synonym ‘mathematically unusual’. Paraphrasing the original quoted Australian Prescriber new drug comment as ‘both regimens were [statistically] significantly better than the placebo group’ becomes ‘both regimens were mathematically unusually better than the placebo group’. The apparent meaninglessness of the second sentence is what is meant by the first.
The hidden absurdity of commonly seen statements in reports such as ‘the results approached [statistical] significance’ is revealed when they are transformed into ‘the results approached mathematical unusualness’.
Significance is still a useful word that should not be abandoned. However, for too long statistical significance has co-opted the use of the word. The medical literature commonly conflates statistical significance with the everyday meaning of significance. In line with the recommendation of the American Statistical Association, it is time to move on. Its executive director wrote in unambiguous terms ‘statistically significant – don’t say it and don’t use it’.3 Rather, we should focus on the effect-size estimate and its precision and interpret these through the lens of clinical significance.
Conflicts of interest: none declared
Specialist general practitioner, Oakden Medical Centre, Hillcrest, South Australia
Discipline of General Practice, Adelaide Medical School, University of Adelaide
Staff specialist, Primary and Integrated Care Unit, South Western Sydney Local Health District
Conjoint senior lecturer, School of Population Health, UNSW Sydney
Associate professor of General Practice, School of Medicine, University of Wollongong, New South Wales