• 13 Apr 2021
  • 18 min 49

Just because something is mathematically significant does not mean it is clinically significant. Ashlea Broomfield chats with GP Oliver Frank about ‘statistical significance’ and whether it is time to scrap the term altogether. Read the full article in Australian Prescriber.

Transcript

Welcome to the Australian Prescriber Podcast. Australian Prescriber, independent, peer-reviewed and free.

Welcome to the Australian Prescriber Podcast. My name is Ashlea Broomfield, and I'm here today with Dr Oliver Frank, who is a specialist general practitioner and researcher with the University of Adelaide. Oliver has written an article for Australian Prescriber discussing whether it is time to scrap the phrase ‘statistical significance’. So I'm really pleased to be interviewing you today, Oliver, so we can chat all about statistics in a fun kind of way.

Yeah. Thank you very much.

Perhaps we should start with some definitions. What are we talking about with statistical significance?

That's a very interesting question. Statistical significance is really a mathematical concept. Statistical testing, itself, is a mathematical concept, and it's important to understand that and how it relates to research that has been conducted.

And I love how in the article you managed to make what sometimes people feel is a dry subject into interesting reading. I think the most interesting point that you made is, "Just because something is mathematically significant does not mean that it's clinically significant."

No, that's right. And that is really one of the key messages in this article. As you'll see in the article, we actually talk about... And I do want to acknowledge my co-authors Michael Tam and Joel Rhee, who helped to make it more of a fun article. It was really their inspiration that made it more fun. Rather than talking about mathematical significance, we're suggesting we call it mathematically unusual, because we are trying to get away from uses of the word significance that can be misleading to people. So in mathematics, we can talk about things that are unusual or surprising. Just the unusualness of it. That is the key thing.

You make a point that statistical significance shouldn't be conflated with the notion of a p value. Could you go through the difference in what is significance versus statistical significance versus a p value?

Yes, sure. P values have really become a kind of shorthand way that people have been trying to work out whether they should take any notice of the study. And journal editors, still largely, are continuing to insist that authors supply p values and you'll see them quoted all the time in abstracts and in bodies of papers. And of course, everybody's aware of this question of p being less than 0.05. Having something cross that 0.05 line doesn't mean, on one side, the study is important and meaningful, and on the other side that it isn't. The fundamental assumption in calculating a p value is that neither treatment, if there are two treatments, has any effect at all. And you know, this would probably shock people standing in the front bar if you were to tell them that when we calculate this, we're assuming that if two medicines are being compared, or a placebo medicine and a medicine thought to be active are being compared, that the calculation of the p value assumes that neither has any effect at all.

And on that assumption, which has very little to do with the real world, which is why we say it's a mathematical concept... On that assumption, the p value tells us, if those two treatments both had no effect at all, what is the likelihood that the results we got, that show a difference, would be as extreme or more extreme than that?
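To make that idea concrete, here is a minimal sketch in Python of the logic behind a p value under the 'no effect at all' assumption. All of the numbers are invented purely for illustration; they are not from the article or from any real trial.

```python
# Sketch: a p value assumes NEITHER treatment has any effect (the null
# hypothesis) and asks how often a difference at least as extreme as the
# one observed would turn up by chance alone.
import numpy as np

rng = np.random.default_rng(0)

n_per_arm = 100         # hypothetical number of patients per arm
observed_diff = 0.12    # hypothetical observed difference in response rates
shared_rate = 0.30      # under the null, both arms respond at the same rate

# Simulate many 'no effect' trials and record the difference each produces.
simulated_diffs = np.abs(
    rng.binomial(n_per_arm, shared_rate, 100_000) / n_per_arm
    - rng.binomial(n_per_arm, shared_rate, 100_000) / n_per_arm
)

# Two-sided p value: the proportion of 'no effect' worlds that give a
# difference at least as extreme as the observed one.
p_value = np.mean(simulated_diffs >= observed_diff)
print(f"Simulated p value under the no-effect assumption: {p_value:.3f}")
```

Note that nothing in this calculation says anything about how big the difference is or whether it would matter to a patient.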

So this is quite a tricky concept. And as you can understand, when we're testing medicines where we believe that at least one of them has some effect, starting with a calculation that gives you a figure based on the assumption that neither has any effect at all, doesn't really tell us what we're trying to find out. And the p value, as we've said in the article, really doesn't tell you, "Is this study likely to be meaningful or what is the probability of the thing under test having some effect?" So, p values really don't have a place in the kind of studies that we're interested in and the kinds of things that Australian Prescriber publishes about, which is tests of new medicines mostly.

So, keeping on that line of questioning where we're talking about the language, given that p values or statistical significance is generally a mathematical term, you used the term in the article ‘mathematically unusual’ to reframe some of the phrases that we might be considering in studies to show the apparent uselessness of the phrase ‘statistically significant’. So instead of ‘the results approached statistical significance’ you suggest that the term should be ‘the results approached mathematical unusualness’, which really puts it into perspective of creating an arbitrary line where we say that, "At this point, then this drug is better than the other or the effect is somehow different."

And I love that this is something that you and your co-authors have brought about in terms of a discussion, because as a busy clinician, or student, or researcher, sometimes it's easier instead of reading the whole article to jump to the results and conclusions section and just read whether it was statistically significant or not, or just use the abstract and see whether there's statistical significance or not. And we really know that that's not necessarily the best indicator of whether something is effective or not. So, I'm interested in you talking through for our listeners, if we move away from the term ‘statistical significance’ and moving towards looking at internal and external validity, what that actually looks like when we're looking at a study.

Yeah, that's right. In the article what we're saying is that looking at any statistical analysis is really something that should be the very last part of what we do when we're trying to work out whether a report of a study is telling us something important that we want to know. I think we just indirectly referred to the fact that we kind of assume statistical testing is always needed, but in fact, when results are clearly and obviously different, no statistical testing needs to be done at all. I've tried to persuade journal editors about that when we've had results like that: there was a wild difference in the results, and we said, look, anybody can see these are different, but they still wanted statistical testing. It's become very strongly ingrained. What we're saying in the article is that when we're looking at reports of studies, what we really need to look at first is whether the study was reasonable and conducted in an appropriate way.

And the very first question is, "Is the research question even of interest and importance?" If somebody is proposing to research a question that we think has no relevance to our patients or the patients out in the real world, or is about something not really important, there are increasing questions being asked: "Should studies like that even be funded at all?" Assuming that the question being asked does seem important, then, as we've mentioned in the section called ‘clinical significance’, we should look at, "Do we think the study was conducted in a reasonable way?" And our comments there are about randomised controlled trials. So, we've put in there some of the fundamental questions to ask: "Did the researchers go about this in a way that means we could have some confidence in the findings, that they actually might be meaningful in some way?" And, when I say meaningful, I'm talking about clinically meaningful. And then the single most important thing out of that is what's called the effect size.

And that is, "What was the difference in results for people on a new treatment and people on some comparator, whether it was placebo or an existing medicine, and how big is that difference, and is it big enough for us to care?" And this is where we might've all seen studies which have large numbers of patients, and there's actually very little difference in the outcomes, but in a mathematical sense you can show that there is a difference that traditionally has been called significant, even though it doesn't really mean anything. And then the remaining question is what we call the ‘confidence interval’, which is to say, "How precise do we think that difference we found is?" If it's got a pretty narrow range, we can say, "Well, that's good." If it's got a very wide range, which means there could have been a much smaller effect or a much bigger effect than that point estimate suggests, we have to be more cautious.
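As a rough illustration of that last point about precision (the response rates and trial sizes below are assumptions chosen purely for illustration, not from any study), the same observed difference gives a much wider confidence interval in a small trial than in a large one:

```python
# Sketch: how the width of a 95% confidence interval for a difference in
# response rates depends on the size of the trial (normal approximation).
import math

def risk_difference_ci(p_new, p_old, n_per_arm):
    """Point estimate and approximate 95% CI for a difference in proportions."""
    diff = p_new - p_old
    se = math.sqrt(p_new * (1 - p_new) / n_per_arm
                   + p_old * (1 - p_old) / n_per_arm)
    return diff, diff - 1.96 * se, diff + 1.96 * se

for n in (50, 5000):
    diff, lo, hi = risk_difference_ci(p_new=0.55, p_old=0.50, n_per_arm=n)
    print(f"n per arm = {n:5d}: effect size {diff:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")

# With 50 patients per arm the interval is wide (the true effect could be much
# smaller or much larger than the point estimate); with 5000 per arm the same
# 5-point difference is estimated precisely, but it may still be too small to
# matter clinically.
```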

We also touch on the fact, and this is a phenomenon of randomised controlled trials as conventionally done, is that not every patient has the same outcome. The new medicine might've had a very big effect on some patients and a very small or no effect on others, but they all got lumped in together and so we produce a figure that represents the whole group overall. So, we need to keep that in mind, this is a limitation of the conventional method of reporting the results of randomised controlled trials and there are efforts being made to consider that question of, "Are there certain people who actually have a better result or a bigger effect than others?" And that's important for us to know, because of course we know all our patients are not the same.

Yes. And I think from a general practitioner's perspective, this concept of clinical significance is really important in decision making, because our patients really want to know, "Is this drug going to work for me or not, and how much better is it going to be in addition to everything else that I'm on?" And I think an important point you raised in an example in the article, around the treatment of migraine, is that in the particular study you referred to, on analysis, many of the people were not on migraine prophylaxis medication, which is highly unusual in a clinical setting for intractable migraine, chronic daily migraine or severe symptoms. We would normally start with migraine prophylaxis before going to other more expensive and specialised medication, and straight away that tends to make the external validity basically useless if the study population is not comparable to our patient populations.

Yeah, that's right. And what I didn't specifically mention was setting. So, I don't remember exactly, I think these trials might've been done in the US. We know in the US many people have little or no access to health care, and they just simply can't afford some medicines. We're lucky here that the PBS subsidises most medicines, but not all. And therefore we would expect patients with migraines this bad to be on some kind of preventive treatment. So, this is one of the relevant questions for this study.

One of the other things that you and your co-authors said in the article was that the medical literature commonly conflates statistical significance with the everyday meaning of significance. And you mentioned before that when there's an easily determined difference, there's no need for a p value. Can you give us an idea of what that would look like? If we moved away from talking about statistical significance, how could that help us as clinicians to view things differently? How would we know that something is significant in the everyday sense?

Take any simple example: say you had a medicine for high blood pressure, and on some conventional treatment five people in 100 had a reduction of more than whatever amount was specified, but in the other arm, on the new medicine, 95 people out of 100 achieved that reduction. The difference between five people in 100 and 95 people in 100 is so large that you don't need to do any statistical testing. You can say clearly the newer medicine achieved that effect on blood pressure for a lot more people. It's just a kind of common-sense thing. Some statistical analysis, essentially to produce a confidence interval, is really only needed where it's not obvious what the difference is or how big it is.
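A minimal sketch of the arithmetic for that hypothetical blood pressure example, using the 5-in-100 and 95-in-100 figures from the conversation (the confidence interval uses a simple normal approximation):

```python
# Sketch of the arithmetic for the hypothetical blood pressure example:
# 5/100 responders on conventional treatment versus 95/100 on the new one.
import math

p_old = 5 / 100
p_new = 95 / 100

diff = p_new - p_old   # absolute difference in response rates
se = math.sqrt(p_new * (1 - p_new) / 100 + p_old * (1 - p_old) / 100)

print(f"Difference: {diff:.2f} (90 percentage points)")
print(f"Approximate 95% CI: ({diff - 1.96 * se:.2f}, {diff + 1.96 * se:.2f})")
# The interval sits nowhere near zero, so no formal statistical test is
# needed to see that the two treatments performed very differently.
```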

And that raises another interesting question really is, if the effect is not very obvious, then is there really an effect at all? Is that going to translate in terms of clinically significant outcomes for patients?

Well, that's right. So of course, effects can be and often are quite small, and that comes back to that question of the minimal clinically important difference. And if it's approaching that kind of level, then you might want to know the confidence interval to say, "Well, how precise is our estimate of the effect, and in order to really let us know, are we really getting down to levels where we're not sure that the effect size actually matters at all?" So, statistical analysis is a secondary method of trying to work some of that out, but it's really not mandatory. And I think that's again another assumption wrongly made by doctors and by journal editors that you must always do statistical testing. That is not true.

And it's an interesting concept to have to explain to people about to take a medication: "Well, how many people do I have to give this drug to for one person to have benefit?", which is the number needed to treat. To explain to someone who's sitting in front of you, "Well, if I gave this medication to 10 people with your problem, one of them might get benefit from it", is really interesting, because you might have, say, 10,000 people where 1000 of them got benefit, and that seems like a really big number of people, whereas for that person sitting in front of you a one in 10 chance may not actually be good enough for them to take a chance on that specific medication.
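The arithmetic behind the number needed to treat is straightforward; here is a minimal sketch using the one-in-ten figure from the conversation (assuming, purely to keep the example simple, that nobody benefits without the drug):

```python
# Number needed to treat (NNT) from the example rates in the conversation.
control_benefit_rate = 0.0   # assumed for simplicity: no one benefits without the drug
treated_benefit_rate = 0.10  # 1 in 10 benefit on the drug (from the example)

arr = treated_benefit_rate - control_benefit_rate   # absolute risk reduction
nnt = 1 / arr
print(f"NNT = {nnt:.0f}")    # 10 people treated for 1 to benefit

# Scaled up: 10,000 people treated means about 1,000 expected to benefit,
# yet the individual patient's chance of benefit is still only 1 in 10.
print(f"Expected beneficiaries in 10,000 treated: {10_000 * arr:.0f}")
```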

Yeah. That approach to trying to give patients an idea of their chance of benefit has clearly been shown to be useful. And I think when we do that, we're often shocked to find how many people have to be treated for one person to achieve the particular benefit we're interested in.

Mm-hmm (affirmative). In your article you talked about an American organisation that has recommended we no longer use this term. You and your co-authors are proposing that something similar happens here in Australia.

Yeah, absolutely. It's misleading, as we've said, because the use of statistics is a mathematical thing; it's based on an artificially constructed mathematical world and therefore can mislead. And one of its misuses is churning out these p values, which people treat, as we've said, as a kind of stamp of certification that a study was conducted in a valid way, that its results are important and that we should take note of them. In fact, a p value doesn't tell you any of that. So, it's had this bad effect where researchers are chasing results, and will manipulate results, in order to achieve a required p value. It's kind of corrupted the conduct and reporting of science in that chase. For that reason we agree with the American Statistical Association: because this kind of fetish has grown out of the misuse of statistical testing, we think we should stop talking about statistical significance altogether, because it does more harm than good.

So in summary, we really have to go back to the basics of, “Was this trial conducted well, can I trust what I'm reading in this article and does it apply to me and my patients, and if so, how?”

That's right. And “How big was the effect?” Just looking at how much difference was there between any new treatment or treatments being compared, and is that difference big enough for us to think it actually would matter?

Mm-hmm (affirmative). So, don't just read the results section then.

No, that's probably the important message. And I think we said that close to the beginning: the results and any statistical analysis should be the last thing you look at. Start by deciding whether the whole study was worth doing in the first place, because it asked an important question or not, then whether it was conducted appropriately and what the effect size was, and then possibly you'd look at any statistical analysis.

Awesome. Have you got anything else you wanted to add?

Only, I suppose, that for GPs, reading research studies is really difficult. The most downloaded paper from the Public Library of Science, you might know, is one called ‘Why Most Published Research Findings Are False’, and that is a very disturbing article, especially for those of us who are researchers. And it's really hard for GPs, who are not specifically interested in one area of medicine, one organ, one kind of patient, to understand what research is telling us. And we do need the best possible ways to get that information, which some of our academic colleagues need to assemble for us.

Yeah. Then it comes down to really relying on the guidance of some specialist colleagues, or guidelines in specific areas, and systematic reviews or summaries. And then you hope that the people that wrote those and combined all of those actually went through the same considerations that we're talking about today.

I mean, for those of us who want to play along at home... Most of us have got some particular clinical interests, and we might want to read original research studies, so having some approach to doing that, and to working out what we think about a particular study and whether it's likely to be able to tell us something useful, is still worth knowing how to do. And we should all have some basic understanding of that.

Absolutely, 100 percent agree with you there. Thank you for joining us on today's episode.

Yeah. Thank you very much.

[Music]

The views of the guests and the hosts of this episode are their own, and do not reflect Australian Prescriber or NPS MedicineWise. My name is Ashlea Broomfield and thank you for joining us.