Classical statistical inference and its discontents

“Classical” statistical inference in medicine is usually synonymous with frequentist inference, whose central element is null hypothesis significance testing (NHST). Although this was not its original intent, NHST is in practice used to weigh the evidence for or against a hypothesis, a consequence of conflating Fisher’s approach with that of Neyman and Pearson.

NHST has recently become very controversial, owing to the replication crisis in medicine and the phenomenon of p-hacking. In short, the medical literature suffers from publication bias (negative studies are confined to the bottom drawer of the investigator’s desk) and unreproducible results, driven by small, underpowered studies and multiple testing. The p-value (a measure of surprise) routinely overstates the strength of evidence against the null, as has been pointed out elsewhere, and recent proposals to move the significance threshold from 0.05 to 0.005, or to some other arbitrary number, fail to address the core problem.
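
To see how multiple testing alone manufactures “significant” findings, here is a minimal simulation sketch in Python; the sample sizes, the 20 endpoints, and the use of simple t-tests are illustrative assumptions, not taken from any particular study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_sims = 5_000       # simulated "studies"
n_tests = 20         # endpoints tested per study, all with no true effect
n_per_group = 30     # patients per arm (assumed)
alpha = 0.05

at_least_one_hit = 0
for _ in range(n_sims):
    a = rng.normal(size=(n_tests, n_per_group))   # control arm: pure noise
    b = rng.normal(size=(n_tests, n_per_group))   # treatment arm: pure noise
    _, p = stats.ttest_ind(a, b, axis=1)          # one t-test per endpoint
    if (p < alpha).any():                         # any "significant" endpoint?
        at_least_one_hit += 1

print(f"Family-wise false-positive rate: {at_least_one_hit / n_sims:.2f}")
# Roughly 1 - 0.95**20, i.e. ~0.64, far above the nominal 0.05
```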

In a more philosophical sense, the null hypothesis can never be exactly true: there is always some difference between groups, however small. It follows that a large enough sample will always reject H_0 at any chosen significance threshold, regardless of whether the effect matters clinically.
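
A rough power calculation illustrates the point. The effect size of 0.02 standard deviations below is an assumed, clinically negligible value, and the normal-approximation power formula is a standard textbook approximation rather than anything specific to the sources cited here:

```python
import numpy as np
from scipy import stats

alpha = 0.05
d = 0.02                             # assumed, clinically negligible effect size
z_crit = stats.norm.ppf(1 - alpha / 2)

for n in (100, 10_000, 100_000, 1_000_000):
    # Approximate power of a two-sample test for standardized effect d
    # with n subjects per arm (normal approximation).
    power = stats.norm.sf(z_crit - d * np.sqrt(n / 2))
    print(f"n per arm = {n:>9,d}   power to reject H0 ~= {power:.2f}")

# Power approaches 1 as n grows, so H0 is eventually rejected for any
# nonzero effect, however trivial.
```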

Frequentist attempts to address the NHST controversy include reporting the confidence interval (CI), whose width measures the precision of the effect estimate, along with reliance on meta-analyses, replication, and power analyses. Interpreting the CI as a range of plausible values for the true effect is helpful, even if strictly incorrect (Baguley, p. 370).
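
As a sketch of the CI-as-precision idea, the snippet below simulates two-arm data (the true difference of 0.5 SD and the sample sizes are made up) and shows the 95% CI narrowing roughly as 1/sqrt(n):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_diff = 0.5                     # assumed true difference in means (SD units)

for n in (20, 200, 2_000):
    a = rng.normal(0.0, 1.0, size=n)            # control arm
    b = rng.normal(true_diff, 1.0, size=n)      # treatment arm
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    t_crit = stats.t.ppf(0.975, df=2 * n - 2)
    lo, hi = diff - t_crit * se, diff + t_crit * se
    print(f"n per arm = {n:>5d}   diff = {diff:+.2f}   "
          f"95% CI = ({lo:+.2f}, {hi:+.2f})   width = {hi - lo:.2f}")

# The interval narrows roughly as 1/sqrt(n): its width tracks precision.
```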

For the non-statistician, the 95% CI appears to be the safest frequentist tool for evaluating a reported effect and gauging its reproducibility (the interval contains the mean of a replicated experiment about 83-90% of the time).
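
The 83-90% figure can be checked by simulation. The sketch below assumes normally distributed data and 50 subjects per experiment (an arbitrary choice) and asks how often the original 95% CI captures the replication's sample mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n = 50              # subjects per experiment (assumed)
n_sims = 20_000
captured = 0

for _ in range(n_sims):
    original = rng.normal(size=n)
    replica = rng.normal(size=n)
    # 95% CI around the original sample mean
    se = original.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    lo, hi = original.mean() - t_crit * se, original.mean() + t_crit * se
    # Does the interval capture the *replication's* mean?
    if lo <= replica.mean() <= hi:
        captured += 1

print(f"Replication mean captured: {captured / n_sims:.2%}")
# Roughly 83-85%, not 95%: the CI's coverage guarantee is for the true
# mean, not for a future sample mean.
```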

Sources:

  • Baguley: Serious Stats
  • Reinhart: Statistics Done Wrong
