The Statistical Trap That Makes Bad Medicine Look Good

A treatment can simultaneously fail in every group of patients it's tested on and succeed overall. This isn't a logical contradiction—it's a real statistical phenomenon that has fooled doctors, researchers, and hospital administrators into believing broken treatments work.

Most people assume that if something is good for men and good for women, it must be good overall. If a drug helps mild cases and helps severe cases, surely the net effect is positive. This intuition feels airtight. You're simply averaging the results, right? But this confidence in straight addition and subtraction is precisely what Simpson's Paradox exploits. The paradox emerges when the groups being compared are unequal in size and differ in baseline outcomes, so that one treatment ends up concentrated in the easy cases and the other in the hard ones. That scenario is common wherever patients are not randomly assigned to treatments.

Here's how it actually works. Imagine a kidney stone treatment tested in a study of 700 patients, with 350 receiving the treatment and 350 receiving the standard alternative. In the small-stone group, the treatment succeeds 87% of the time (234 of 270 patients) while the alternative succeeds 93% of the time (81 of 87). In the large-stone group, the treatment succeeds 69% of the time (55 of 80) while the alternative succeeds 73% of the time (192 of 263). By every rational measure, this treatment is worse for patients with small stones and worse for patients with large stones. But here's the trick: far more patients in the treatment group had small stones (270 versus 80), while far more in the comparison group had large stones (263 versus 87). Small stones are easier to treat no matter what you do. When you collapse the numbers and look at overall success rates, the treatment scores 83% (289 of 350) against 78% (273 of 350) and appears superior.

This isn't hypothetical. These figures come from a 1986 study in the British Medical Journal comparing kidney stone treatments, which has since become a standard textbook example of the paradox. As Scientific American has reported, the phenomenon reveals how statistical aggregation can mask harmful effects hiding in plain sight within subgroups. The treatment actually has a lower success rate for both small and large stones, but because it was preferentially given to patients with naturally better prognoses, the overall numbers looked better.
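
The arithmetic is easy to check by hand, but a few lines of Python make the reversal hard to argue with. This is a minimal sketch, assuming the success counts usually quoted for the 1986 kidney stone comparison; the arm labels "treatment" and "comparison" and the helper names are illustrative, not taken from the study.

```python
# Assumed counts: (successes, patients) per stone size, as commonly quoted
# for the 1986 kidney stone comparison. Labels are illustrative.
counts = {
    "treatment":  {"small": (234, 270), "large": (55, 80)},
    "comparison": {"small": (81, 87),   "large": (192, 263)},
}

def rate(successes, total):
    """Success rate within a single subgroup."""
    return successes / total

def overall(groups):
    """Success rate after collapsing all subgroups of one arm together."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return successes / total

for arm, groups in counts.items():
    for size, (s, n) in groups.items():
        print(f"{arm:10s} {size:5s} {s:3d}/{n:3d} = {rate(s, n):.0%}")
    print(f"{arm:10s} all   = {overall(groups):.0%}")
```

Under these counts the treatment trails in both the small-stone and large-stone rows, yet its collapsed rate comes out ahead of the comparison arm.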

The mechanism is distribution. When a treatment is given disproportionately to healthier patients (who have a better baseline prognosis and would do well under almost any care), or when one subgroup is much larger than another, aggregation creates an optical illusion. Each overall success rate is a weighted average of the subgroup rates, and the weights are the case mix of whoever happened to receive that treatment. The treatment doesn't need to help the majority of patients. It just needs to be distributed in a way that mathematically reverses the direction of effect. This happens because we're not truly comparing apples to apples: we're mixing different group compositions and pretending the mixture represents reality.
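
One way to see that the case mix is doing the work is to hold it fixed. The sketch below is an illustration, not the original study's analysis: it assumes the commonly quoted kidney stone counts, treats each arm's overall rate as a weighted average of its subgroup rates, and then re-weights both arms by the same pooled case mix (a simple form of direct standardization). The names `weighted_rate`, `own_mix`, and `same_mix` are hypothetical.

```python
# Assumed per-subgroup success rates and patient counts for each arm.
rates = {
    "treatment":  {"small": 234 / 270, "large": 55 / 80},
    "comparison": {"small": 81 / 87,   "large": 192 / 263},
}
sizes = {
    "treatment":  {"small": 270, "large": 80},
    "comparison": {"small": 87,  "large": 263},
}

def weighted_rate(group_rates, weights):
    """Average the subgroup rates, weighting each by its share of patients."""
    total = sum(weights.values())
    return sum(group_rates[g] * weights[g] / total for g in group_rates)

# Each arm weighted by its own case mix: this reproduces the raw aggregate.
own_mix = {arm: weighted_rate(rates[arm], sizes[arm]) for arm in rates}

# Both arms weighted by the same pooled case mix (all 700 patients).
pooled = {g: sizes["treatment"][g] + sizes["comparison"][g]
          for g in ("small", "large")}
same_mix = {arm: weighted_rate(rates[arm], pooled) for arm in rates}

print("own case mix: ", {arm: f"{r:.1%}" for arm, r in own_mix.items()})
print("same case mix:", {arm: f"{r:.1%}" for arm, r in same_mix.items()})
```

With each arm's own mix, the treatment looks better. With the shared mix, the ordering flips to match the subgroups: roughly 78% for the treatment against 83% for the comparison. Nothing about the patients' outcomes changed, only the weights.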

Why should this matter to anyone outside a statistics department? Because observational medical studies, policy evaluations, and hiring analyses can all suffer from this exact problem. A company can hire one group at a lower rate than another in every single department while its overall numbers show the opposite. A public health intervention can underperform in every at-risk subpopulation while a press release declares victory. The paradox doesn't require deception, just negligence about which groups receive treatment and how their outcomes are counted. The antidote is simple and unflinching: always disaggregate your data. Look at the subgroups first. Only then, if the effect holds across groups, trust the overall number. The aggregate is often the lie that feels most like the truth.
