Why bad research makes it into good medical journals—a critique of the Ontario surgical checklist study

This past week, a study in the New England Journal of Medicine called into question the effectiveness of surgical checklists for preventing harm. Atul Gawande, one of the original researchers demonstrating the effectiveness of such checklists and author of a book on the subject, quickly wrote a rebuttal on The Incidental Economist. He writes, “I wish the Ontario study were better,” and I join him in that assessment, but I want to take it a step further.

Gawande first criticizes the study for being underpowered. I had a hard time swallowing this argument given that they looked at over 200,000 cases from 100 hospitals, so I had to do the math. A quick calculation shows that, given the rates of death in their sample, they only had about 40% power [1]. Then I became curious about Gawande’s original study, which achieved better than 80% power with just over 7,500 cases. How is this possible?!?

The most important thing I keep in mind when I think about statistical significance (other than the importance of clinical significance [2]) is that it depends not only on the sample size but also on the baseline prevalence and the magnitude of the difference you are looking for. In Gawande’s original study, the baseline prevalence of death was 1.5%. This is substantially higher than the 0.7% in the Ontario study. As the baseline prevalence of an outcome gets closer to 0%, you have to pump up the sample size to detect the same relative difference.

So, Gawande’s study achieved adequate power because its baseline rate was higher and the difference it found was bigger. The Ontario study would have needed a little over twice as many cases to achieve 80% power.
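For the curious, here is roughly what that back-of-the-envelope calculation looks like in Python with statsmodels. The post-checklist death rates plugged in below are my own illustrative assumptions (this post only cites the 1.5% and 0.7% baselines), so treat the printed numbers as ballpark figures rather than either study’s published results.

```python
# Rough power comparison for two two-proportion tests, using statsmodels.
# The "after" rates are illustrative assumptions, not figures from this post.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()

# Ontario-style scenario: ~0.7% baseline mortality, ~100,000 cases per period,
# with a small assumed absolute drop after checklist implementation.
h_ontario = proportion_effectsize(0.0071, 0.0065)  # Cohen's h for two proportions
power_ontario = analysis.power(effect_size=h_ontario, nobs1=100_000, alpha=0.05)

# Gawande-style scenario: ~1.5% baseline mortality, ~3,750 cases per period,
# with a much larger assumed relative drop.
h_gawande = proportion_effectsize(0.015, 0.008)
power_gawande = analysis.power(effect_size=h_gawande, nobs1=3_750, alpha=0.05)

# Cases per period an Ontario-style comparison would need to reach 80% power.
n_for_80 = analysis.solve_power(effect_size=h_ontario, power=0.80, alpha=0.05)

print(f"Ontario-style power:                {power_ontario:.0%}")
print(f"Gawande-style power:                {power_gawande:.0%}")
print(f"Ontario-style cases/period for 80%: {n_for_80:,.0f}")
```

The exact outputs move around with the assumed rates and with a one- versus two-sided test, but the underlying point stands: a rarer outcome and a smaller effect demand a dramatically larger sample.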

This raises an important question: why didn’t the Ontario study look at more cases?

The number of cases in a study is usually dictated by the practicalities of data collection: how many people you can afford to hire and how much time you realistically have to run the study. Studies that use existing databases, however, are largely free of those constraints. Writing the queries to extract the data is often tricky, but once you have set up your extraction methodology, it simply dumps the data into your study database. You can extend or contract the data-collection period just by changing the parameters of your query, and modern computing power places few limits on the size of the resulting dataset or the statistical methods you can apply to it. Simply put, the Ontario study (which relied on ‘administrative health data,’ read: ‘existing data’) could easily have doubled the number of cases.
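To make that concrete, here is a toy sketch of what a parameterized extraction could look like. Everything in it is hypothetical (the table, columns, and dates are invented and have nothing to do with the actual Ontario administrative databases); the point is simply that the comparison window is a single query parameter.

```python
# Toy sketch: pulling surgical cases for a configurable window around each
# hospital's checklist adoption date. Table and column names are hypothetical.
from datetime import date, timedelta

def case_window_query(adoption_date: date, window_months: int = 3):
    """Return a parameterized SQL query and its parameters for one hospital.

    Widening the before/after comparison window is a one-argument change:
    window_months=3 mirrors the Ontario design; 6, 12, or 18 just pull more rows.
    """
    window = timedelta(days=30 * window_months)  # coarse month approximation
    start = adoption_date - window
    end = adoption_date + window
    sql = """
        SELECT case_id, hospital_id, surgery_date, died_within_30_days
        FROM surgical_cases  -- hypothetical table
        WHERE surgery_date BETWEEN ? AND ?
    """
    return sql, (start.isoformat(), end.isoformat())

# The same query at three different window lengths.
for months in (3, 6, 12):
    _, params = case_window_query(date(2010, 6, 1), window_months=months)
    print(f"{months:>2}-month window -> {params}")
```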

Exactly how did they define their study group? As Gawande points out in his critique, the Ontario study relied on this bizarre 3-month window before and after checklist implementation at individual hospitals. Why 3 months? Why not 6 or 12 or 18? They even write in their methods:

We conducted sensitivity analyses using different periods for comparison. [3]

They never give the results of these sensitivity analyses, nor do they provide a sound justification for the choice of a 3-month period. Three months not only keeps their power low; it also fails to account for secular trends. Maybe something like influenza was particularly bad in the post-checklist period, leading to more deaths despite effective checklist use. Maybe a new surgical technique or tool was introduced, like the da Vinci robot, or a wave of new, inexperienced surgeons was hired, either of which could have increased mortality. In discussing their limitations, they address this:

Since surgical outcomes tend to improve over time, it is highly unlikely that confounding due to time-dependent factors prevented us from identifying a significant improvement after implementation of a surgical checklist.

I will leave it to you to decide whether that is an adequate explanation. I’m not buying it.
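For what it’s worth, reporting the sensitivity analysis they allude to would have been straightforward: rerun the same before/after comparison at several window lengths and show the reader how the result moves. Here is a sketch of the mechanics on synthetic data (the case counts and death rates below are invented purely for illustration).

```python
# Sketch of the sensitivity analysis the paper mentions but never reports:
# repeat the before/after mortality comparison at several window lengths.
# All data here are synthetic; only the mechanics are the point.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
adoption_day = 0  # day 0 = checklist adoption at a given hospital

# Fake records: surgery day relative to adoption, and 30-day death indicator.
days = rng.integers(-550, 550, size=1_500_000)  # roughly 18 months on each side
true_rate = np.where(days < adoption_day, 0.0071, 0.0065)  # assumed rates
died = rng.random(len(days)) < true_rate

for months in (3, 6, 12, 18):
    window = 30 * months
    before = (days >= -window) & (days < adoption_day)
    after = (days >= adoption_day) & (days < window)
    deaths = np.array([died[before].sum(), died[after].sum()])
    cases = np.array([before.sum(), after.sum()])
    _, p_value = proportions_ztest(deaths, cases)
    print(f"{months:>2}-month window: "
          f"before {deaths[0] / cases[0]:.3%}, after {deaths[1] / cases[1]:.3%}, "
          f"p = {p_value:.2f}")
```

A table like that, built from the real data, would have let readers judge for themselves whether the 3-month result was robust or an artifact of the window.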

Gawande concludes that this study reflects a failure in the implementation of checklists, rather than a failure of the checklists themselves. I’m inclined to agree.

Ultimately, I don’t wonder why this study was published; bad studies are published all the time (hence the work of John Ioannidis). I wonder why this study was published in the New England Journal of Medicine. NEJM is supposed to be the gold standard for academic medical research: if they print it, you should be able to trust the results and conclusions, because their editors and peer reviewers are supposed to be the best in the world. The Ontario study seems to fall far below the standard I expect from NEJM.

I think their decision to accept the paper hinged on the fact that this was a large study that showed a negative finding on a subject that has been particularly hot over the past few years [4]. Nobody seemed to care that this was not a particularly well-conducted study; this is the sadness that plagues the medical research community. Be a critical reader.


  1. Remember, we conventionally aim for a power of 80% (or better).  ↩

  2. Clinical significance refers to the importance of a finding in terms of its impact on something clinically meaningful. To use data from the Ontario study as an example: they show a statistically significant drop in the length of hospital stay from 5.11 days to 5.07 days. Despite this finding’s statistical significance, who cares?! You’re still in the hospital for roughly 5 days.  ↩

  3. I am taking ‘sensitivity analysis’ in this case to mean that they actually looked at various time periods (maybe 6, 12, or 18 months) to see how their results changed. Usually when people do this, they give some indication of the results of their sensitivity analyses and explain why they decided to stick with the original plan.  ↩

  4. Yes, checklists are hot. I mean, Atul Gawande wrote a best-selling book about them. Granted, he’s such a great writer that he could spend 300 pages expounding upon why the sky is blue and it would sell.  ↩