Approach to Demystify Black Box AI Not Ready for Prime Time

Research suggests that compared with human clinicians, image heat maps underperform and require further refinement

Ekaterina Pesheva October 10, 2022

Image: A human physician looking at AI on a tablet (Ignatiev/Getty Images)

Artificial intelligence models that interpret medical images hold the promise to enhance clinicians’ ability to make accurate and timely diagnoses, while also lessening workload by allowing busy physicians to focus on critical cases and delegate rote tasks to AI.

But AI models that lack transparency about how and why a diagnosis is made can be problematic. This opaque reasoning, also known as “black box” AI, can diminish clinician trust in the reliability of the AI tool and thus discourage its use. This lack of transparency could also mislead clinicians into overtrusting the tool’s interpretation.

In the realm of medical imaging, one way to create more understandable AI models and to demystify AI decision-making has been saliency assessments — an approach that uses heat maps to pinpoint whether the tool is correctly focusing only on the relevant pieces of a given image or homing in on irrelevant parts of it.

Heat maps work by highlighting areas on an image that influenced the AI model’s interpretation. This could help human physicians see whether the AI model focuses on the same areas as they do or is mistakenly focusing on irrelevant spots on an image.
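To make the idea concrete, here is a minimal sketch of one simple way to build such a heat map: hide one pixel at a time and record how much the model's score drops. This occlusion approach is used here only because it is easy to follow; the article does not name the seven methods the study evaluated. The `toy_score` function is a hypothetical stand-in for a real classifier, and all names are illustrative.

```python
from typing import Callable, List

Image = List[List[float]]


def occlusion_saliency(image: Image,
                       score: Callable[[Image], float],
                       fill: float = 0.0) -> Image:
    """Heat map of how much the score drops when each pixel is hidden."""
    base = score(image)
    height, width = len(image), len(image[0])
    heat = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            masked = [row[:] for row in image]  # copy so the input is untouched
            masked[r][c] = fill                 # hide a single pixel
            heat[r][c] = base - score(masked)   # larger drop = more influence
    return heat


def toy_score(image: Image) -> float:
    """Hypothetical 'model' that only looks at the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2))


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_saliency(image, toy_score)
for row in heat:
    print(row)
```

In this toy case, only the four top-left pixels receive a nonzero value, which is the kind of signal a clinician would compare against the region they consider relevant.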

But a new study, published in Nature Machine Intelligence on Oct. 10, shows that for all their promise, saliency heat maps may not yet be ready for prime time.

The analysis, led by Harvard Medical School investigator Pranav Rajpurkar, Matthew Lungren of Stanford, and Adriel Saporta of New York University, quantified the validity of seven widely used saliency methods to determine how reliably and accurately they could identify pathologies associated with 10 conditions commonly diagnosed on X-ray, such as lung lesions, pleural effusion, edema, or enlarged heart structures. To assess the tools, the researchers compared their performance against human expert judgment.
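One common way to compare a heat map with an expert's annotation is to measure how much the two regions overlap. The sketch below uses intersection over union (IoU) on a thresholded heat map; this metric, the 0.5 threshold, and the example values are assumptions chosen for illustration, since the article does not say which measures the researchers used.

```python
from typing import List

Mask = List[List[bool]]


def binarize(heat: List[List[float]], threshold: float) -> Mask:
    """Mark pixels whose saliency is at or above the threshold."""
    return [[value >= threshold for value in row] for row in heat]


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two same-sized boolean masks."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Assumed convention: two empty masks count as perfect agreement.
    return intersection / union if union else 1.0


# Illustrative values only, not data from the study.
heat_map = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.0],
    [0.1, 0.0, 0.0],
]
expert_mask = [
    [True, True, False],
    [False, True, False],
    [False, False, False],
]

overlap = iou(binarize(heat_map, 0.5), expert_mask)
print(overlap)
```

A score of 1.0 would mean the highlighted region matches the expert's outline exactly, and 0.0 would mean no overlap at all.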
