Deep learning in medical image interpretation: Common pitfalls

“Medicine is a science of uncertainty and an art of probability.”

-William Osler, founding professor of Johns Hopkins Hospital

The COVID-19 pandemic remains relentless, and interest from the data science community has escalated, bringing many manuscript submissions to the new journal Intelligence-Based Medicine. The largest category of submissions has been the use of deep learning for medical image interpretation, so it is timely and useful to discuss some common issues with papers in this domain.

In addition, I was invited to be a guest panelist at the American College of Cardiology journal club that focused on medical imaging and deep learning. Here are some observations on medical image interpretation with deep learning, based on these activities:

Lack of clinician input. While the data science in medical image studies may be adequate, the clinical relevance, or even the underlying clinical assumptions, are sometimes incorrect or overstated. One example is the claim that a chest CT should be the diagnostic test of choice for COVID-19.

Low data input quality. Many studies use publicly available datasets on the incorrect assumption that their labels are always of excellent quality, when in fact these datasets are just as vulnerable to the many challenges of labeling as any other.

Data augmentation limitations. Data augmentation alters existing medical images to create new training data and thus increases the amount of data. This technique does not, however, necessarily increase the diversity of disease presentations captured in the images.
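To make this point concrete, here is a minimal sketch in Python. The tiny 2x2 "images", the patient IDs, and the helper names `hflip` and `augment` are all hypothetical, chosen only for illustration. A horizontal flip stands in for any augmentation; whether a flip is even appropriate depends on the anatomy being imaged.

```python
def hflip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def augment(dataset):
    """Return the original images plus one flipped copy of each."""
    out = []
    for patient_id, image in dataset:
        out.append((patient_id, image))
        out.append((patient_id, hflip(image)))
    return out

# Hypothetical toy dataset: two patients, one tiny image each.
dataset = [
    ("patient_1", [[0, 1], [2, 3]]),
    ("patient_2", [[4, 5], [6, 7]]),
]
augmented = augment(dataset)

n_images = len(augmented)                      # twice as many images
n_patients = len({pid for pid, _ in augmented})  # same number of patients
print(n_images, n_patients)
```

The image count doubles, but the number of distinct patients, and therefore of distinct disease presentations, stays the same.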

Presence of data leakage. Partly because of an insufficient number of medical images, the test dataset is often coupled to (rather than sequestered from) the training dataset and is therefore not independent of it.
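One common way to keep the test set sequestered is to split by patient rather than by image, so that no patient contributes images to both sets. The sketch below is only an illustration under assumed conditions: the record format `(patient_id, image_id)`, the function name `split_by_patient`, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict

def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_id) records so no patient spans both sets."""
    by_patient = defaultdict(list)
    for patient_id, image_id in records:
        by_patient[patient_id].append(image_id)

    # Shuffle patients, not images, then carve off the test patients.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [(p, i) for p in patients if p not in test_patients
             for i in by_patient[p]]
    test = [(p, i) for p in patients if p in test_patients
            for i in by_patient[p]]
    return train, test

# Hypothetical records: 5 patients with 3 images each.
records = [(f"patient_{p}", f"img_{p}_{k}") for p in range(5) for k in range(3)]
train, test = split_by_patient(records)

# Patients appearing in both sets; should be empty.
overlap = {p for p, _ in train} & {p for p, _ in test}
print(len(train), len(test), overlap)
```

If augmentation is used, applying it only to the training set after this split avoids placing altered copies of a test image in the training data.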

Transfer learning and the dataset. Transfer learning for medical image interpretation often goes from ImageNet to a medical image dataset, whereas this methodology is usually more efficient when going from one well-validated medical image dataset to another.

Size of neural network matters. Some projects favor very deep neural networks despite having relatively few images, and this mismatch between the volume of data and the capacity of the neural network can create problems such as overfitting.

Generalizability is key for impact. A medical image deep learning model can perform very well on the training and test datasets but still fail to generalize to another hospital, even one in the same region.
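A simple way to surface this problem is to report performance separately for each site instead of as one pooled number. The following sketch assumes predictions are available as `(site, true_label, predicted_label)` tuples. The site names, labels, and the helper `accuracy_by_site` are made up for illustration and are not results from any study.

```python
from collections import defaultdict

def accuracy_by_site(rows):
    """Compute accuracy per site from (site, true_label, predicted_label) rows."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for site, y_true, y_pred in rows:
        total[site] += 1
        correct[site] += int(y_true == y_pred)
    return {site: correct[site] / total[site] for site in total}

# Hypothetical predictions: hospital_A is the development site,
# hospital_B is an external site the model never saw.
rows = [
    ("hospital_A", 1, 1), ("hospital_A", 0, 0),
    ("hospital_A", 1, 1), ("hospital_A", 0, 0),
    ("hospital_B", 1, 0), ("hospital_B", 0, 0),
    ("hospital_B", 1, 1), ("hospital_B", 0, 1),
]
per_site = accuracy_by_site(rows)
pooled = sum(int(t == p) for _, t, p in rows) / len(rows)
print(per_site, pooled)
```

In this toy case the pooled accuracy hides the gap between the two sites, which is why a per-site breakdown on external data is more informative than a single aggregate figure.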
