Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models

“Captain James Kirk frequently used voice commands to interact with the computer on the Starship Enterprise in the original Star Trek series. He would also ask the computer for information such as ship’s status updates, sensor readings, and calculations for various scenarios. For example, he might ask, ‘Computer, what is the status of the ship’s engines?’ or ‘Computer, what is the estimated time of arrival at our destination?’”

 ChatGPT, when asked, “What was a typical question that Captain James Kirk asked the computer in the show Star Trek?”

Given the high level of interest in the large language model (LLM) ChatGPT and the diverse array of natural language tasks it can perform, even in biomedicine, this research article from PLOS Digital Health is noteworthy.

ChatGPT represents a new type of AI algorithm: a transformer-based large language model trained to predict the likelihood of a given sequence of words. It is built on an OpenAI foundation model with 175 billion parameters, trained on a large corpus of text from the Internet and refined with both supervised and reinforcement learning. ChatGPT is particularly good at handling long-range dependencies among words (or “tokens”) and, as a result, at generating coherent responses in situ (without search).
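To make "predicting the likelihood of a given sequence of words" concrete, here is a minimal sketch using a toy bigram model. This is not how ChatGPT works internally: a bigram model conditions on only the single previous token, whereas a transformer attends over long-range context. The corpus, the `<s>` start marker, and the function names are all hypothetical and chosen only for illustration. What the sketch does share with an LLM is the chain rule: the likelihood of a sequence is the product of each token's conditional probability given what came before.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real LLM is trained on a vastly larger text collection.
corpus = [
    "computer what is the status of the engines",
    "computer what is the estimated time of arrival",
    "computer what is the status of the shields",
]

def train_bigram(sentences):
    """Count how often each token follows each preceding token."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
        for prev, curr in zip(tokens, tokens[1:]):
            pair_counts[(prev, curr)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts

def sequence_log_prob(sentence, pair_counts, context_counts):
    """Sum log P(token | previous token) over the sentence (chain rule)."""
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    for prev, curr in zip(tokens, tokens[1:]):
        count = pair_counts[(prev, curr)]
        if count == 0:
            # Unseen transition: this unsmoothed toy model assigns it zero probability.
            return float("-inf")
        total += math.log(count / context_counts[prev])
    return total

if __name__ == "__main__":
    pair_counts, context_counts = train_bigram(corpus)
    for query in ["computer what is the status of the engines",
                  "engines the computer"]:
        print(query, "->", sequence_log_prob(query, pair_counts, context_counts))
```

A word order seen in the corpus receives a finite log-probability, while a scrambled one receives negative infinity. Real language models avoid such hard zeros and score far longer contexts, but the quantity being estimated is the same.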

The authors, who work in medical education in the United States, collaborated on this project. They evaluated ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of Step 1, Step 2CK, and Step 3. This exam is known for its linguistic and conceptual richness and its use of multimodal clinical data.

ChatGPT performed at or near the passing threshold (60%) for all three exams without any specialized training or reinforcement beyond what is already part of the language model. This accomplishment is a notable milestone in the use of AI for clinical medicine. ChatGPT was also impressive in its ability to express comprehensible reasoning and valid clinical insights. However, just as in prior studies of AI tools in clinical medicine, ChatGPT is placed in a “human vs machine” position instead of a more positive “human + machine” situation. Perhaps a more elegant and thoughtful future study would measure how much ChatGPT augments the USMLE scores of human test takers (at all levels, from medical students to seasoned clinicians) by comparing their performance with and without it.

This level of performance gives the ChatGPT language model considerable potential in medical education and clinical training, as well as in real-time clinical practice. These language models, much like the convolutional neural networks (CNNs) that launched the use of artificial intelligence in medical imaging, will make a sizable impact on clinical decision-making in the very near future. Trust and explainability, as well as ethical and medicolegal issues, will need to be reconciled for these large language models, just as for the other AI tools in healthcare thus far. ChatGPT and other, even more sophisticated, biomedicine-focused LLMs (such as BioGPT and PubMedGPT) are here. Captain Kirk and his famous queries to his computer in Star Trek are no longer science fiction.

Read the full paper here
