Which curriculum components do medical students find most helpful for evaluating AI outputs?

Abstract

Introduction: The risk and opportunity of Large Language Models (LLMs) in medical education both rest in their imitation of human communication. Future doctors working with generative artificial intelligence (AI) need to judge the value of any outputs from LLMs to safely direct the management of patients. We set out to investigate medical students’ ability to evaluate LLM responses to clinical vignettes, identify which prior learning they utilised to scrutinise the LLM answers, and assess their awareness of ‘clinical prompt engineering’.

Methods: Final year medical students were asked in a survey to assess the accuracy of the answers provided by generative pre-trained transformer (GPT) 3.5 in response to ten clinical scenarios, five of which GPT 3.5 had answered incorrectly, and to identify which prior training enabled them to evaluate the GPT 3.5 output. A content analysis was conducted amongst 148 consenting medical students.

Results: The median percentage of students who correctly evaluated the LLM output was 56%. Students reported interactive case-based and pathology teaching using questions to be the most helpful training provided by the medical school for evaluating AI outputs. Only 5% were familiar with the concept of ‘clinical prompt engineering’.

Conclusion: Pathology and interactive case-based teaching using questions were the self-reported best training for medical students to safely interact with the outputs of LLMs. This study can inform the design of medical training for future doctors graduating into AI-enhanced health services.
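The paper's record does not describe how the GPT 3.5 answers to the ten clinical scenarios were generated. A minimal sketch of posing a clinical vignette to GPT 3.5, assuming the OpenAI Python client, a hypothetical vignette, and a simple system message as one illustration of ‘clinical prompt engineering’, might look like this (not the authors' protocol):

    # Illustrative sketch only: the paper does not describe how the GPT 3.5 answers
    # were produced. The vignette, system message, and use of the OpenAI Python
    # client are assumptions for illustration, not the study's method.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    # Hypothetical single-best-answer clinical vignette.
    vignette = (
        "A 68-year-old man has sudden painless loss of vision in his right eye. "
        "Fundoscopy shows a pale retina with a cherry-red spot at the macula. "
        "What is the most likely diagnosis?"
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            # Constraining the model's role is one simple form of
            # 'clinical prompt engineering'.
            {"role": "system", "content": "You are a concise clinical decision aid."},
            {"role": "user", "content": vignette},
        ],
    )

    # Students in the study were asked to judge whether an output like this is accurate.
    print(response.choices[0].message.content)

The sketch only illustrates the kind of prompt-and-output pair that students were asked to evaluate in the survey.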

Bibliographic Details
Main Authors: William J. Waldock, George Lam, Ana Baptista, Risheka Walls, Amir H. Sam
Format: Article
Language: English
Published: BMC 2025-02-01
Series: BMC Medical Education
Subjects: Large language models; Generative artificial intelligence; Undergraduate medical assessment; Case-based learning; Pathology
Online Access: https://doi.org/10.1186/s12909-025-06735-5
_version_ 1823861946083966976
author William J. Waldock
George Lam
Ana Baptista
Risheka Walls
Amir H. Sam
author_facet William J. Waldock
George Lam
Ana Baptista
Risheka Walls
Amir H. Sam
author_sort William J. Waldock
collection DOAJ
description Abstract Introduction The risk and opportunity of Large Language Models (LLMs) in medical education both rest in their imitation of human communication. Future doctors working with generative artificial intelligence (AI) need to judge the value of any outputs from LLMs to safely direct the management of patients. We set out to investigate medical students’ ability to evaluate LLM responses to clinical vignettes, identify which prior learning they utilised to scrutinise the LLM answers, and assess their awareness of ‘clinical prompt engineering’. Methods Final year medical students were asked in a survey to assess the accuracy of the answers provided by generative pre-trained transformer (GPT) 3.5 in response to ten clinical scenarios, five of which GPT 3.5 had answered incorrectly, and to identify which prior training enabled them to evaluate the GPT 3.5 output. A content analysis was conducted amongst 148 consenting medical students. Results The median percentage of students who correctly evaluated the LLM output was 56%. Students reported interactive case-based and pathology teaching using questions to be the most helpful training provided by the medical school for evaluating AI outputs. Only 5% were familiar with the concept of ‘clinical prompt engineering’. Conclusion Pathology and interactive case-based teaching using questions were the self-reported best training for medical students to safely interact with the outputs of LLMs. This study can inform the design of medical training for future doctors graduating into AI-enhanced health services.
format Article
id doaj-art-240d1529bce542479b1cfa430d783e06
institution Kabale University
issn 1472-6920
language English
publishDate 2025-02-01
publisher BMC
record_format Article
series BMC Medical Education
spelling doaj-art-240d1529bce542479b1cfa430d783e06 | 2025-02-09T12:42:29Z | eng | BMC | BMC Medical Education | 1472-6920 | 2025-02-01 | 25117 | 10.1186/s12909-025-06735-5
Which curriculum components do medical students find most helpful for evaluating AI outputs?
William J. Waldock (0); George Lam (1); Ana Baptista (2); Risheka Walls (3); Amir H. Sam (4); all: Imperial College School of Medicine, Imperial College London
Abstract Introduction The risk and opportunity of Large Language Models (LLMs) in medical education both rest in their imitation of human communication. Future doctors working with generative artificial intelligence (AI) need to judge the value of any outputs from LLMs to safely direct the management of patients. We set out to investigate medical students’ ability to evaluate LLM responses to clinical vignettes, identify which prior learning they utilised to scrutinise the LLM answers, and assess their awareness of ‘clinical prompt engineering’. Methods Final year medical students were asked in a survey to assess the accuracy of the answers provided by generative pre-trained transformer (GPT) 3.5 in response to ten clinical scenarios, five of which GPT 3.5 had answered incorrectly, and to identify which prior training enabled them to evaluate the GPT 3.5 output. A content analysis was conducted amongst 148 consenting medical students. Results The median percentage of students who correctly evaluated the LLM output was 56%. Students reported interactive case-based and pathology teaching using questions to be the most helpful training provided by the medical school for evaluating AI outputs. Only 5% were familiar with the concept of ‘clinical prompt engineering’. Conclusion Pathology and interactive case-based teaching using questions were the self-reported best training for medical students to safely interact with the outputs of LLMs. This study can inform the design of medical training for future doctors graduating into AI-enhanced health services.
https://doi.org/10.1186/s12909-025-06735-5
Large language models | Generative artificial intelligence | Undergraduate medical assessment | Case-based learning | Pathology
spellingShingle William J. Waldock
George Lam
Ana Baptista
Risheka Walls
Amir H. Sam
Which curriculum components do medical students find most helpful for evaluating AI outputs?
BMC Medical Education
Large language models
Generative artificial intelligence
Undergraduate medical assessment
Case-based learning
Pathology
title Which curriculum components do medical students find most helpful for evaluating AI outputs?
title_full Which curriculum components do medical students find most helpful for evaluating AI outputs?
title_fullStr Which curriculum components do medical students find most helpful for evaluating AI outputs?
title_full_unstemmed Which curriculum components do medical students find most helpful for evaluating AI outputs?
title_short Which curriculum components do medical students find most helpful for evaluating AI outputs?
title_sort which curriculum components do medical students find most helpful for evaluating ai outputs
topic Large language models
Generative artificial intelligence
Undergraduate medical assessment
Case-based learning
Pathology
url https://doi.org/10.1186/s12909-025-06735-5
work_keys_str_mv AT williamjwaldock whichcurriculumcomponentsdomedicalstudentsfindmosthelpfulforevaluatingaioutputs
AT georgelam whichcurriculumcomponentsdomedicalstudentsfindmosthelpfulforevaluatingaioutputs
AT anabaptista whichcurriculumcomponentsdomedicalstudentsfindmosthelpfulforevaluatingaioutputs
AT rishekawalls whichcurriculumcomponentsdomedicalstudentsfindmosthelpfulforevaluatingaioutputs
AT amirhsam whichcurriculumcomponentsdomedicalstudentsfindmosthelpfulforevaluatingaioutputs