Benchmarking LLM chatbots’ oncological knowledge with the Turkish Society of Medical Oncology’s annual board examination questions

Abstract

Background: Large language models (LLMs) have shown promise in various medical applications, including clinical decision-making and education. In oncology, the increasing complexity of patient care and the vast volume of medical literature require efficient tools to assist practitioners. However, the use of LLMs in oncology education and knowledge assessment remains underexplored. This study aims to evaluate and compare the oncological knowledge of four LLMs using standardized board examination questions.

Methods: We assessed the performance of four LLMs, Claude 3.5 Sonnet (Anthropic), ChatGPT 4o (OpenAI), Llama-3 (Meta), and Gemini 1.5 (Google), using the Turkish Society of Medical Oncology’s annual board examination questions from 2016 to 2024. A total of 790 valid multiple-choice questions covering various oncology topics were included. Each model was tested on its ability to answer these questions in Turkish. Performance was analyzed based on the number of correct answers, with statistical comparisons made using chi-square tests and one-way ANOVA.

Results: Claude 3.5 Sonnet outperformed the other models, passing all eight exams with an average score of 77.6%. ChatGPT 4o passed seven out of eight exams, with an average score of 67.8%. Llama-3 and Gemini 1.5 showed lower performance, passing four and three exams, respectively, with average scores below 50%. Significant differences were observed among the models’ performances (F = 17.39, p < 0.001). Claude 3.5 Sonnet and ChatGPT 4o demonstrated higher accuracy across most oncology topics. A decline in performance in recent years, particularly on the 2024 exam, suggests limitations due to outdated training data.

Conclusions: Significant differences in oncological knowledge were observed among the four LLMs, with Claude 3.5 Sonnet and ChatGPT 4o demonstrating superior performance. These findings suggest that advanced LLMs have the potential to serve as valuable tools in oncology education and decision support. However, regular updates and enhancements are necessary to maintain their relevance and accuracy, especially to incorporate the latest medical advancements.
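As an illustration of the statistical comparison described in the Methods, the sketch below shows how per-model correct-answer counts over the 790 questions could be compared with a chi-square test, and per-exam scores with a one-way ANOVA, using SciPy. This is not the authors' code; all counts and per-exam scores are hypothetical placeholders, not the study's data.

# Illustrative sketch only (not the authors' code): the Methods describe
# comparing the models' correct-answer counts with chi-square tests and
# their per-exam scores with a one-way ANOVA. All numbers below are
# hypothetical placeholders, not the study's data.
import numpy as np
from scipy import stats

TOTAL_QUESTIONS = 790  # valid multiple-choice questions, 2016-2024 exams

# Hypothetical correct-answer counts per model over all questions.
correct = {
    "claude-3.5-sonnet": 613,
    "chatgpt-4o": 536,
    "llama-3": 380,
    "gemini-1.5": 355,
}
# 4 x 2 contingency table of correct vs. incorrect answers.
table = np.array([[c, TOTAL_QUESTIONS - c] for c in correct.values()])
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)
print(f"chi-square: chi2={chi2:.2f}, dof={dof}, p={p_chi2:.4g}")

# Hypothetical per-exam scores (%) for each model across eight exams.
per_exam = {
    "claude-3.5-sonnet": [80, 79, 78, 77, 78, 77, 76, 76],
    "chatgpt-4o":        [72, 70, 69, 68, 67, 66, 65, 65],
    "llama-3":           [52, 50, 49, 48, 47, 46, 45, 44],
    "gemini-1.5":        [50, 48, 47, 46, 45, 44, 43, 42],
}
f_stat, p_anova = stats.f_oneway(*per_exam.values())
print(f"one-way ANOVA: F={f_stat:.2f}, p={p_anova:.4g}")
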
Bibliographic Details
Main Authors: Efe Cem Erdat (Department of Medical Oncology, Ankara University Cebeci Hospital); Engin Eren Kavak (Department of Medical Oncology, Ankara Etlik City Training and Research Hospital)
Format: Article
Language: English
Published: BMC, 2025-02-01
Series: BMC Cancer
ISSN: 1471-2407
Subjects: Artificial intelligence; Large language models; Oncology; Clinical decision support systems; Medical education; Board examinations
Online Access: https://doi.org/10.1186/s12885-025-13596-0