Evaluating the Performance of Large Language Models in Anatomy Education: Advancing Anatomy Learning with ChatGPT-4o



DOI:

https://doi.org/10.58600/eurjther2611

Keywords:

Anatomy education, large language models, ChatGPT

Abstract

Objective: Large language models (LLMs) such as ChatGPT, Gemini, and Copilot have garnered significant attention across various domains, including education. Their use is becoming increasingly prevalent in medical education, where rapid access to accurate and up-to-date information is imperative. This study aimed to assess the validity, accuracy, and comprehensiveness of lecture notes generated by LLMs for medical school anatomy education.

Methods: The study evaluated the performance of four large language models—ChatGPT-4o, ChatGPT-4o-Mini, Gemini, and Copilot—in generating anatomy lecture notes for medical students. In the first phase, the lecture notes produced by these models using identical prompts were compared to a widely used anatomy textbook through thematic analysis to assess relevance and alignment with standard educational materials. In the second phase, the generated lecture notes were evaluated using content validity index (CVI) analysis. The threshold values for S-CVI/Ave and S-CVI/UA were set at 0.90 and 0.80, respectively, to determine the acceptability of the content.
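The two scale-level indices used in the second phase can be made concrete with a short sketch. This is a minimal illustration of the standard CVI definitions, assuming a hypothetical experts-by-items matrix of dichotomized relevance ratings (1 = relevant, 0 = not relevant); it is not the authors' analysis code.

```python
def item_cvi(ratings):
    # I-CVI: proportion of experts who rated the item as relevant
    return sum(ratings) / len(ratings)

def s_cvi_ave(matrix):
    # S-CVI/Ave: mean of the item-level CVIs across all items
    items = list(zip(*matrix))  # transpose experts x items -> items x experts
    return sum(item_cvi(item) for item in items) / len(items)

def s_cvi_ua(matrix):
    # S-CVI/UA: proportion of items rated relevant by ALL experts
    items = list(zip(*matrix))
    return sum(1 for item in items if all(item)) / len(items)

# Hypothetical ratings: 3 experts (rows) x 4 content items (columns)
ratings = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
]

ave = s_cvi_ave(ratings)  # (1 + 1 + 2/3 + 2/3) / 4 = 0.833
ua = s_cvi_ua(ratings)    # 2 of 4 items unanimously relevant = 0.500
print(f"S-CVI/Ave = {ave:.3f}, S-CVI/UA = {ua:.3f}")
```

Against the study's thresholds (S-CVI/Ave ≥ 0.90, S-CVI/UA ≥ 0.80), this hypothetical set of lecture-note items would fail on both indices.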

Results: ChatGPT-4o demonstrated the highest performance, achieving a theme success rate of 94.6% and a subtheme success rate of 76.2%. ChatGPT-4o-Mini followed, with theme and subtheme success rates of 89.2% and 62.3%, respectively. Copilot achieved moderate results, with a theme success rate of 91.8% and a subtheme success rate of 54.9%, while Gemini showed the lowest performance, with a theme success rate of 86.4% and a subtheme success rate of 52.3%. In the content validity index (CVI) analysis, ChatGPT-4o again outperformed the other models, exceeding both thresholds with an S-CVI/Ave of 0.943 and an S-CVI/UA of 0.857. ChatGPT-4o-Mini fell short of both thresholds, with an S-CVI/Ave of 0.800 and an S-CVI/UA of 0.714. Copilot and Gemini exhibited markedly lower CVI results: Copilot achieved an S-CVI/Ave of 0.486 and an S-CVI/UA of 0.286, while Gemini obtained the lowest scores, with an S-CVI/Ave of 0.286 and an S-CVI/UA of 0.143.

Conclusion: This study assessed several LLMs using two distinct analysis methods, revealing that ChatGPT-4o performed best in both the thematic analysis and the CVI evaluation. These results suggest that anatomy educators and medical students could benefit from adopting ChatGPT-4o as a supplementary tool for generating anatomy lecture notes. Conversely, ChatGPT-4o-Mini, Gemini, and Copilot require further improvement to meet the standards necessary for reliable use in medical education.

Published

2025-02-28

How to Cite

Ok, F., Karip, B., & Temizsoy Korkmaz, F. (2025). Evaluating the Performance of Large Language Models in Anatomy Education Advancing Anatomy Learning with ChatGPT-4o. European Journal of Therapeutics, 31(1), 35–43. https://doi.org/10.58600/eurjther2611

Section

Original Articles

Categories