Artificial Intelligence-Based Chatbots’ Ability to Interpret Mammography Images: A Comparison of ChatGPT-4o and Claude 3.5


Authors

Karahan, B. N., Emekli, E., & Altın, M. A.

DOI:

https://doi.org/10.58600/eurjther2599

Keywords:

Artificial Intelligence, Chatbots, ChatGPT-4o, Claude 3.5, Mammography, BI-RADS Classification, Breast Parenchymal Type, Radiology

Abstract

Objectives: This study compares the ability of two artificial intelligence-based chatbots, ChatGPT-4o and Claude 3.5, to interpret mammography images, focusing on their accuracy and consistency in BI-RADS classification and breast parenchymal type assessment. It also explores the potential of these technologies to reduce radiologists’ workload and identifies their limitations in medical image analysis.

Methods: A total of 53 mammography images obtained between January and July 2024 were analyzed for BI-RADS classification and breast parenchymal type. The same anonymized images were submitted to both chatbots with identical prompts, and ChatGPT-4o was evaluated in two separate rounds to assess consistency.
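The abstract does not specify how the images were submitted, so the following is only a minimal sketch of what such a protocol could look like through the public OpenAI and Anthropic Python SDKs; the prompt wording, model snapshot name, and file handling are assumptions for illustration, not the authors’ actual setup.

```python
# Hypothetical sketch of submitting one anonymized mammogram to both
# chatbots with an identical prompt. Prompt text, model snapshot, and
# file names are assumptions, not details taken from the study.
import base64

from openai import OpenAI          # pip install openai
from anthropic import Anthropic    # pip install anthropic

PROMPT = (  # assumed wording of the "identical prompt"
    "You are reviewing an anonymized mammography image. Report the "
    "BI-RADS category (0-6) and the ACR breast parenchymal "
    "composition type (a-d)."
)

def encode_image(path: str) -> str:
    """Read an anonymized mammogram and return it as base64 text."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_chatgpt(image_b64: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def ask_claude(image_b64: str) -> str:
    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed Claude 3.5 snapshot
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png",
                            "data": image_b64}},
                {"type": "text", "text": PROMPT},
            ],
        }],
    )
    return resp.content[0].text

if __name__ == "__main__":
    b64 = encode_image("mammogram_001.png")  # hypothetical file name
    print(ask_chatgpt(b64))
    print(ask_claude(b64))
```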

Results: Accuracy in BI-RADS classification ranged from 18.87% to 26.42% across ChatGPT-4o’s two evaluations, compared with 18.7% for Claude 3.5. When BI-RADS categories were grouped as benign (BI-RADS 1 and 2) or malignant (BI-RADS 4 and 5), combined accuracy was 57.5% for ChatGPT-4o in the initial evaluation and 55% in the second, versus 47.5% for Claude 3.5. Accuracy for breast parenchymal type was 30.19% and 22.64% across ChatGPT-4o’s two evaluations and 26.42% for Claude 3.5.
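For clarity, the two accuracy measures reported here, exact BI-RADS agreement and the coarser benign/malignant grouping, can be computed as in the sketch below; the labels are invented for illustration and are not the study’s data.

```python
# Sketch of the two accuracy measures above: exact BI-RADS category
# agreement and accuracy after collapsing categories into benign
# (BI-RADS 1-2) vs. malignant (BI-RADS 4-5). Labels are invented.
predicted    = [2, 4, 1, 5, 2, 4, 1, 2]  # hypothetical chatbot outputs
ground_truth = [2, 5, 2, 5, 1, 4, 1, 4]  # hypothetical radiologist labels

def exact_accuracy(pred: list[int], truth: list[int]) -> float:
    """Fraction of images with the exact BI-RADS category correct."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def grouped_accuracy(pred: list[int], truth: list[int]) -> float:
    """Accuracy after grouping categories into benign vs. malignant.

    Only images whose ground truth falls in BI-RADS 1-2 or 4-5 are
    scored, mirroring the benign/malignant grouping in the Results.
    """
    def group(c: int) -> str | None:
        if c in (1, 2):
            return "benign"
        if c in (4, 5):
            return "malignant"
        return None  # BI-RADS 0, 3, 6 fall outside the grouping

    pairs = [(group(p), group(t)) for p, t in zip(pred, truth)
             if group(t) is not None]
    return sum(p == t for p, t in pairs) / len(pairs)

print(f"exact BI-RADS accuracy:    {exact_accuracy(predicted, ground_truth):.2%}")
print(f"benign/malignant accuracy: {grouped_accuracy(predicted, ground_truth):.2%}")
```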

Conclusions: Both chatbots demonstrated limited accuracy and reliability in interpreting mammography images. These findings highlight the need for further optimization, larger datasets, and more advanced training to improve their performance in medical image analysis.

Published

2025-02-28

How to Cite

Karahan, B. N., Emekli, E., & Altın, M. A. (2025). Artificial Intelligence-Based Chatbots’ Ability to Interpret Mammography Images: A Comparison of ChatGPT-4o and Claude 3.5. European Journal of Therapeutics, 31(1), 28–34. https://doi.org/10.58600/eurjther2599

Section

Original Articles