Assessing GPT-4 multimodal performance in radiological image analysis

Cited by: 5
Authors
Brin, Dana [1 ,2 ]
Sorin, Vera [1 ,2 ,3 ]
Barash, Yiftach [1 ,2 ,3 ]
Konen, Eli [1 ,2 ]
Glicksberg, Benjamin S. [4 ]
Nadkarni, Girish N. [5 ,6 ]
Klang, Eyal [1 ,2 ,3 ,5 ,6 ]
Affiliations
[1] Chaim Sheba Med Ctr, Dept Diagnost Imaging, Tel Hashomer, Israel
[2] Tel Aviv Univ, Fac Med, Tel Aviv, Israel
[3] Chaim Sheba Med Ctr, DeepVis Lab, Tel Hashomer, Israel
[4] Icahn Sch Med Mt Sinai, Hasso Plattner Inst Digital Hlth, New York, NY USA
[5] Icahn Sch Med Mt Sinai, Div Data Driven & Digital Med D3M, New York, NY USA
[6] Icahn Sch Med Mt Sinai, Charles Bronfman Inst Personalized Med, New York, NY USA
Keywords
Artificial intelligence; Diagnostic imaging; Radiology; Ultrasonography; Computed tomography (x-ray);
DOI
10.1007/s00330-024-11035-5
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Discipline codes
1002; 100207; 1009;
Abstract
Objectives: This study aims to assess the performance of a multimodal artificial intelligence (AI) model capable of analyzing both images and textual data (GPT-4V) in interpreting radiological images. It covers a range of modalities, anatomical regions, and pathologies to explore the potential of zero-shot generative AI to enhance diagnostic processes in radiology.

Methods: We analyzed 230 anonymized emergency room diagnostic images, consecutively collected over 1 week, using GPT-4V. Modalities included ultrasound (US), computed tomography (CT), and X-ray. The interpretations provided by GPT-4V were compared with those of senior radiologists to evaluate its accuracy in recognizing the imaging modality, anatomical region, and pathology present in each image.

Results: GPT-4V identified the imaging modality correctly in 100% of cases (221/221), the anatomical region in 87.1% (189/217), and the pathology in 35.2% (76/216). However, performance varied significantly across modalities: anatomical region identification accuracy ranged from 60.9% (39/64) in US images to 97.0% (98/101) in CT and 100% (52/52) in X-ray images (p < 0.001). Similarly, pathology identification ranged from 9.1% (6/66) in US images to 36.4% (36/99) in CT and 66.7% (34/51) in X-ray images (p < 0.001). These variations indicate inconsistencies in GPT-4V's ability to interpret radiological images accurately.

Conclusion: While the integration of AI in radiology, exemplified by multimodal GPT-4, offers promising avenues for diagnostic enhancement, the current capabilities of GPT-4V are not yet reliable for interpreting radiological images. This study underscores the necessity for ongoing development to achieve dependable performance in radiology diagnostics.
Clinical relevance statement Although GPT-4V shows promise in radiological image interpretation, its high diagnostic hallucination rate (> 40%) indicates it cannot be trusted for clinical use as a standalone tool. Improvements are necessary to enhance its reliability and ensure patient safety. Key Points...
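The between-modality comparisons above (p < 0.001) are consistent with a chi-square test of independence on the per-modality correct/incorrect counts. As a minimal sketch (not the authors' actual code), the statistic can be reproduced from the published pathology-identification counts, using a hypothetical `chi_square_statistic` helper:

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Correct vs. incorrect pathology identifications per modality,
# taken from the reported results (US: 6/66, CT: 36/99, X-ray: 34/51).
pathology = [
    [6, 60],   # US
    [36, 63],  # CT
    [34, 17],  # X-ray
]

stat = chi_square_statistic(pathology)
print(f"chi-square = {stat:.1f}")  # ≈ 41.9
# With 2 degrees of freedom, a statistic this large gives p << 0.001,
# matching the significance level reported in the abstract.
```

In practice `scipy.stats.chi2_contingency` would also return the p-value directly; the manual version here only shows where the reported significance plausibly comes from.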
Pages: 1959-1965 (7 pages)