Re-evaluating GPT-4's bar exam performance

Cited: 4
|
Author
Martinez, Eric [1]
Affiliation
[1] MIT, Dept Brain & Cognit Sci, Cambridge, MA 02138 USA
Keywords
NLP; Legal NLP; Legal analytics; Natural language processing; Machine learning; Artificial intelligence; Artificial intelligence and law; Law and technology; Legal profession; LAW;
DOI
10.1007/s10506-024-09396-9
CLC Classification Number
TP18 [Theory of artificial intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Perhaps the most widely touted of GPT-4's at-launch, zero-shot capabilities has been its reported 90th-percentile performance on the Uniform Bar Exam. This paper begins by investigating the methodological challenges in documenting and verifying the 90th-percentile claim, presenting four sets of findings that indicate that OpenAI's estimates of GPT-4's UBE percentile are overinflated. First, although GPT-4's UBE score nears the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates are heavily skewed towards repeat test-takers who failed the July administration and score significantly lower than the general test-taking population. Second, data from a recent July administration of the same exam suggests GPT-4's overall UBE percentile was below the 69th percentile, and ∼48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4's performance against first-time test takers is estimated to be ∼62nd percentile, including ∼42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4's performance is estimated to drop to ∼48th percentile overall, and ∼15th percentile on essays. In addition to investigating the validity of the percentile claim, the paper also investigates the validity of GPT-4's reported scaled UBE score of 298. The paper successfully replicates the MBE score, but highlights several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the reported essay score. Finally, the paper investigates the effect of different hyperparameter combinations on GPT-4's MBE performance, finding no significant effect of adjusting temperature settings, and a significant effect of few-shot chain-of-thought prompting over basic zero-shot prompting. Taken together, these findings carry timely insights for the desirability and feasibility of outsourcing legally relevant tasks to AI models, as well as for the importance for AI developers to implement rigorous and transparent capabilities evaluations to help secure safe and trustworthy AI.
Pages: 24
Related papers
50 records in total
  • [41] Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study
    Jin, Hye Kyung
    Kim, Eunyoung
    JMIR MEDICAL EDUCATION, 2024, 10
  • [42] Evaluating the Clinical Reasoning of GPT-4, Grok, and Gemini in Different Fields of Cardiology
    Reyes-Rivera, Jonathan
    Molina, Alberto Castro
    Romero-Lorenzo, Marco
    Ali, Sajid
    Gibson, Charles
    Saucedo, Jorge
    Calandrelli, Matias
    Cruz, Edgar Garcia
    Bahit, Cecilia
    Chi, Gerald
    Angulo, Stephanie
    Moore, Michelle
    Lopez-Quijano, Juan M.
    Samman, Abdallah
    Gordillo-Moscoso, Antonio A.
    Ali, Asif
    CIRCULATION, 2024, 150
  • [43] The Need to Re-evaluate the Role of GPT-4 in Generating Radiology Reports
    Ray, Partha Pratim
    RADIOLOGY, 2023, 308 (02)
  • [44] Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study
    Takagi, Soshi
    Watari, Takashi
    Erabi, Ayano
    Sakaguchi, Kota
    JMIR MEDICAL EDUCATION, 2023, 9
  • [45] Re-evaluating Reid's Response to Skepticism
    McAllister, Blake
    JOURNAL OF SCOTTISH PHILOSOPHY, 2016, 14 (03) : 317 - 339
  • [46] Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts
    Jo, Eunbeen
    Song, Sanghoun
    Kim, Jong-Ho
    Lim, Subin
    Kim, Ju Hyeon
    Cha, Jung-Joon
    Kim, Young-Min
    Joo, Hyung Joon
    JMIR MEDICAL EDUCATION, 2024, 10
  • [47] Re-evaluating Word Mover's Distance
    Sato, Ryoma
    Yamada, Makoto
    Kashima, Hisashi
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022, : 19231 - 19249
  • [48] Is it terminal? Re-evaluating the master's degree
    Bartlett, AC
    JOURNAL OF THE MIDWEST MODERN LANGUAGE ASSOCIATION, 2004, 37 (02): : 26 - 29
  • [49] The Rapid Development of Artificial Intelligence: GPT-4's Performance on Orthopedic Surgery Board Questions
    Hofmann, Hayden L.
    Guerra, Gage A.
    Le, Jonathan L.
    Wong, Alexander M.
    Hofmann, Grady H.
    Mayfield, Cory K.
    Petrigliano, Frank A.
    Liu, Joseph N.
    ORTHOPEDICS, 2024, 47 (02) : e85 - e89
  • [50] Re-evaluating Rudolf Laban's choreutics
    Longstaff, JS
    PERCEPTUAL AND MOTOR SKILLS, 2000, 91 (01) : 191 - 210