What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Cited: 0
Authors
Guo, Taicheng [1 ]
Guo, Kehan [1 ]
Nan, Bozhao [1 ]
Liang, Zhenwen [1 ]
Guo, Zhichun [1 ]
Chawla, Nitesh V. [1 ]
Wiest, Olaf [1 ]
Zhang, Xiangliang [1 ]
Affiliations
[1] Univ Notre Dame, Notre Dame, IN 46556 USA
Funding
U.S. National Science Foundation
Keywords
GENERATION; SMILES;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large Language Models (LLMs) with strong abilities in natural language processing have emerged and been applied in diverse areas such as science, finance, and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, rather than pursuing state-of-the-art performance, we aim to evaluate the capabilities of LLMs across a wide range of tasks in the chemistry domain. We identify three key chemistry-related capabilities to explore in LLMs, namely understanding, reasoning, and explaining, and establish a benchmark containing eight chemistry tasks. Our analysis draws on widely recognized datasets, facilitating a broad exploration of the capacities of LLMs within the context of practical chemistry. Five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica) are evaluated on each chemistry task in zero-shot and few-shot in-context learning settings with carefully selected demonstration examples and specially crafted prompts. Our investigation found that GPT-4 outperformed the other models and that LLMs exhibit different levels of competitiveness across the eight chemistry tasks. Beyond the key findings from the comprehensive benchmark analysis, our work provides insights into the limitations of current LLMs and the impact of in-context learning settings on LLMs' performance across various chemistry tasks. The code and datasets used in this study are available at https://github.com/ChemFoundationModels/ChemLLMBench.
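The zero-shot and few-shot in-context learning settings described in the abstract can be sketched as prompt construction: the zero-shot prompt poses the task directly, while the few-shot prompt prepends demonstration examples. The sketch below is illustrative only; the task wording, example pairs, and helper names are assumptions, not the authors' actual prompts from the benchmark.

```python
# Illustrative sketch of zero-shot vs. few-shot prompt construction for a
# chemistry task (name-to-SMILES translation); not the benchmark's real prompts.

ZERO_SHOT_TEMPLATE = (
    "You are an expert chemist.\n"
    "Task: convert the molecule name to its SMILES string.\n"
    "Name: {query}\nSMILES:"
)

def build_few_shot_prompt(demonstrations, query):
    """Prepend (name, SMILES) demonstration pairs before the query,
    so the model can learn the task format in context."""
    lines = [
        "You are an expert chemist.",
        "Task: convert the molecule name to its SMILES string.",
    ]
    for name, smiles in demonstrations:
        lines.append(f"Name: {name}\nSMILES: {smiles}")
    lines.append(f"Name: {query}\nSMILES:")
    return "\n".join(lines)

# Hypothetical demonstration pairs; a real run would select these carefully
# from the task's training split, as the abstract notes.
demos = [("ethanol", "CCO"), ("benzene", "c1ccccc1")]
zero_shot_prompt = ZERO_SHOT_TEMPLATE.format(query="acetic acid")
few_shot_prompt = build_few_shot_prompt(demos, "acetic acid")
print(few_shot_prompt)
```

Either prompt string would then be sent to the evaluated model; the benchmark compares the two settings by scoring the completions each elicits.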
Pages: 27
Related Papers
50 records in total
  • [31] Evaluating Large Language Models on Controlled Generation Tasks
    Sun, Jiao
    Tian, Yufei
    Zhou, Wangchunshu
    Xu, Nan
    Hu, Qian
    Gupta, Rahul
    Wieting, John
    Peng, Nanyun
    Ma, Xuezhe
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 3155 - 3168
  • [32] Evaluating large language models in theory of mind tasks
    Kosinski, Michal
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2024, 121 (45)
  • [33] Facilitating Autonomous Driving Tasks With Large Language Models
    Wu, Mengyao
    Yu, F. Richard
    Liu, Peter Xiaoping
    He, Ying
    IEEE INTELLIGENT SYSTEMS, 2025, 40 (01) : 45 - 52
  • [34] Robustness of GPT Large Language Models on Natural Language Processing Tasks
    Xuanting C.
    Junjie Y.
    Can Z.
    Nuo X.
    Tao G.
    Qi Z.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2024, 61 (05): : 1128 - 1142
  • [35] CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
    Yu, Linhao
    Leng, Yongqi
    Huang, Yufei
    Wu, Shang
    Liu, Haixin
    Ji, Xinmeng
    Zhao, Jiahui
    Song, Jinwang
    Cui, Tingting
    Cheng, Xiaoqing
    Liu, Tao
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 11817 - 11837
  • [36] Large Language Models on Graphs: A Comprehensive Survey
    Jin, Bowen
    Liu, Gang
    Han, Chi
    Jiang, Meng
    Ji, Heng
    Han, Jiawei
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (12) : 8622 - 8642
  • [37] LLMBox: A Comprehensive Library for Large Language Models
    Tang, Tianyi
    Hui, Yiwen
    Li, Bingqian
    Lu, Wenyang
    Qin, Zijing
    Sun, Haoxiang
    Wang, Jiapeng
    Xu, Shiyi
    Cheng, Xiaoxue
    Guo, Geyang
    Peng, Han
    Zheng, Bowen
    Tang, Yiru
    Min, Yingqian
    Chen, Yushuo
    Chen, Jie
    Zhao, Yuanqian
    Ding, Luran
    Wang, Yuhao
    Dong, Zican
    Xia, Chunxuan
    Li, Junyi
    Zhou, Kun
    Zhao, Wayne Xin
    Wen, Ji-Rong
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 3: SYSTEM DEMONSTRATIONS, 2024, : 388 - 399
  • [38] Large Language Models: A Comprehensive Guide for Radiologists
    Kim, Sunkyu
    Lee, Choong-kun
    Kim, Seung-seob
    JOURNAL OF THE KOREAN SOCIETY OF RADIOLOGY, 2024, 85 (05): : 861 - 882
  • [39] BioCoder: a benchmark for bioinformatics code generation with large language models
    Tang, Xiangru
    Qian, Bill
    Gao, Rick
    Chen, Jiakang
    Chen, Xinyun
    Gerstein, Mark B.
    BIOINFORMATICS, 2024, 40 : i266 - i276
  • [40] SafeLLMs: A Benchmark for Secure Bilingual Evaluation of Large Language Models
    Liang, Wenhan
    Wu, Huijia
    Gao, Jun
    Shang, Yuhu
    He, Zhaofeng
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT II, NLPCC 2024, 2025, 15360 : 437 - 448