What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Cited: 0
Authors
Guo, Taicheng [1 ]
Guo, Kehan [1 ]
Nan, Bozhao [1 ]
Liang, Zhenwen [1 ]
Guo, Zhichun [1 ]
Chawla, Nitesh V. [1 ]
Wiest, Olaf [1 ]
Zhang, Xiangliang [1 ]
Affiliation
[1] Univ Notre Dame, Notre Dame, IN 46556 USA
Funding
U.S. National Science Foundation
Keywords
GENERATION; SMILES;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) with strong natural language processing abilities have emerged and been applied in areas such as science, finance, and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, rather than pursuing state-of-the-art performance, we aim to evaluate the capabilities of LLMs across a wide range of tasks in the chemistry domain. We identify three key chemistry-related capabilities to explore in LLMs, namely understanding, reasoning, and explaining, and establish a benchmark containing eight chemistry tasks. Our analysis draws on widely recognized datasets, facilitating a broad exploration of the capacities of LLMs within the context of practical chemistry. Five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica) are evaluated on each chemistry task in zero-shot and few-shot in-context learning settings, with carefully selected demonstration examples and specially crafted prompts. Our investigation found that GPT-4 outperformed the other models and that the LLMs exhibit different levels of competence across the eight chemistry tasks. Beyond the key findings from the comprehensive benchmark analysis, our work provides insights into the limitations of current LLMs and the impact of in-context learning settings on LLM performance across various chemistry tasks. The code and datasets used in this study are available at https://github.com/ChemFoundationModels/ChemLLMBench.
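The abstract describes evaluating each model in zero-shot and few-shot in-context learning settings with selected demonstration examples and crafted prompts. As a rough illustration of what such an evaluation setup involves, the sketch below assembles a zero- or few-shot prompt for a SMILES name-prediction task and leaves the model call as a stub. The task wording, the demonstration molecules, and the query_llm helper are illustrative assumptions, not the benchmark's actual code; the real prompts and datasets are in the linked ChemLLMBench repository.

```python
# Minimal sketch of a zero-/few-shot in-context-learning prompt builder,
# loosely following the setup described in the abstract. All task wording,
# demonstration examples, and the query_llm() stub are illustrative
# assumptions; the actual prompts live in the ChemLLMBench repository.

# Hypothetical demonstration pool: (SMILES, common name) pairs.
DEMONSTRATIONS = [
    ("CCO", "ethanol"),
    ("CC(=O)O", "acetic acid"),
    ("c1ccccc1", "benzene"),
]

def build_prompt(query_smiles: str, k: int = 0) -> str:
    """Build a zero-shot (k=0) or few-shot (k>0) prompt for name prediction."""
    lines = ["You are an expert chemist. Given a SMILES string, "
             "answer with the compound's common name only."]
    for smiles, name in DEMONSTRATIONS[:k]:        # k in-context examples
        lines.append(f"SMILES: {smiles}\nName: {name}")
    lines.append(f"SMILES: {query_smiles}\nName:")  # the actual query
    return "\n\n".join(lines)

def query_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g., a GPT-4 or Llama client)."""
    raise NotImplementedError("wire up your model API here")

if __name__ == "__main__":
    print(build_prompt("CC(C)O", k=2))  # inspect a 2-shot prompt
```

The abstract notes that demonstration examples were carefully selected; the fixed list above is only the simplest stand-in for that selection step, and scoring would then compare each model completion against the ground-truth label for the task.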
Pages: 27