What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Authors
Guo, Taicheng [1 ]
Guo, Kehan [1 ]
Nan, Bozhao [1 ]
Liang, Zhenwen [1 ]
Guo, Zhichun [1 ]
Chawla, Nitesh V. [1 ]
Wiest, Olaf [1 ]
Zhang, Xiangliang [1 ]
Affiliations
[1] Univ Notre Dame, Notre Dame, IN 46556 USA
Funding
National Science Foundation (USA);
Keywords
GENERATION; SMILES;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large Language Models (LLMs) with strong abilities in natural language processing have emerged and been applied in areas such as science, finance, and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, rather than pursuing state-of-the-art performance, we aim to evaluate the capabilities of LLMs across a wide range of tasks in the chemistry domain. We identify three key chemistry-related capabilities to explore in LLMs, namely understanding, reasoning, and explaining, and establish a benchmark comprising eight chemistry tasks. Our analysis draws on widely recognized datasets, facilitating a broad exploration of the capacities of LLMs within the context of practical chemistry. Five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica) are evaluated on each chemistry task in zero-shot and few-shot in-context learning settings with carefully selected demonstration examples and specially crafted prompts. Our investigation found that GPT-4 outperformed the other models and that LLMs exhibit different levels of competitiveness across the eight chemistry tasks. Beyond the key findings of the benchmark analysis, our work provides insights into the limitations of current LLMs and the impact of in-context learning settings on their performance across various chemistry tasks. The code and datasets used in this study are available at https://github.com/ChemFoundationModels/ChemLLMBench.
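The zero-shot versus few-shot in-context learning setup described in the abstract can be sketched as follows. This is a minimal illustration of how such prompts are typically assembled, not the paper's actual prompt templates; the task, instruction wording, and SMILES/name pairs below are hypothetical examples chosen for clarity.

```python
def build_prompt(task_instruction, demonstrations, query):
    """Assemble an in-context-learning prompt for an LLM.

    demonstrations: list of (input, output) pairs shown to the model
    before the query. An empty list yields a zero-shot prompt; a
    non-empty list yields a few-shot prompt.
    """
    parts = [task_instruction]
    for x, y in demonstrations:
        parts.append(f"Input: {x}\nOutput: {y}")
    # The query is left with an empty Output slot for the model to fill.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Hypothetical demonstrations for a SMILES -> common-name task.
demos = [
    ("CCO", "ethanol"),
    ("CC(=O)O", "acetic acid"),
]
instruction = "Convert the SMILES string to its common chemical name."

zero_shot = build_prompt(instruction, [], "c1ccccc1")
few_shot = build_prompt(instruction, demos, "c1ccccc1")
print(few_shot)
```

In the few-shot case the selected demonstrations are prepended before the query, which is the mechanism the benchmark varies between its zero-shot and few-shot settings.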
Pages: 27