What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Cited: 0
Authors
Guo, Taicheng [1 ]
Guo, Kehan [1 ]
Nan, Bozhao [1 ]
Liang, Zhenwen [1 ]
Guo, Zhichun [1 ]
Chawla, Nitesh V. [1 ]
Wiest, Olaf [1 ]
Zhang, Xiangliang [1 ]
Affiliations
[1] Univ Notre Dame, Notre Dame, IN 46556 USA
Funding
U.S. National Science Foundation
Keywords
GENERATION; SMILES;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large Language Models (LLMs) with strong abilities in natural language processing have emerged and been applied in diverse areas such as science, finance, and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, rather than pursuing state-of-the-art performance, we aim to evaluate the capabilities of LLMs across a wide range of tasks in the chemistry domain. We identify three key chemistry-related capabilities to explore in LLMs, namely understanding, reasoning, and explaining, and establish a benchmark containing eight chemistry tasks. Our analysis draws on widely recognized datasets, facilitating a broad exploration of the capacities of LLMs within the context of practical chemistry. Five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica) are evaluated on each chemistry task in zero-shot and few-shot in-context learning settings with carefully selected demonstration examples and specially crafted prompts. Our investigation found that GPT-4 outperformed the other models and that LLMs exhibit different levels of competitiveness across the eight chemistry tasks. Beyond the key findings from the comprehensive benchmark analysis, our work provides insights into the limitations of current LLMs and the impact of in-context learning settings on LLMs' performance across various chemistry tasks. The code and datasets used in this study are available at https://github.com/ChemFoundationModels/ChemLLMBench.
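The zero-shot and few-shot in-context learning settings described in the abstract can be sketched as prompt construction: the zero-shot prompt poses the task directly, while the few-shot prompt prepends demonstration examples. The sketch below is illustrative only; the task wording, example pairs, and helper names are assumptions, not the authors' actual prompts from the benchmark.

```python
# Illustrative sketch of zero-shot vs. few-shot prompt construction for a
# chemistry task (name-to-SMILES translation); not the benchmark's real prompts.

ZERO_SHOT_TEMPLATE = (
    "You are an expert chemist.\n"
    "Task: convert the molecule name to its SMILES string.\n"
    "Name: {query}\nSMILES:"
)

def build_few_shot_prompt(demonstrations, query):
    """Prepend (name, SMILES) demonstration pairs before the query,
    so the model can learn the task format in context."""
    lines = [
        "You are an expert chemist.",
        "Task: convert the molecule name to its SMILES string.",
    ]
    for name, smiles in demonstrations:
        lines.append(f"Name: {name}\nSMILES: {smiles}")
    lines.append(f"Name: {query}\nSMILES:")
    return "\n".join(lines)

# Hypothetical demonstration pairs; a real run would select these carefully
# from the task's training split, as the abstract notes.
demos = [("ethanol", "CCO"), ("benzene", "c1ccccc1")]
zero_shot_prompt = ZERO_SHOT_TEMPLATE.format(query="acetic acid")
few_shot_prompt = build_few_shot_prompt(demos, "acetic acid")
print(few_shot_prompt)
```

Either prompt string would then be sent to the evaluated model; the benchmark compares the two settings by scoring the completions each elicits.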
Pages: 27
Related Papers
50 records in total
  • [31] Evaluating Large Language Models on Controlled Generation Tasks
    Sun, Jiao
    Tian, Yufei
    Zhou, Wangchunshu
    Xu, Nan
    Hu, Qian
    Gupta, Rahul
    Wieting, John
    Peng, Nanyun
    Ma, Xuezhe
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 3155 - 3168
  • [32] Evaluating large language models in theory of mind tasks
    Kosinski, Michal
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2024, 121 (45)
  • [33] Facilitating Autonomous Driving Tasks With Large Language Models
    Wu, Mengyao
    Yu, F. Richard
    Liu, Peter Xiaoping
    He, Ying
    IEEE INTELLIGENT SYSTEMS, 2025, 40 (01) : 45 - 52
  • [34] Robustness of GPT Large Language Models on Natural Language Processing Tasks
    Xuanting C.
    Junjie Y.
    Can Z.
    Nuo X.
    Tao G.
    Qi Z.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2024, 61 (05): : 1128 - 1142
  • [35] CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
    Yu, Linhao
    Leng, Yongqi
    Huang, Yufei
    Wu, Shang
    Liu, Haixin
    Ji, Xinmeng
    Zhao, Jiahui
    Song, Jinwang
    Cui, Tingting
    Cheng, Xiaoqing
    Liu, Tao
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 11817 - 11837
  • [36] Large Language Models on Graphs: A Comprehensive Survey
    Jin, Bowen
    Liu, Gang
    Han, Chi
    Jiang, Meng
    Ji, Heng
    Han, Jiawei
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (12) : 8622 - 8642
  • [37] LLMBox: A Comprehensive Library for Large Language Models
    Tang, Tianyi
    Hui, Yiwen
    Li, Bingqian
    Lu, Wenyang
    Qin, Zijing
    Sun, Haoxiang
    Wang, Jiapeng
    Xu, Shiyi
    Cheng, Xiaoxue
    Guo, Geyang
    Peng, Han
    Zheng, Bowen
    Tang, Yiru
    Min, Yingqian
    Chen, Yushuo
    Chen, Jie
    Zhao, Yuanqian
    Ding, Luran
    Wang, Yuhao
    Dong, Zican
    Xia, Chunxuan
    Li, Junyi
    Zhou, Kun
    Zhao, Wayne Xin
    Wen, Ji-Rong
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 3: SYSTEM DEMONSTRATIONS, 2024, : 388 - 399
  • [38] Large Language Models: A Comprehensive Guide for Radiologists
    Kim, Sunkyu
    Lee, Choong-kun
    Kim, Seung-seob
    JOURNAL OF THE KOREAN SOCIETY OF RADIOLOGY, 2024, 85 (05): : 861 - 882
  • [39] BioCoder: a benchmark for bioinformatics code generation with large language models
    Tang, Xiangru
    Qian, Bill
    Gao, Rick
    Chen, Jiakang
    Chen, Xinyun
    Gerstein, Mark B.
    BIOINFORMATICS, 2024, 40 : i266 - i276
  • [40] SafeLLMs: A Benchmark for Secure Bilingual Evaluation of Large Language Models
    Liang, Wenhan
    Wu, Huijia
    Gao, Jun
    Shang, Yuhu
    He, Zhaofeng
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT II, NLPCC 2024, 2025, 15360 : 437 - 448