Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

Cited by: 0
Authors
Chen, Yihan [1 ]
Xu, Benfeng [1 ]
Wang, Quan [2 ]
Liu, Yi [3 ]
Mao, Zhendong [1 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Peoples R China
[2] Beijing Univ Posts & Telecommun, MOE Key Lab Trustworthy Distributed Comp & Serv, Beijing, Peoples R China
[3] State Key Lab Commun Content Cognit Peoples Daily, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
N/A
CLC classification
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of LLMs. To fill this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraint-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression, and we carefully design the candidate task taxonomy with even finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further development. Different from existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time. We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints, as well as a significant gap between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval.
Pages: 17808 - 17816 (9 pages)