Evaluating Large Language Models on Controlled Generation Tasks

Cited by: 0
Authors
Sun, Jiao [1 ]
Tian, Yufei [2 ]
Zhou, Wangchunshu [3 ]
Xu, Nan [1 ]
Hu, Qian [4 ]
Gupta, Rahul [4 ]
Wieting, John [5 ]
Peng, Nanyun [2 ]
Ma, Xuezhe [1 ]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
[2] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[3] Swiss Fed Inst Technol, Zurich, Switzerland
[4] Amazon, Seattle, WA USA
[5] Google DeepMind, London, England
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While recent studies have examined the abilities of large language models on various benchmark tasks, few have investigated their controllability on generation tasks. We present a systematic and extensive analysis of the controllability of large language models on ten benchmarks, including a new, simple yet challenging numerical planning benchmark with different granularities. Comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, match, or exceed the ability of smaller models. We conclude that large language models struggle to meet fine-grained hard constraints.
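The abstract's notion of a fine-grained hard constraint can be made concrete with a small check: given a target count and a granularity (e.g. words or sentences), a generated text either satisfies the constraint exactly or fails. The function names and the naive sentence splitter below are illustrative assumptions for a minimal sketch, not the paper's actual benchmark code:

```python
# Sketch: verifying a hard numerical constraint on generated text
# at two granularities (word-level and sentence-level).
def count_units(text: str, granularity: str) -> int:
    """Count units of the requested granularity in the text."""
    if granularity == "word":
        return len(text.split())
    if granularity == "sentence":
        # Naive split on terminal punctuation; real evaluation
        # would use a proper sentence tokenizer.
        normalized = text.replace("!", ".").replace("?", ".")
        return len([s for s in normalized.split(".") if s.strip()])
    raise ValueError(f"unknown granularity: {granularity}")

def meets_constraint(text: str, target: int, granularity: str) -> bool:
    """Hard constraint: the count must match the target exactly."""
    return count_units(text, granularity) == target

print(meets_constraint("One two three four five", 5, "word"))  # True
```

Under a scheme like this, "fine-grained" constraints (exact word counts) are strictly harder to satisfy than coarse ones (exact sentence counts), which matches the paper's finding that models struggle most at the finest granularity.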
Pages: 3155-3168 (14 pages)