CodeScore: Evaluating Code Generation by Learning Code Execution

Cited: 0
Authors
Dong, Yihong [1 ,2 ]
Ding, Jiazheng [1 ,2 ]
Jiang, Xue [1 ,2 ]
Li, Ge [1 ,2 ]
Li, Zhuo [1 ,2 ]
Jin, Zhi [1 ,2 ]
Affiliations
[1] Peking Univ, Key Lab High Confidence Software Technol, Minist Educ, Beijing, Peoples R China
[2] Peking Univ, Sch Comp Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Code Evaluation; Code Pre-trained Language Model; Code Generation;
DOI
10.1145/3695991
CLC Classification Number
TP31 [Computer Software];
Subject Classification Codes
081202; 0835;
Abstract
A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, an important research field in both NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from two significant drawbacks. (1) They primarily measure surface-level differences between code snippets without considering functional equivalence. Functional equivalence is pivotal in evaluating code generation, however, because different code can perform identical operations. (2) They are designed predominantly for the Ref-only input format, whereas code evaluation requires versatility across input formats: besides Ref-only, there are NL-only and Ref and NL formats, which existing match-based CEMs cannot effectively accommodate. In this article, we propose CodeScore, a large language model (LLM)-based CEM that estimates the functional correctness of generated code across all three input formats. To obtain CodeScore, we present UniCE, a unified code generation learning framework through which LLMs learn code execution (i.e., the PassRatio and Executability of generated code) from unified input. Extensive experimental results on multiple code evaluation datasets demonstrate that CodeScore improves correlation with functional correctness by up to 58.87% (absolute) compared with other CEMs, achieves state-of-the-art performance, and effectively handles all three input formats.
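The two execution-based training signals named in the abstract, PassRatio (the fraction of test cases a generated program passes) and Executability (whether the code runs at all without error), can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the function names `executability` and `pass_ratio` and the test-case format are assumptions introduced here for clarity.

```python
def executability(code: str) -> float:
    """Return 1.0 if the code executes without raising an error, else 0.0."""
    try:
        exec(code, {})  # run in a fresh namespace
        return 1.0
    except Exception:
        return 0.0


def pass_ratio(code: str, func_name: str, test_cases) -> float:
    """Fraction of (args, expected) test cases the generated function passes."""
    env = {}
    try:
        exec(code, env)  # define the generated function
    except Exception:
        return 0.0
    fn = env.get(func_name)
    if not callable(fn):
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed / len(test_cases)


# Example: a generated `add` that is wrong on negative second arguments.
generated = "def add(a, b):\n    return a + abs(b)\n"
tests = [((1, 2), 3), ((0, 0), 0), ((2, -1), 1)]
print(executability(generated))             # 1.0 (it runs without error)
print(pass_ratio(generated, "add", tests))  # 2/3 (fails the negative case)
```

In practice, untrusted generated code would be executed in a sandbox with time and memory limits rather than via `exec` in-process; the sketch above only shows how the two labels are defined.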
Pages: 22
Related Papers
50 records
  • [1] Evaluating How Novices Utilize Debuggers and Code Execution to Understand Code
    Hassan, Mohammed
    Zeng, Grace
    Zilles, Craig
    20TH ANNUAL ACM CONFERENCE ON INTERNATIONAL COMPUTING EDUCATION RESEARCH, ICER 2024, VOL 1, 2024, : 65 - 83
  • [2] CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
    Zhou, Shuyan
    Alon, Uri
    Agarwal, Sumit
    Neubig, Graham
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 13921 - 13937
  • [3] Dynamic Reverse Code Generation for Backward Execution
    Lee, Jooyong
    ELECTRONIC NOTES IN THEORETICAL COMPUTER SCIENCE, 2007, 174 (04) : 37 - 54
  • [4] Autogenerator: Generation and execution of programming code on demand
    Magdalenic, Ivan
    Radosevic, Danijel
    Orehovacki, Tihomir
    EXPERT SYSTEMS WITH APPLICATIONS, 2013, 40 (08) : 2845 - 2857
  • [5] Exploring and Evaluating Personalized Models for Code Generation
    Zlotchevski, Andrei
    Drain, Dawn
    Svyatkovskiy, Alexey
    Clement, Colin B.
    Sundaresan, Neel
    Tufano, Michele
    PROCEEDINGS OF THE 30TH ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2022, 2022, : 1500 - 1508
  • [6] Evaluating Social Bias in Code Generation Models
    Ling, Lin
    COMPANION PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, FSE COMPANION 2024, 2024, : 695 - 697
  • [7] Exploring and Evaluating Personalized Models for Code Generation
    Zlotchevski, Andrei
    Drain, Dawn
    Svyatkovskiy, Alexey
    Clement, Colin
    Sundaresan, Neel
    Tufano, Michele
    arXiv, 2022,
  • [8] Deep learning for code generation: a survey
    Zhang, Huangzhao
    Zhang, Kechi
    Li, Zhuo
    Li, Jia
    Li, Yongmin
    Zhao, Yunfei
    Zhu, Yuqi
    Liu, Fang
    Li, Ge
    Jin, Zhi
    SCIENCE CHINA-INFORMATION SCIENCES, 2024, 67 (09)
  • [9] Deep learning for code generation: a survey
    Zhang, Huangzhao
    Zhang, Kechi
    Li, Zhuo
    Li, Jia
    Li, Jia
    Li, Yongmin
    Zhao, Yunfei
    Zhu, Yuqi
    Liu, Fang
    Li, Ge
    Jin, Zhi
    Science China Information Sciences, 2024, 67 (09): 5 - 40
  • [10] Learning and Evaluating Contextual Embedding of Source Code
    Kanade, Aditya
    Maniatis, Petros
    Balakrishnan, Gogul
    Shi, Kensen
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 119, 2020, 119