CodeScore: Evaluating Code Generation by Learning Code Execution

Authors
Dong, Yihong [1, 2]
Ding, Jiazheng [1, 2]
Jiang, Xue [1, 2]
Li, Ge [1, 2]
Li, Zhuo [1, 2]
Jin, Zhi [1, 2]
Affiliations
[1] Peking Univ, Key Lab High Confidence Software Technol, Minist Educ, Beijing, Peoples R China
[2] Peking Univ, Sch Comp Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Code Evaluation; Code Pre-trained Language Model; Code Generation
DOI
10.1145/3695991
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from two significant drawbacks. (1) They primarily measure surface differences between pieces of code without considering functional equivalence, yet functional equivalence is pivotal in evaluating code generation, because different code can perform identical operations. (2) They are predominantly designed for the Ref-only input format, yet code evaluation requires versatility in input formats: aside from Ref-only, there are NL-only and Ref and NL formats, which existing match-based CEMs cannot effectively accommodate. In this article, we propose CodeScore, a large language model (LLM)-based CEM that estimates the functional correctness of generated code on all three input types. To acquire CodeScore, we present UniCE, a unified code generation learning framework, for LLMs to learn code execution (i.e., learning the PassRatio and Executability of generated code) with unified input. Extensive experimental results on multiple code evaluation datasets demonstrate that CodeScore improves correlation with functional correctness by up to 58.87% in absolute terms compared to other CEMs, achieves state-of-the-art performance, and effectively handles the three input formats.
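The abstract names PassRatio and Executability as the execution signals the model learns, without defining them. As a rough illustration only, the Python sketch below assumes PassRatio is the fraction of test cases a generated function passes and Executability is a 0/1 flag for whether the code loads and runs on every test input without raising. The helper names (`load_function`, `pass_ratio`, `executability`) and the test-case layout are hypothetical and are not taken from the article. The sketch computes these values by actually running the code; it does not show how CodeScore estimates them with an LLM.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical test-case layout: (positional arguments, expected output).
TestCase = Tuple[tuple, Any]


def load_function(source: str, name: str) -> Callable:
    """Execute `source` in a fresh namespace and return the function `name`."""
    namespace: dict = {}
    # Illustration only: never exec untrusted code outside a sandbox.
    exec(source, namespace)
    return namespace[name]


def pass_ratio(source: str, name: str, tests: List[TestCase]) -> float:
    """Assumed definition: fraction of test cases whose output matches."""
    if not tests:
        raise ValueError("at least one test case is required")
    try:
        func = load_function(source, name)
    except Exception:
        return 0.0  # code that cannot be loaded passes nothing
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one input counts as a failed test
    return passed / len(tests)


def executability(source: str, name: str, tests: List[TestCase]) -> int:
    """Assumed definition: 1 if the code runs on all inputs without raising."""
    try:
        func = load_function(source, name)
        for args, _ in tests:
            func(*args)
    except Exception:
        return 0
    return 1


if __name__ == "__main__":
    tests = [((1, 2), 3), ((0, 0), 0), ((2, 2), 4), ((5, 0), 5)]
    swapped = "def add(a, b):\n    return b + a\n"   # differs on the surface
    buggy = "def add(a, b):\n    return a - b\n"     # runs, but is wrong
    broken = "def add(a, b) return a + b"            # syntax error
    for label, src in [("swapped", swapped), ("buggy", buggy), ("broken", broken)]:
        print(label, pass_ratio(src, "add", tests), executability(src, "add", tests))
```

In this sketch the `swapped` variant differs textually from `return a + b` yet passes every test, which is the gap between surface matching and functional equivalence that the abstract describes.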
Pages: 22