Evaluating the Performance of Code Generation Models for Solving Parsons Problems With Small Prompt Variations

Cited by: 30
Authors
Reeves, Brent [1 ]
Sarsa, Sami [2 ]
Prather, James [1 ]
Denny, Paul [3 ]
Becker, Brett A. [4 ]
Hellas, Arto [2 ]
Kimmel, Bailey [1 ]
Powell, Garrett [1 ]
Leinonen, Juho [3 ]
Affiliations
[1] Abilene Christian Univ, Abilene, TX 79699 USA
[2] Aalto Univ, Espoo, Finland
[3] Univ Auckland, Auckland, New Zealand
[4] Univ Coll Dublin, Dublin, Ireland
Keywords
academic integrity; AI; artificial intelligence; ChatGPT; code generation; code writing; Codex; computer programming; Copilot; CS1; deep learning; generative AI; introductory programming; GitHub; GPT-3; large language models; machine learning; ML; neural networks; natural language processing; novice programming; OpenAI
DOI
10.1145/3587102.3588805
CLC number
G40 [Education]
Discipline codes
040101; 120403
Abstract
The recent emergence of code generation tools powered by large language models has attracted wide attention. Models such as OpenAI Codex can take natural language problem descriptions as input and generate highly accurate source code solutions, with potentially significant implications for computing education. Given the many complexities that students face when learning to write code, they may quickly become reliant on such tools without properly understanding the underlying concepts. One popular approach for scaffolding the code writing process is to use Parsons problems, which present solution lines of code in a scrambled order. These remove the complexities of low-level syntax, and allow students to focus on algorithmic and design-level problem solving. It is unclear how well code generation models can be applied to solve Parsons problems, given the mechanics of these models and prior evidence that they underperform when problems include specific restrictions. In this paper, we explore the performance of the Codex model for solving Parsons problems over various prompt variations. Using a corpus of Parsons problems we sourced from the computing education literature, we find that Codex successfully reorders the problem blocks about half of the time, a much lower rate of success when compared to prior work on more free-form programming tasks. Regarding prompts, we find that small variations in prompting have a noticeable effect on model performance, although the effect is not as pronounced as between different problems.
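The setup the abstract describes can be made concrete with a minimal sketch: a Parsons problem presents a correct solution's lines in scrambled order, the model is prompted to reorder them, and a response counts as correct only if it reproduces the original line order using every block. The prompt wording and helper names below are illustrative assumptions, not the paper's actual prompt variants.

```python
import random

def make_parsons_prompt(solution_lines, seed=1):
    """Scramble a solution's lines and wrap them in a natural-language
    prompt asking a model to reorder them. The prompt wording here is
    a hypothetical example, not one of the paper's tested variations."""
    scrambled = list(solution_lines)
    random.Random(seed).shuffle(scrambled)
    header = ("Rearrange the following lines of code into a correct "
              "program. Use every line exactly once:\n")
    return header + "\n".join(scrambled), scrambled

def is_correct_reordering(candidate_lines, solution_lines):
    # A Parsons problem is solved only when the model's output is
    # exactly the solution's lines in their original order.
    return list(candidate_lines) == list(solution_lines)

solution = [
    "def mean(values):",
    "    total = sum(values)",
    "    return total / len(values)",
]
prompt, scrambled = make_parsons_prompt(solution)

# Scrambling permutes the blocks but never adds or removes any.
assert sorted(scrambled) == sorted(solution)
```

Because the blocks constrain the answer to a fixed set of lines, grading reduces to an exact-order comparison, which is why restrictions like "use every line exactly once" can trip up a free-form generation model.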
Pages: 299-305
Page count: 7