Evaluating the Performance of Code Generation Models for Solving Parsons Problems With Small Prompt Variations

Cited by: 30
Authors
Reeves, Brent [1 ]
Sarsa, Sami [2 ]
Prather, James [1 ]
Denny, Paul [3 ]
Becker, Brett A. [4 ]
Hellas, Arto [2 ]
Kimmel, Bailey [1 ]
Powell, Garrett [1 ]
Leinonen, Juho [3 ]
Affiliations
[1] Abilene Christian Univ, Abilene, TX 79699 USA
[2] Aalto Univ, Espoo, Finland
[3] Univ Auckland, Auckland, New Zealand
[4] Univ Coll Dublin, Dublin, Ireland
Keywords
academic integrity; AI; artificial intelligence; ChatGPT; code generation; code writing; Codex; computer programming; Copilot; CS1; deep learning; generative AI; introductory programming; GitHub; GPT-3; large language models; machine learning; ML; neural networks; natural language processing; novice programming; OpenAI;
DOI
10.1145/3587102.3588805
Chinese Library Classification
G40 [Education]
Discipline Classification Codes
040101; 120403
Abstract
The recent emergence of code generation tools powered by large language models has attracted wide attention. Models such as OpenAI Codex can take natural language problem descriptions as input and generate highly accurate source code solutions, with potentially significant implications for computing education. Given the many complexities that students face when learning to write code, they may quickly become reliant on such tools without properly understanding the underlying concepts. One popular approach for scaffolding the code writing process is to use Parsons problems, which present the lines of a solution in scrambled order. These problems remove the complexities of low-level syntax and allow students to focus on algorithmic and design-level problem solving. It is unclear how well code generation models can solve Parsons problems, given the mechanics of these models and prior evidence that they underperform when problems include specific restrictions. In this paper, we explore the performance of the Codex model on Parsons problems across various prompt variations. Using a corpus of Parsons problems sourced from the computing education literature, we find that Codex successfully reorders the problem blocks about half of the time, a much lower success rate than reported in prior work on more free-form programming tasks. Regarding prompts, we find that small variations in prompting have a noticeable effect on model performance, although this effect is less pronounced than the variation between different problems.
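To make the task concrete, the sketch below shows how a Parsons problem might be serialized as a plain-text prompt for a code model, together with two small instruction variations of the kind the study manipulates. The task, line set, prompt wording, and all names (SCRAMBLED, PROMPT_A, PROMPT_B, build_prompt, average) are hypothetical illustrations; they are not drawn from the paper's corpus or its experimental setup.

# A minimal, hypothetical sketch; the problem and prompt wording below
# are invented for illustration and do not come from the paper.

SCRAMBLED = [
    "        total += n",
    "def average(numbers):",
    "    return total / len(numbers)",
    "    total = 0",
    "    for n in numbers:",
]

# Two "small prompt variations": the same task framed with slightly
# different instructions, in the spirit of varying prompt phrasing.
PROMPT_A = "Rearrange the following lines into a correct Python program:\n"
PROMPT_B = "Reorder these lines (without adding or removing any) so the function computes an average:\n"

def build_prompt(instruction, lines):
    # Serialize the scrambled lines beneath the instruction text.
    return instruction + "\n".join(lines)

print(build_prompt(PROMPT_A, SCRAMBLED))

# One correct ordering of exactly the same lines, shown as real code:
def average(numbers):
    total = 0
    for n in numbers:
        total += n
    return total / len(numbers)

assert average([2, 4, 6]) == 4.0

Note that a correct solution must be a permutation of exactly the given lines; output that rewrites or adds lines does not count as solving the Parsons problem, which may help explain why success rates on free-form code writing tasks do not transfer directly.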
Pages: 299-305
Page count: 7
Related Papers
38 items in total (first 10 shown)
  • [1] Solving Parsons Problems Versus Fixing and Writing Code
    Ericson, Barbara J.
    Margulieux, Lauren E.
    Rick, Jochen
    17TH KOLI CALLING INTERNATIONAL CONFERENCE ON COMPUTING EDUCATION RESEARCH (KOLI CALLING 2017), 2017, : 20 - 29
  • [2] CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
    Zhou, Shuyan
    Alon, Uri
    Agarwal, Sumit
    Neubig, Graham
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 13921 - 13937
  • [3] Exploring and Evaluating Personalized Models for Code Generation
    Zlotchevski, Andrei
    Drain, Dawn
    Svyatkovskiy, Alexey
    Clement, Colin B.
    Sundaresan, Neel
    Tufano, Michele
    PROCEEDINGS OF THE 30TH ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2022, 2022, : 1500 - 1508
  • [4] Exploring and Evaluating Personalized Models for Code Generation
    Zlotchevski, Andrei
    Drain, Dawn
    Svyatkovskiy, Alexey
    Clement, Colin
    Sundaresan, Neel
    Tufano, Michele
    arXiv, 2022
  • [5] Evaluating Social Bias in Code Generation Models
    Ling, Lin
    COMPANION PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, FSE COMPANION 2024, 2024, : 695 - 697
  • [6] Parsons Problems to Scaffold Code Writing: Impact on Performance and Problem-Solving Efficiency
    Hou, Xinying
    Ericson, Barbara Jane
    Wang, Xu
    PROCEEDINGS OF THE 2023 CONFERENCE ON INNOVATION AND TECHNOLOGY IN COMPUTER SCIENCE EDUCATION, ITICSE 2023, VOL. 2, 2023, : 665 - 665
  • [7] Framework for evaluating code generation ability of large language models
    Yeo, Sangyeop
    Ma, Yu-Seung
    Kim, Sang Cheol
    Jun, Hyungkook
    Kim, Taeho
    ETRI JOURNAL, 2024, 46 (01) : 106 - 117
  • [8] Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
    Riddell, Martin
    Ni, Ansong
    Cohan, Arman
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 14116 - 14137
  • [9] Problem-Solving Efficiency and Cognitive Load for Adaptive Parsons Problems vs. Writing the Equivalent Code
    Haynes, Carl C.
    Ericson, Barbara J.
    CHI '21: PROCEEDINGS OF THE 2021 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, 2021
  • [10] Evaluating Mixed Integer Programming Models for Solving Stochastic Inventory Problems
    Bluemink, Bas
    de Kok, A. G.
    Srinivasan, Balan
    Uzsoy, Reha
    2019 WINTER SIMULATION CONFERENCE (WSC), 2019, : 1696 - 1707