Exploring the Boundaries Between LLM Code Clone Detection and Code Similarity Assessment on Human and AI-Generated Code

Cited by: 0

Authors:

Zhang, Zixian [1]

Saber, Takfarinas [2]

Affiliations:

[1] Univ Galway, Sch Comp Sci, CRT AI, Galway H91 TK33, Ireland

[2] Univ Galway, Sch Comp Sci, Lero, Galway H91 TK33, Ireland

Source:

BIG DATA AND COGNITIVE COMPUTING | 2025 / Vol. 9 / Issue 02

Funding:

Science Foundation Ireland;

Keywords:

code clone detection; code similarity; large language model; fine-tuning; LLM-generated code;

DOI:

10.3390/bdcc9020041

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

As Large Language Models (LLMs) continue to advance, their capabilities in code clone detection have garnered significant attention. While much research has assessed LLM performance on human-generated code, the proliferation of LLM-generated code raises critical questions about their ability to detect clones across both human- and LLM-created codebases, as this capability remains largely unexplored. This paper addresses this gap by evaluating two versions of LLaMA3 on these distinct types of datasets. Additionally, we perform a deeper analysis beyond simple prompting, examining the nuanced relationship between code cloning and code similarity that LLMs infer. We further explore how fine-tuning impacts LLM performance in clone detection, offering new insights into the interplay between code clones and similarity in human versus AI-generated code. Our findings reveal that LLaMA models excel in detecting syntactic clones but face challenges with semantic clones. Notably, the models perform better on LLM-generated datasets for semantic clones, suggesting a potential bias. The fine-tuning technique enhances the ability of LLMs to comprehend code semantics, improving their performance in both code clone detection and code similarity assessment. Our results offer valuable insights into the effectiveness and characteristics of LLMs in clone detection and code similarity assessment, providing a foundation for future applications and guiding further research in this area.


Pages: 19

29 items in total

[1] DeVAIC: A tool for security assessment of AI-generated code
Cotroneo, Domenico
De Luca, Roberta
Liguori, Pietro
INFORMATION AND SOFTWARE TECHNOLOGY, 2025, 177
[2] AI-Generated Code Not Considered Harmful
Kendon, Tyson
Wu, Leanne
Aycock, John
PROCEEDINGS OF THE 25TH WESTERN CANADIAN CONFERENCE ON COMPUTING EDUCATION, 2023
[3] Navigating (in)security of AI-generated code
Ambati, Sri Haritha
Ridley, Norah
Branca, Enrico
Stakhanova, Natalia
2024 IEEE INTERNATIONAL CONFERENCE ON CYBER SECURITY AND RESILIENCE, CSR, 2024, : 30 - 37
[4] Automating the correctness assessment of AI-generated code for security contexts
Cotroneo, Domenico
Foggia, Alessio
Improta, Cristina
Liguori, Pietro
Natella, Roberto
JOURNAL OF SYSTEMS AND SOFTWARE, 2024, 216
[5] Validating AI-Generated Code with Live Programming
Ferdowsi, Kasra
Huang, Ruanqianqian
James, Michael B.
Polikarpova, Nadia
Lerner, Sorin
PROCEEDINGS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI 2024), 2024
[6] EX-CODE: A Robust and Explainable Model to Detect AI-Generated Code
Bulla, Luana
Midolo, Alessandro
Mongiovi, Misael
Tramontana, Emiliano
INFORMATION, 2024, 15 (12)
[7] Poisoning Programs by Un-Repairing Code: Security Concerns of AI-generated Code
Improta, Cristina
2023 IEEE 34TH INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING WORKSHOPS, ISSREW, 2023, : 128 - 131
[8] Creating Thorough Tests for AI-Generated Code is Hard
Singhal, Shreya
Kumar, Viraj
PROCEEDINGS OF THE 16TH ANNUAL ACM INDIA COMPUTE CONFERENCE, COMPUTE 2023, 2023, : 108 - 111
[9] A Quantitative Analysis of Quality and Consistency in AI-generated Code
Clark, Autumn
Igbokwe, Daniel
Ross, Samantha
Zibran, Minhaz F.
2024 7TH INTERNATIONAL CONFERENCE ON SOFTWARE AND SYSTEM ENGINEERING, ICOSSE 2024, 2024, : 37 - 41
[10] A Comparative Analysis between AI Generated Code and Human Written Code: A Preliminary Study
Patel, Abhi
Sultana, Kazi Zakia
Samanthula, Bharath K.
Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024, 2024, : 7521 - 7529
