Exploring the Boundaries Between LLM Code Clone Detection and Code Similarity Assessment on Human and AI-Generated Code

Cited: 0
Authors
Zhang, Zixian [1 ]
Saber, Takfarinas [2 ]
Affiliations
[1] Univ Galway, Sch Comp Sci, CRT AI, Galway H91 TK33, Ireland
[2] Univ Galway, Sch Comp Sci, Lero, Galway H91 TK33, Ireland
Funding
Science Foundation Ireland;
Keywords
code clone detection; code similarity; large language model; fine-tuning; LLM-generated code;
DOI
10.3390/bdcc9020041
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As Large Language Models (LLMs) continue to advance, their capabilities in code clone detection have garnered significant attention. While much research has assessed LLM performance on human-generated code, the proliferation of LLM-generated code raises critical questions about their ability to detect clones across both human- and LLM-created codebases, a capability that remains largely unexplored. This paper addresses this gap by evaluating two versions of LLaMA3 on these distinct types of datasets. Additionally, we perform a deeper analysis beyond simple prompting, examining the nuanced relationship between code cloning and the code similarity that LLMs infer. We further explore how fine-tuning impacts LLM performance in clone detection, offering new insights into the interplay between code clones and similarity in human- versus AI-generated code. Our findings reveal that LLaMA models excel at detecting syntactic clones but struggle with semantic clones. Notably, the models perform better on LLM-generated datasets for semantic clones, suggesting a potential bias. Fine-tuning enhances the ability of LLMs to comprehend code semantics, improving their performance in both code clone detection and code similarity assessment. Our results offer valuable insights into the effectiveness and characteristics of LLMs in clone detection and code similarity assessment, providing a foundation for future applications and guiding further research in this area.
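The abstract contrasts two tasks: clone detection (a yes/no judgment, typically elicited by prompting the LLM) and code similarity assessment (a graded score). As a minimal sketch of that distinction — the prompt template and the token-level Jaccard baseline below are illustrative assumptions, not the paper's actual setup:

```python
import re

def build_clone_prompt(code_a: str, code_b: str) -> str:
    """Build a yes/no clone-detection prompt for a pair of code fragments."""
    return (
        "Are the following two code fragments clones, i.e. do they "
        "implement the same functionality? Answer Yes or No.\n\n"
        f"Fragment A:\n{code_a}\n\nFragment B:\n{code_b}\n"
    )

def token_jaccard(code_a: str, code_b: str) -> float:
    """Crude lexical similarity: Jaccard overlap of word-level tokens."""
    a = set(re.findall(r"\w+", code_a))
    b = set(re.findall(r"\w+", code_b))
    return len(a & b) / len(a | b) if a | b else 1.0

# A syntactic clone pair shares most tokens; a semantic clone pair may
# share very few, which is why purely lexical similarity tends to miss
# the semantic clones the abstract highlights.
syntactic_pair = ("def add(x, y): return x + y",
                  "def add(a, b): return a + b")
semantic_pair = ("def add(x, y): return x + y",
                 "def total(nums): return sum(nums[:2])")

print(round(token_jaccard(*syntactic_pair), 2))  # higher overlap
print(round(token_jaccard(*semantic_pair), 2))   # lower overlap
```

The gap between the two scores illustrates why the paper probes the relationship between inferred similarity and clone labels rather than relying on surface similarity alone.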
Pages: 19