CCStokener: Fast yet accurate code clone detection with semantic token

Cited by: 6
Authors
Wang, Wenjie [1, 2]
Deng, Zihan [1, 2]
Xue, Yinxing [1]
Xu, Yun [1, 2]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Peoples R China
[2] Key Lab High Performance Comp Anhui Prov, Hefei 230027, Peoples R China
Funding
National Science Foundation (United States);
Keywords
Code clone detection; Semantic token; Near-miss clones; Scalable detection;
DOI
10.1016/j.jss.2023.111618
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202; 0835;
Abstract
Code clone detection refers to the discovery of identical or similar code fragments in a code repository. AST-based, PDG-based, and DL-based tools can achieve good results on detecting near-miss clones (i.e., clones with small differences or gaps) by using syntax and semantic information, but they are difficult to apply to large code repositories due to high time complexity. Traditional token-based tools can rapidly detect clones by using a low-cost index (i.e., low-frequency or k-lines tokens) on sequential source code, but most of them are poor at detecting near-miss clones because they lack semantic information. In this study, we propose a fast yet accurate code clone detection tool with the semantic token, called CCStokener. The idea behind the semantic token is to enhance the detection capability of token-based tools by complementing the traditional token with semantic information, such as the structural information around the token and its dependency on other tokens, in the form of n-grams. Specifically, we extract the types of the relevant nodes in the AST path of every token and transform these types into a fixed-dimensional vector, then model the token's semantic information by applying n-grams to its related tokens. Meanwhile, our tool adopts and improves the location-filtration-verification process also used in CCAligner and LVMapper, during which we build a low-cost k-tokens index to quickly locate candidate code blocks and speed up detection. Our experiments show that CCStokener achieves excellent accuracy and detects more near-miss clone pairs: it exhibits the best recall on Moderately Type-3 clones and detects more true positive clones on four Java open-source projects. Moreover, CCStokener attains the best generalization and transferability compared with two DL-based tools (i.e., ASTNN and TBCCD). © 2023 Elsevier Inc. All rights reserved.
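The location and verification steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it uses plain lexical tokens rather than semantic tokens, and Jaccard similarity over token sets is only a stand-in for the paper's verification step. All function names, the value k=3, and the 0.6 threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def k_token_windows(tokens, k):
    """Yield every run of k consecutive tokens as a tuple."""
    for i in range(len(tokens) - k + 1):
        yield tuple(tokens[i:i + k])


def build_index(blocks, k):
    """Map each k-token window to the set of block ids containing it."""
    index = defaultdict(set)
    for block_id, tokens in blocks.items():
        for window in k_token_windows(tokens, k):
            index[window].add(block_id)
    return index


def locate_candidates(index):
    """Location step: blocks sharing at least one window form a candidate pair."""
    candidates = set()
    for block_ids in index.values():
        for a, b in combinations(sorted(block_ids), 2):
            candidates.add((a, b))
    return candidates


def overlap_similarity(tokens_a, tokens_b):
    """Verification stand-in (assumption): Jaccard similarity of token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def detect_clones(blocks, k=3, threshold=0.6):
    """Return sorted candidate pairs whose similarity reaches the threshold."""
    index = build_index(blocks, k)
    return sorted(
        pair
        for pair in locate_candidates(index)
        if overlap_similarity(blocks[pair[0]], blocks[pair[1]]) >= threshold
    )


# Hypothetical pre-tokenized code blocks: f1 and f2 differ only in one identifier.
example_blocks = {
    "f1": ["int", "x", "=", "a", "+", "b", ";", "return", "x", ";"],
    "f2": ["int", "y", "=", "a", "+", "b", ";", "return", "y", ";"],
    "f3": ["while", "(", "i", "<", "n", ")", "i", "++", ";"],
}

if __name__ == "__main__":
    print(detect_clones(example_blocks))
```

Only blocks that share an indexed window are ever compared, which is what keeps the approach cheap on large repositories; the third block is never paired with the others because it shares no 3-token window with them.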
Pages: 16