CCStokener: Fast yet accurate code clone detection with semantic token

Cited by: 6
Authors
Wang, Wenjie [1, 2]
Deng, Zihan [1, 2]
Xue, Yinxing [1]
Xu, Yun [1, 2]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Peoples R China
[2] Key Lab High Performance Comp Anhui Prov, Hefei 230027, Peoples R China
Funding
U.S. National Science Foundation;
Keywords
Code clone detection; Semantic token; Near-miss clones; Scalable detection;
DOI
10.1016/j.jss.2023.111618
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Code clone detection refers to the discovery of identical or similar code fragments in a code repository. AST-based, PDG-based, and DL-based tools can achieve good results in detecting near-miss clones (i.e., clones with small differences or gaps) by using syntactic and semantic information, but they are difficult to apply to large code repositories because of their high time complexity. Traditional token-based tools can rapidly detect clones by using a low-cost index (i.e., low-frequency or k-line tokens) over sequential source code, but most of them detect near-miss clones poorly because they lack semantic information. In this study, we propose a fast yet accurate code clone detection tool based on semantic tokens, called CCStokener. The idea behind the semantic token is to enhance the detection capability of token-based tools by complementing the traditional token with semantic information, such as the structural information around the token and its dependencies on other tokens in the form of n-grams. Specifically, we extract the types of the relevant nodes on the AST path of every token and transform these types into a fixed-dimensional vector, then model the token's semantic information by applying n-grams to its related tokens. Meanwhile, our tool adopts and improves the location-filtration-verification process also used in CCAligner and LVMapper, during which we build a low-cost k-tokens index to quickly locate candidate code blocks and speed up detection. Our experiments show that CCStokener achieves excellent accuracy and detects more near-miss clone pairs: it exhibits the best recall on Moderately Type-3 clones and detects more true-positive clones on four open-source Java projects. Moreover, CCStokener attains the best generalization and transferability compared with two DL-based tools (i.e., ASTNN and TBCCD). © 2023 Elsevier Inc. All rights reserved.
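As a rough illustration of the locate-then-verify idea mentioned in the abstract, the following Python sketch builds an inverted index over runs of k consecutive tokens, treats any two blocks that share a run as a candidate pair, and then verifies each candidate with a similarity score. This is not CCStokener's implementation: the function names, the window size `k=3`, the threshold `0.5`, the toy token sequences, and the use of Jaccard overlap as the verification metric are all assumptions made for illustration, and the filtration step and the semantic-token vectors described in the abstract are omitted.

```python
from collections import defaultdict
from itertools import combinations


def k_token_windows(tokens, k):
    """Yield every run of k consecutive tokens as a hashable tuple."""
    for i in range(len(tokens) - k + 1):
        yield tuple(tokens[i:i + k])


def build_index(blocks, k):
    """Map each k-token window to the set of block ids that contain it."""
    index = defaultdict(set)
    for block_id, tokens in blocks.items():
        for window in k_token_windows(tokens, k):
            index[window].add(block_id)
    return index


def candidate_pairs(index):
    """Locate: two blocks sharing at least one window form a candidate pair."""
    pairs = set()
    for block_ids in index.values():
        for a, b in combinations(sorted(block_ids), 2):
            pairs.add((a, b))
    return pairs


def jaccard(tokens_a, tokens_b):
    """Verify: Jaccard overlap of the token sets (a stand-in metric)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def detect(blocks, k=3, threshold=0.5):
    """Return the sorted candidate pairs whose similarity meets the threshold."""
    index = build_index(blocks, k)
    return sorted(
        pair for pair in candidate_pairs(index)
        if jaccard(blocks[pair[0]], blocks[pair[1]]) >= threshold
    )


# Hypothetical, already-normalized token sequences for three code blocks.
blocks = {
    "A": ["int", "id", "=", "id", "+", "num", ";", "return", "id", ";"],
    "B": ["int", "id", "=", "id", "+", "num", ";", "print", "(", "id", ")", ";"],
    "C": ["while", "(", "id", "<", "num", ")", "id", "++", ";"],
}

print(detect(blocks))
```

The index keeps the expensive pairwise comparison away from blocks that share no token run at all, which is the reason such an index is described as low-cost; only blocks A and B share a window here, so only that pair reaches the verification step.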
Pages: 16