CCStokener: Fast yet accurate code clone detection with semantic token

Cited by: 6
Authors
Wang, Wenjie [1, 2]
Deng, Zihan [1, 2]
Xue, Yinxing [1]
Xu, Yun [1, 2]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Peoples R China
[2] Key Lab High Performance Comp Anhui Prov, Hefei 230027, Peoples R China
Funding
US National Science Foundation;
Keywords
Code clone detection; Semantic token; Near-miss clones; Scalable detection;
DOI
10.1016/j.jss.2023.111618
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Code clone detection is the discovery of identical or similar code fragments in a code repository. AST-based, PDG-based, and DL-based tools achieve good results on near-miss clones (i.e., clones with small differences or gaps) by using syntactic and semantic information, but their high time complexity makes them difficult to apply to large code repositories. Traditional token-based tools detect clones rapidly by building a low-cost index (i.e., low-frequency or k-line tokens) over sequential source code, but most of them detect near-miss clones poorly because they lack semantic information.

In this study, we propose CCStokener, a fast yet accurate code clone detection tool based on the semantic token. The idea behind the semantic token is to enhance the detection capability of token-based tools by complementing the traditional token with semantic information, such as the structural information around the token and its dependencies on other tokens, in the form of n-grams. Specifically, we extract the types of the relevant nodes on the AST path of every token and transform these types into a fixed-dimensional vector, then model the token's semantic information by applying n-grams to its related tokens. Our tool also adopts and improves the location-filtration-verification process used in CCAligner and LVMapper, during which we build a low-cost k-tokens index to quickly locate candidate code blocks and speed up detection. Our experiments show that CCStokener detects more near-miss clone pairs with high accuracy: it exhibits the best recall on Moderately Type-3 clones and detects more true positive clones on four open-source Java projects. Moreover, CCStokener attains the best generalization and transferability compared with two DL-based tools (i.e., ASTNN and TBCCD).

© 2023 Elsevier Inc. All rights reserved.
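The structural half of the semantic token described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python's standard `ast` module on Python source (the paper evaluates on Java), the node-type vocabulary `NODE_TYPES` and the function `semantic_tokens` are hypothetical names chosen for illustration, and the n-gram modelling and the k-tokens index are omitted. Each identifier is mapped to a fixed-dimensional vector that counts the node types found on the AST paths leading to its occurrences.

```python
import ast

# Hypothetical fixed vocabulary of AST node types; each entry is one vector dimension.
NODE_TYPES = ("FunctionDef", "For", "While", "If", "Assign", "AugAssign", "Return", "Call")


def semantic_tokens(source):
    """Map each identifier to a count vector over the node types on its AST paths."""
    vectors = {}

    def visit(node, path):
        # `path` holds the type names of all ancestors of `node`.
        if isinstance(node, ast.Name):
            vec = vectors.setdefault(node.id, [0] * len(NODE_TYPES))
            for type_name in path:
                if type_name in NODE_TYPES:
                    vec[NODE_TYPES.index(type_name)] += 1
        for child in ast.iter_child_nodes(node):
            visit(child, path + [type(node).__name__])

    visit(ast.parse(source), [])
    return vectors


# Two fragments with the same structure but different identifier names.
SRC_A = "def f(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n"
SRC_B = "def g(items):\n    acc = 0\n    for it in items:\n        acc += it\n    return acc\n"

if __name__ == "__main__":
    print(semantic_tokens(SRC_A))
    print(semantic_tokens(SRC_B))
```

Because the vectors depend on the surrounding structure and not on the identifier text, the two renamed fragments produce the same set of vectors, which is the kind of signal a plain token comparison would miss.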
Pages: 16