Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection

Cited by: 1
Authors
Pinku, Subroto Nag [1]
Mondal, Debajyoti [1]
Roy, Chanchal K. [1]
Affiliations
[1] Univ Saskatchewan, Dept Comp Sci, Saskatoon, SK, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
Code Clone Detection; Cross-Language Clones; Data Augmentation; Deep Learning; Graph Matching Networks;
DOI
10.1109/ICPC58990.2023.00031
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Software clones are often introduced when developers reuse code fragments to implement similar functionality in the same or different software systems, resulting in duplicated fragments, or code clones, in those systems. Because clones adversely affect software maintenance, a great many tools and techniques have appeared in the literature to detect them. Many high-performing clone detection tools today are based on deep learning and are mostly used to detect clones written in the same programming language, although tools for detecting cross-language clones are also emerging rapidly. The popularity of deep learning-based clone detection tools creates an opportunity to investigate how known strategies that boost the performance of deep learning models could be further leveraged to improve these tools. In this paper, we investigate one such strategy, data augmentation, which, unlike in single-language clone detection, has not yet been explored for cross-language clone detection. We show how existing knowledge of transcompilers (source-to-source translators) can be used for data augmentation to boost the performance of cross-language clone detection models, as well as to adapt single-language clone detection models into cross-language clone detection pipelines. To demonstrate the performance boost for cross-language clone detection through data augmentation, we exploit Transcoder, a pre-trained source-to-source translator. To show how to extend single-language models for cross-language clone detection, we extend a popular single-language model, Graph Matching Network (GMN), in combination with transcompilers and code parsers (srcML). We evaluated our models on popular benchmark datasets. Our experimental results showed improvements in F1 scores (sometimes up to 3%) for cutting-edge cross-language clone detection models. Even when extending GMN for cross-language clone detection, the models built with data augmentation outperformed the baseline, with scores of 0.90, 0.92, and 0.91 for precision, recall, and F1 score, respectively.
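As a quick consistency check on the reported GMN figures: F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two values. The sketch below is illustrative only; the function name `f1_score` is our own and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the extended GMN model.
f1 = f1_score(0.90, 0.92)
print(round(f1, 2))  # 0.91, matching the reported F1 score
```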
Pages: 169-180
Page count: 12