Scalable Source Code Similarity Detection in Large Code Repositories

Authors
Alomari, Firas [1]
Harbi, Muhammed [1]
Affiliations
[1] Saudi Aramco, Corp Applicat Dept, Dhahran, Saudi Arabia
Keywords
clones; software similarity; Control Flow Graphs; Fingerprints; CLONE DETECTION; SYSTEM; ERP
DOI
10.4108/eai.13-7-2018.159353
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Source code similarity detection is increasingly used in application development to identify clones, isolate bugs, and find copyright violations. Similar code fragments can be very problematic because errors in the original code must be fixed in every copy, and other maintenance changes, such as extensions or patches, must be applied multiple times. Furthermore, the diversity of coding styles and the flexibility of modern languages make it difficult and cost-ineffective to manually inspect large code repositories. Therefore, detection is only feasible with automatic techniques. We present an efficient and scalable approach for identifying similar code fragments based on fingerprinting source code control flow graphs. The source code is processed to generate control flow graphs, which are then hashed to create a unique fingerprint of the code that captures semantic as well as syntactic similarity. The fingerprints can then be efficiently stored and retrieved to perform similarity search between code fragments. Experimental results from our prototype implementation support the validity of our approach and show its effectiveness and efficiency in comparison with other solutions.
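The pipeline outlined in the abstract (extract control flow, hash it into a fingerprint, store fingerprints for lookup) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes Python input, substitutes a nested control-structure skeleton taken from the `ast` module for a real control flow graph, and detects exact fingerprint matches only. The names `control_skeleton`, `fingerprint`, and `build_index` are hypothetical.

```python
import ast
import hashlib
from collections import defaultdict

# Assumption: these node types are what we treat as "control flow".
# Identifiers, literals, and plain statements are deliberately ignored.
CONTROL_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With,
                 ast.Return, ast.Break, ast.Continue, ast.Raise)


def control_skeleton(node):
    """Return a nested string describing only the control-flow shape of node.

    This is a simplified stand-in for a control flow graph: it records
    nesting and order of control statements, not actual graph edges.
    """
    parts = []
    for child in ast.iter_child_nodes(node):
        inner = control_skeleton(child)
        if isinstance(child, CONTROL_NODES):
            parts.append(f"{type(child).__name__}({inner})")
        elif inner:
            # Non-control node: keep any control structure nested inside it.
            parts.append(inner)
    return ",".join(parts)


def fingerprint(source):
    """Hash the control skeleton of a code fragment into a hex digest."""
    skeleton = control_skeleton(ast.parse(source))
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()


def build_index(fragments):
    """Map each fingerprint to the names of the fragments that produce it."""
    index = defaultdict(list)
    for name, source in fragments.items():
        index[fingerprint(source)].append(name)
    return index


# Hypothetical fragments: sum_pos and sum_even differ in identifiers and
# conditions but share the same loop/branch/return shape.
FRAGMENTS = {
    "sum_pos": "def f(xs):\n    t = 0\n    for x in xs:\n"
               "        if x > 0:\n            t += x\n    return t\n",
    "sum_even": "def g(items):\n    total = 0\n    for item in items:\n"
                "        if item % 2 == 0:\n            total += item\n"
                "    return total\n",
    "countdown": "def h(n):\n    while n > 0:\n        n -= 1\n    return n\n",
}

if __name__ == "__main__":
    for digest, names in build_index(FRAGMENTS).items():
        print(digest[:12], names)
```

Because the fingerprint is a fixed-length digest, lookup is a dictionary access regardless of fragment size, which is the property that makes this style of index attractive for large repositories. The sketch ignores the difference between a branch body and its `else` part and cannot find near-miss clones; a real system would need a proper graph representation and a similarity-tolerant hashing scheme.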
Pages: 1-11
Page count: 11