Statistical Unigram Analysis for Source Code Repository

Cited by: 12
Authors
Xu, Weifeng [1]
Xu, Dianxiang [1]
El Ariss, Omar [2]
Liu, Yunkai [3]
Alatawi, Abdulrahman [1]
Affiliations
[1] Bowie State Univ, Dept Comp Sci, Bowie, MD 20715 USA
[2] Penn State Univ Harrisburg, Dept Comp Sci, Middletown, PA USA
[3] Gannon Univ, Dept Comp & Informat Sci, Erie, PA USA
Keywords
programming language; source code; n-gram; unigram; abbreviations; ultra-large-scale analysis;
DOI
10.1109/BigMM.2017.13
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
The unigram is the fundamental element of the n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that unigrams collected from source code repositories are essential resources for solving domain-specific problems.
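The abstract does not give the details of the probabilistic model, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that unigrams are obtained by splitting identifiers on underscores and camelCase boundaries, that an abbreviation is an ordered subsequence of its expansion sharing the same first letter, and that candidates are ranked by relative unigram frequency. The function names and the sample identifiers are hypothetical.

```python
import re
from collections import Counter


def split_identifier(identifier):
    """Split a snake_case or camelCase identifier into lowercase unigrams."""
    words = []
    # Underscores, digits, and other non-letters are treated as separators.
    for part in re.split(r"[_\W\d]+", identifier):
        # Keep acronyms together (HTTPServer -> HTTP, Server).
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
    return [w.lower() for w in words]


def count_unigrams(identifiers):
    """Count how often each unigram occurs across all identifiers."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_identifier(identifier))
    return counts


def is_subsequence(abbr, word):
    """Return True if the letters of abbr appear in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbr)


def expand(abbr, counts, top_k=3):
    """Rank candidate expansions by relative frequency (assumed model)."""
    abbr = abbr.lower()
    if not abbr:
        return []
    candidates = {
        word: n
        for word, n in counts.items()
        if len(word) > len(abbr)
        and word[0] == abbr[0]
        and is_subsequence(abbr, word)
    }
    total = sum(candidates.values())
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, n / total) for word, n in ranked[:top_k]]


# Hypothetical identifiers standing in for a mined repository.
identifiers = [
    "messageBuffer", "error_message", "msgCount",
    "parseMessage", "message_length", "massage",
]
counts = count_unigrams(identifiers)
print(expand("msg", counts))
```

On this toy input, "message" outranks "massage" for the abbreviation "msg" simply because it occurs more often among the identifiers. A corpus on the scale described in the abstract would supply far more reliable frequency estimates.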
Pages: 1-8
Number of pages: 8