Statistical Unigram Analysis for Source Code Repository

Cited by: 12
Authors
Xu, Weifeng [1]
Xu, Dianxiang [1]
El Ariss, Omar [2]
Liu, Yunkai [3]
Alatawi, Abdulrahman [1]
Affiliations
[1] Bowie State Univ, Dept Comp Sci, Bowie, MD 20715 USA
[2] Penn State Univ Harrisburg, Dept Comp Sci, Middletown, PA USA
[3] Gannon Univ, Dept Comp & Informat Sci, Erie, PA USA
Keywords
programming language; source code; n-gram; unigram; abbreviations; ultra-large-scale analysis;
DOI
10.1109/BigMM.2017.13
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Unigrams are the fundamental elements of n-grams in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that unigrams collected from source code repositories are essential resources for solving domain-specific problems.
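
The abstract does not spell out the probabilistic model, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that identifiers are split into unigrams on camelCase and underscore boundaries, that a candidate expansion is any longer word that starts with the same letter and contains the abbreviation's letters in order, and that candidates are ranked by their relative unigram frequency. The function names and the toy identifier list are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for identifiers mined from source code.
TOY_IDENTIFIERS = [
    "getMessage", "message_count", "sendMessage", "msgBuffer",
    "massage_data", "btnClick", "buttonLabel", "button_state",
]

def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase unigrams."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]

def build_unigram_counts(identifiers):
    """Count how often each unigram occurs across all identifiers."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_identifier(identifier))
    return counts

def is_subsequence(abbr, word):
    """Return True if the letters of abbr appear in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbr)

def expand(abbr, counts):
    """Rank candidate expansions of abbr by relative unigram frequency.

    Assumption: a candidate is longer than abbr, shares its first letter,
    and contains abbr as a subsequence. Returns (word, probability) pairs.
    """
    abbr = abbr.lower()
    if not abbr:
        return []
    candidates = {
        word: n for word, n in counts.items()
        if len(word) > len(abbr)
        and word[0] == abbr[0]
        and is_subsequence(abbr, word)
    }
    total = sum(candidates.values())
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    return [(word, n / total) for word, n in ranked]

if __name__ == "__main__":
    counts = build_unigram_counts(TOY_IDENTIFIERS)
    for abbreviation in ("msg", "btn"):
        print(abbreviation, expand(abbreviation, counts))
```

On this toy list, "msg" has two candidates ("message" and "massage"), and the more frequent one ranks first. A corpus on the scale the abstract describes would make such frequency estimates far more reliable than this small example suggests.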
Pages: 1-8
Page count: 8