Statistical Unigram Analysis for Source Code Repository

Cited by: 12
Authors
Xu, Weifeng [1]
Xu, Dianxiang [1]
El Ariss, Omar [2]
Liu, Yunkai [3]
Alatawi, Abdulrahman [1]
Affiliations
[1] Bowie State Univ, Dept Comp Sci, Bowie, MD 20715 USA
[2] Penn State Univ Harrisburg, Dept Comp Sci, Middletown, PA USA
[3] Gannon Univ, Dept Comp & Informat Sci, Erie, PA USA
Keywords
programming language; source code; n-gram; unigram; abbreviations; ultra-large-scale analysis
DOI
10.1109/BigMM.2017.13
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
The unigram is the fundamental element of the n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that the unigrams collected from source code repositories are an essential resource for solving domain-specific problems.
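The abstract does not spell out the probabilistic model, so the following is only a minimal sketch of one way unigram counts could rank expansions of an abbreviation. It assumes that a candidate word must start with the same letter as the abbreviation and contain its letters in order, and that a candidate's probability is proportional to its unigram count. The function names and the toy counts are hypothetical and are not taken from the paper.

```python
from collections import Counter


def is_subsequence(abbr: str, word: str) -> bool:
    """Return True if the letters of abbr appear in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbr)


def expand(abbr: str, unigram_counts: Counter) -> list:
    """Rank candidate expansions of abbr by relative unigram frequency.

    Assumption (not from the paper): a candidate is any longer word that
    shares the first letter and contains the abbreviation as a subsequence.
    """
    abbr = abbr.lower()
    if not abbr:
        return []
    candidates = {
        word: count
        for word, count in unigram_counts.items()
        if len(word) > len(abbr)
        and word[0] == abbr[0]
        and is_subsequence(abbr, word)
    }
    total = sum(candidates.values())
    if total == 0:
        return []
    # Most frequent candidate first; ties broken alphabetically.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count / total) for word, count in ranked]


# Toy counts invented for illustration only (not data from the paper).
toy_counts = Counter(
    {"string": 900, "stream": 300, "start": 500, "str": 1200,
     "count": 700, "context": 350}
)

for word, prob in expand("str", toy_counts):
    print(f"{word}: {prob:.3f}")
```

With real data, the counts would come from unigrams mined from source code rather than from a natural language corpus, which is the distinction the abstract draws.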
Pages: 1-8
Page count: 8