Statistical Unigram Analysis for Source Code Repository

Cited by: 12

Authors:

Xu, Weifeng [1]

Xu, Dianxiang [1]

El Ariss, Omar [2]

Liu, Yunkai [3]

Alatawi, Abdulrahman [1]

Affiliations:

[1] Bowie State Univ, Dept Comp Sci, Bowie, MD 20715 USA

[2] Penn State Univ Harrisburg, Dept Comp Sci, Middletown, PA USA

[3] Gannon Univ, Dept Comp & Informat Sci, Erie, PA USA

Source:

2017 IEEE THIRD INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM 2017) | 2017

Keywords:

programming language; source code; n-gram; unigram; abbreviations; ultra-large-scale analysis;

DOI:

10.1109/BigMM.2017.13

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

The unigram is the fundamental element of the n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that the unigrams collected from source code repositories are essential resources for solving domain-specific problems.
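To make the idea in the abstract concrete, the following is a minimal sketch of ranking candidate expansions for an abbreviation by unigram frequency. It is not the authors' model: the helper names (`split_identifier`, `is_subsequence`, `rank_expansions`), the subsequence-matching rule, the first-letter constraint, and the toy counts are all assumptions made for illustration only.

```python
import re
from collections import Counter


def split_identifier(identifier):
    """Split a snake_case or camelCase identifier into lowercase unigrams."""
    words = []
    # Non-letter characters (underscores, digits) act as separators.
    for part in re.split(r"[^A-Za-z]+", identifier):
        # Match acronym runs (e.g. "HTTP") or capitalized/lowercase words.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def is_subsequence(abbr, word):
    """Return True if the letters of abbr appear in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbr)


def rank_expansions(abbr, unigram_counts):
    """Rank candidate expansions by relative unigram frequency.

    Assumption: a candidate is longer than the abbreviation, shares its
    first letter, and contains the abbreviation as a subsequence.
    """
    candidates = {
        word: count
        for word, count in unigram_counts.items()
        if len(word) > len(abbr)
        and word.startswith(abbr[:1])
        and is_subsequence(abbr, word)
    }
    total = sum(candidates.values())
    scored = ((word, count / total) for word, count in candidates.items())
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical unigram counts; a real study would derive them from a corpus.
toy_counts = Counter({"message": 6, "messaging": 2, "manager": 5, "msg": 4})
ranking = rank_expansions("msg", toy_counts)
```

In this toy setup, "manager" is rejected because it contains no "s", and the remaining candidates are scored by their share of the matching counts. A corpus-scale version would replace `toy_counts` with counts built by applying `split_identifier` to identifiers extracted from source files.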

Pages: 1 / 8

Page count: 8

50 items in total

[1] Statistical Unigram Analysis for Source Code Repository
Xu, Weifeng
Xu, Dianxiang
Alatawi, Abdulrahman
El Ariss, Omar
Liu, Yunkai
INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2018, 12 (02) : 237 - 260
[2] The OpenMP source code repository
Dorta, AJ
Rodríguez, C
de Sande, F
González-Escribano, A
13TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, PROCEEDINGS, 2005, : 244 - 250
[3] Bayesian Unigram-Based Inference for Expanding Abbreviations in Source Code
Alatawi, Abdulrahman
Xu, Weifeng
Xu, Dianxiang
2017 IEEE 29TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2017), 2017, : 543 - 550
[4] A NEW SOURCE CODE REPOSITORY FOR DYNAMIC STORING, BROWSING, AND RETRIEVAL OF SOURCE CODES
Chakraborty, Prithwi Raj
Chowdhury, Sujan
Chowdhury, Alok Kumar
Al Hasan, Shahed
2013 INTERNATIONAL CONFERENCE ON INFORMATICS, ELECTRONICS & VISION (ICIEV), 2013,
[5] Source Code Features and their Dependencies: An Aggregative Statistical Analysis on Open-Source Java Software Systems
Toosi, Farshad Ghassemi
APPLIED COMPUTER SYSTEMS, 2023, 28 (02) : 221 - 231
[6] Statistical Approach to Increase Source Code Completion Accuracy
Savchenko, Valeriy
Volkov, Alexander
PERSPECTIVES OF SYSTEM INFORMATICS, PSI 2017, 2018, 10742 : 352 - 363
[7] Application of Statistical Classifiers on Java Source Code
Mojzes, Matej
Rost, Michal
Smolka, Josef
Virius, Miroslav
PROCEEDINGS OF THE 2015 FEDERATED CONFERENCE ON SOFTWARE DEVELOPMENT AND OBJECT TECHNOLOGIES, 2017, 511 : 208 - 218
[8] CARMEN: Code analysis, Repository and Modeling for e-Neuroscience
Austin, Jim
Jackson, Tom
Fletcher, Martyn
Jessop, Mark
Liang, Bojian
Weeks, Mike
Smith, Leslie
Ingram, Colin
Watson, Paul
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS), 2011, 4 : 768 - 777
[9] Software Repository Analysis for Investigating Design-Code Compliance
Ozbas-Caglayan, Kadriye
Dogru, Ali H.
2013 JOINT CONFERENCE OF THE 23RD INTERNATIONAL WORKSHOP ON SOFTWARE MEASUREMENT AND THE 2013 EIGHTH INTERNATIONAL CONFERENCE ON SOFTWARE PROCESS AND PRODUCT MEASUREMENT (IWSM-MENSURA), 2013, : 231 - 233
[10] Source code analysis dataset
Gelman, Ben
Obayomi, Banjo
Moore, Jessica
Slater, David
Data in Brief, 2019, 27