SCC++: Predicting the programming language of questions and snippets of Stack Overflow

Cited by: 12
Authors
Alrashedy, Kamel [1]
Dharmaretnam, Dhanush [1]
German, Daniel M. [1]
Srinivasan, Venkatesh [1]
Gulliver, T. Aaron [2]
Affiliations
[1] Univ Victoria, Dept Comp Sci, Victoria, BC V8W 2Y2, Canada
[2] Univ Victoria, Dept Elect & Comp Engn, Victoria, BC V8W 2Y2, Canada
Keywords
Classification; Machine learning; Natural language processing; Programming languages
DOI
10.1016/j.jss.2019.110505
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted on Stack Overflow usually contain a code snippet. Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the programming language of source code files. However, determining the programming language of a code snippet or a few lines of source code is still a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we design and evaluate Source Code Classification (SCC++), a classifier that can identify the programming language of a question posted on Stack Overflow. The classifier achieves an accuracy of 88.9% in classifying programming languages by combining features from the title, body and code snippets of the question. We also propose a classifier that uses only the title and body of the question and has an accuracy of 78.9%. Finally, we propose a classifier of code snippets only that achieves an accuracy of 78.1%. These results show that deploying Machine Learning techniques on the combination of text and code snippets of a question provides the best performance. In addition, the classifier can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version, such as C# 3.0, C# 4.0 and C# 5.0. © 2019 Elsevier Inc. All rights reserved.
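The idea of combining features from the title, body and code snippet can be illustrated with a small sketch. The abstract does not name the learning algorithm or the feature extraction used in SCC++, so everything below is an assumption made for illustration: a multinomial Naive Bayes model with add-one smoothing stands in for the classifier, the tokenizer and the `featurize` and `NaiveBayes` names are hypothetical, and the four training questions are invented toy data rather than Stack Overflow posts.

```python
import math
import re
from collections import Counter, defaultdict

# Words and single punctuation marks; digits are ignored (an assumption).
TOKEN = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")

def featurize(title, body, snippet):
    """Tokenize each field and prefix tokens so text and code stay distinct."""
    feats = []
    for prefix, text in (("t", title), ("b", body), ("c", snippet)):
        feats += [f"{prefix}:{tok.lower()}" for tok in TOKEN.findall(text)]
    return feats

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, samples, labels):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for feats, label in zip(samples, labels):
            self.token_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                if f in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Invented toy questions: (title, body, snippet) -> language.
train = [
    (("How to reverse a list", "I want to reverse a list in place",
      "def rev(xs):\n    return xs[::-1]"), "python"),
    (("Print a list", "printing elements",
      "for x in xs:\n    print(x)"), "python"),
    (("Null pointer when reading array", "My program crashes",
      "public static void main(String[] args) { System.out.println(a[0]); }"), "java"),
    (("String compare", "compare two strings",
      'String s = new String("a"); if (s.equals(t)) { }'), "java"),
]
model = NaiveBayes().fit([featurize(*x) for x, _ in train], [y for _, y in train])
prediction = model.predict(
    featurize("Loop question", "how do I loop",
              "def f(xs):\n    for x in xs:\n        print(x)")
)
print(prediction)
```

Passing an empty string for the snippet, or for the title and body, gives a rough analogue of the text-only and snippet-only variants mentioned in the abstract, though the accuracy figures reported there apply only to the authors' own models and data.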
Pages: 11