SCC++: Predicting the programming language of questions and snippets of Stack Overflow

Cited by: 12
Authors
Alrashedy, Kamel [1]
Dharmaretnam, Dhanush [1]
German, Daniel M. [1]
Srinivasan, Venkatesh [1]
Gulliver, T. Aaron [2]
Affiliations
[1] Univ Victoria, Dept Comp Sci, Victoria, BC V8W 2Y2, Canada
[2] Univ Victoria, Dept Elect & Comp Engn, Victoria, BC V8W 2Y2, Canada
Keywords
Classification; Machine learning; Natural language processing; Programming languages
DOI
10.1016/j.jss.2019.110505
Chinese Library Classification
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted on Stack Overflow usually contain a code snippet. Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the programming language of source code files. However, determining the programming language of a code snippet or a few lines of source code is still a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we design and evaluate Source Code Classification (SCC++), a classifier that can identify the programming language of a question posted on Stack Overflow. The classifier achieves an accuracy of 88.9% in classifying programming languages by combining features from the title, body and the code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 78.9%. Finally, we propose a classifier of code snippets only that achieves an accuracy of 78.1%. These results show that deploying Machine Learning techniques on the combination of text and code snippets of a question provides the best performance. In addition, the classifier can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version such as C# 3.0, C# 4.0 and C# 5.0. © 2019 Elsevier Inc. All rights reserved.
Pages: 11
Related papers
36 in total
  • [1] Predicting Questions' Scores on Stack Overflow
    Alharthi, Haifa
    Outioua, Djedjiga
    Baysal, Olga
    2016 IEEE/ACM 3RD INTERNATIONAL WORKSHOP ON CROWDSOURCING IN SOFTWARE ENGINEERING (CSI-SE), 2016, : 1 - 7
  • [2] Predicting the Programming Language: Extracting Knowledge from Stack Overflow Posts
    Baquero, Juan F.
    Camargo, Jorge E.
    Restrepo-Calle, Felipe
    Aponte, Jairo H.
    Gonzalez, Fabio A.
    ADVANCES IN COMPUTING, CCC 2017, 2017, 735 : 199 - 210
  • [3] Predicting Tags for Learner Questions on Stack Overflow
    Olatinwo, Segun O.
    Epp, Carrie Demmans
    INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE IN EDUCATION, 2024,
  • [4] A Study of C/C++ Code Weaknesses on Stack Overflow
    Zhang, Haoxiang
    Wang, Shaowei
    Li, Heng
    Chen, Tse-Hsun
    Hassan, Ahmed E.
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2021, 48 (07) : 2359 - 2375
  • [5] A Methodology for Detecting Programming Languages in Stack Overflow Questions
    Swaraj, Aman
    Kumar, Sandeep
    PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON SOFTWARE TECHNOLOGIES (ICSOFT), 2022, : 478 - 483
  • [6] Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC
    Ozturk, Muhammed Maruf
    EAI ENDORSED TRANSACTIONS ON SCALABLE INFORMATION SYSTEMS, 2022, 10 (01)
  • [7] Programming Language Identification in Stack Overflow Post Snippets with Regex Based Tf-Idf Vectorization over ANN
    Swaraj, Aman
    Kumar, Sandeep
    PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON EVALUATION OF NOVEL APPROACHES TO SOFTWARE ENGINEERING, ENASE 2023, 2023, : 648 - 655
  • [9] The reproducibility of programming-related issues in Stack Overflow questions
    Mondal, Saikat
    Rahman, Mohammad Masudur
    Roy, Chanchal K.
    Schneider, Kevin
    EMPIRICAL SOFTWARE ENGINEERING, 2022, 27 (03)
  • [10] CRN++: Molecular programming language
    Vasic, Marko
    Soloveichik, David
    Khurshid, Sarfraz
    NATURAL COMPUTING, 2020, 19 (02) : 391 - 407