IdBench: Evaluating Semantic Representations of Identifier Names in Source Code

Cited by: 19
Authors
Wainakh, Yaza [1]
Rauf, Moiz [2]
Pradel, Michael [2]
Affiliations
[1] Tech Univ Darmstadt, Dept Comp Sci, Darmstadt, Germany
[2] Univ Stuttgart, Dept Comp Sci, Stuttgart, Germany
Funding
European Research Council;
Keywords
source code; neural networks; embeddings; identifiers; benchmark; SEARCH;
DOI
10.1109/ICSE43902.2021.00059
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 509 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.
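Illustrative sketch (not taken from the paper): the kind of evaluation the abstract describes can be pictured as scoring identifier pairs with some similarity function and checking how well those scores agree with developer ratings. The code below assumes a normalized Levenshtein similarity as one example of a lexical string distance function and Spearman rank correlation as the agreement measure; both choices, all function names, and the ratings in HYPOTHETICAL_RATINGS are assumptions made for illustration, not IdBench data or the authors' implementation.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lexical_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def average_ranks(values: Sequence[float]) -> list:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up similarity ratings in [0, 1]; NOT taken from the benchmark.
HYPOTHETICAL_RATINGS = [
    ("len", "size", 0.9),
    ("index", "idx", 0.95),
    ("minimum", "maximum", 0.1),
    ("rows", "columns", 0.3),
]

if __name__ == "__main__":
    predicted = [lexical_similarity(a, b) for a, b, _ in HYPOTHETICAL_RATINGS]
    human = [rating for _, _, rating in HYPOTHETICAL_RATINGS]
    for (a, b, rating), score in zip(HYPOTHETICAL_RATINGS, predicted):
        print(f"{a!r} vs {b!r}: lexical={score:.2f} rated={rating:.2f}")
    print(f"Spearman correlation: {spearman(predicted, human):.2f}")
```

Under these assumptions, a purely lexical measure scores minimum and maximum as close because they differ in only two characters, while len and size share no aligned characters at all. This is the kind of mismatch with perceived similarity that the abstract points to for identifiers with opposing meanings.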
Pages: 562 - 573
Number of pages: 12
Related papers
50 in total
  • [1] Descriptive Compound Identifier Names Improve Source Code Comprehension
    Schankin, Andrea
    Berger, Annika
    Holt, Daniel, V
    Hofmeister, Johannes C.
    Riedel, Till
    Beigl, Michael
    2018 IEEE/ACM 26TH INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC 2018), 2018, : 31 - 40
  • [2] Semantic Similarity Metrics for Evaluating Source Code Summarization
    Haque, Sakib
    Eberhart, Zachary
    Bansal, Aakash
    McMillan, Collin
    arXiv, 2022,
  • [3] Semantic Similarity Metrics for Evaluating Source Code Summarization
    Haque, Sakib
    Eberhart, Zachary
    Bansal, Aakash
    McMillan, Collin
    30TH IEEE/ACM INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC 2022), 2022, : 36 - 47
  • [4] Semantic Similarity Metrics for Evaluating Source Code Summarization
    Haque, Sakib
    Eberhart, Zachary
    Bansal, Aakash
    McMillan, Collin
    IEEE International Conference on Program Comprehension, 2022, 2022-March : 36 - 47
  • [5] Dealing with Faults in Source Code: Abbreviated vs. Full-Word Identifier Names
    Scanniello, Giuseppe
    Risi, Michele
    2013 29TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE (ICSM), 2013, : 190 - 199
  • [6] Developing semantic representations for proper names
    Burgess, C
    Conley, P
    PROCEEDINGS OF THE TWENTIETH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, 1998, : 185 - 190
  • [7] Exploring the influence of identifier names on code quality: An empirical study
    Centre for Research in Computing, Open University, Milton Keynes, United Kingdom
    Proc. Eur. Conf. Software Maint. Reeng., 1600, (156-165):
  • [8] Exploring the Influence of Identifier Names on Code Quality: an empirical study
    Butler, Simon
    Wermelinger, Michel
    Yu, Yijun
    Sharp, Helen
    14TH EUROPEAN CONFERENCE ON SOFTWARE MAINTENANCE AND REENGINEERING (CSMR 2010), 2010, : 156 - 165
  • [9] Fixing Faults in C and Java Source Code: Abbreviated vs. Full-Word Identifier Names
    Scanniello, Giuseppe
    Risi, Michele
    Tramontana, Porfirio
    Romano, Simone
    ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 2017, 26 (02)
  • [10] Effect of Identifier Tokenization on Automatic Source Code Documentation
    Sawan Rai
    Ramesh Chandra Belwal
    Atul Gupta
    Arabian Journal for Science and Engineering, 2022, 47 : 2141 - 2157