IdBench: Evaluating Semantic Representations of Identifier Names in Source Code

Cited by: 19

Authors:

Wainakh, Yaza [1]

Rauf, Moiz [2]

Pradel, Michael [2]

Affiliations:

[1] Tech Univ Darmstadt, Dept Comp Sci, Darmstadt, Germany

[2] Univ Stuttgart, Dept Comp Sci, Stuttgart, Germany

Source:

2021 IEEE/ACM 43RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2021) | 2021

Funding:

European Research Council;

Keywords:

source code; neural networks; embeddings; identifiers; benchmark; SEARCH;

DOI:

10.1109/ICSE43902.2021.00059

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 509 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.

Citation

Pages: 562-573

Page count: 12
