Revisiting Information Retrieval and Deep Learning Approaches for Code Summarization

Tingwei Zhu, Zhong Li, Minxue Pan, Chaoxuan Shi, Tian Zhang, Yu Pei, Xuandong Li

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

6 Citations (Scopus)

Abstract

Code summarization is the task of generating short descriptions that capture the semantics of source code snippets. Existing code summarization approaches can be broadly classified into Information Retrieval (IR)-based and Deep Learning (DL)-based approaches. However, their effectiveness, and especially their relative strengths and weaknesses, remain largely understudied. Existing evaluations use different benchmarks and metrics, which makes performance comparisons across these approaches susceptible to bias and can yield misleading results. For example, the DL-based approaches typically show better code summarization performance in their original papers [1], [2]. However, Gros et al. [3] report that a naive IR approach can achieve performance comparable to (or even better than) that of the DL-based ones. In addition, some recent work [4], [5] suggests that incorporating IR techniques can improve the DL-based approaches. To further advance code summarization techniques, it is critical to understand how IR-based and DL-based approaches perform on different datasets and under different metrics. Prior works have studied some aspects of code summarization, such as the factors affecting performance evaluation [6] and the importance of data preprocessing [7]. In this paper, we focus on the IR-based and DL-based code summarization approaches, with the aim of improving the understanding of existing techniques and informing the design of more advanced ones. We first compare the IR-based and DL-based approaches under the same experimental settings and benchmarks, then study their strengths and limitations through quantitative and qualitative analyses. Finally, we propose a simpler but effective strategy to combine IR and DL to further improve code summarization. We investigate four IR-based approaches and two DL-based approaches, selected for their representativeness and diversity.
For IR-based approaches, we select three BM25-based approaches (i.e., BM25-spl, BM25-ast, BM25-alpha) and one nearest neighbor-based approach, NNGen [8], all of which are often used as baselines in prior works. They retrieve the most similar code from the database and directly output the corresponding summary. The BM25-based approaches are implemented with Lucene [9] and differ in the form of code they take as input: BM25-spl splits the CamelCase and snake_case tokens of the original source code, BM25-ast obtains a sequence representation through pre-order traversal of the Abstract Syntax Tree (AST), and BM25-alpha keeps only the alpha tokens in the code. For DL-based approaches, we choose the state-of-the-art pre-trained model PLBART [2] and the trained-from-scratch model SiT [1]. We adopt four widely used Java datasets, namely TLC [10], CSN [11], HDC [12], and FCM [13], as our subject datasets. TLC and HDC are method-split datasets, where methods in the same project are randomly split into training/validation/test sets. CSN and FCM are project-split datasets, where examples from the same project exist in only one partition. We further process the four datasets to build cleaner versions by removing examples with problems such as syntax errors, empty method bodies, or sequences that are too long or too short. We also remove the duplicate examples in the validation and test sets. To comprehensively and systematically evaluate the performance of the code summarization approaches, we adopt three widely used metrics in our experiments: BLEU (both C-BLEU and S-BLEU), ROUGE, and METEOR.
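
The retrieval pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses Lucene, whereas this is a small pure-Python Okapi BM25 stand-in. The function and class names, the regular expressions, the reading of "alpha tokens" as purely alphabetic tokens, and the parameters `k1=1.2` and `b=0.75` are all assumptions made for illustration. The BM25-ast variant is omitted because it requires a Java parser.

```python
import math
import re
from collections import Counter

# Assumed lexer: identifiers, numbers, or single non-space symbols.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S")
# Assumed sub-token splitter: acronyms, (capitalised) words, digits, symbols.
PIECE_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+|[^A-Za-z0-9]")


def split_identifiers(code):
    """BM25-spl-style input: split CamelCase and snake_case tokens."""
    tokens = []
    for raw in TOKEN_RE.findall(code):
        for part in raw.split("_"):
            tokens.extend(p.lower() for p in PIECE_RE.findall(part))
    return tokens


def alpha_only(code):
    """BM25-alpha-style input: keep only purely alphabetic tokens (assumed reading)."""
    return [t.lower() for t in TOKEN_RE.findall(code) if t.isalpha()]


class BM25Index:
    """Tiny Okapi BM25 index over pre-tokenised documents (illustrative only)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.lengths = [len(d) for d in docs]
        self.avgdl = sum(self.lengths) / len(docs)
        self.tfs = [Counter(d) for d in docs]
        n = len(docs)
        df = Counter(t for tf in self.tfs for t in tf)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf = self.tfs[i]
        norm = self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avgdl)
        # Sum the BM25 term weight over every query token found in document i.
        return sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                   for t in query if t in tf)

    def best(self, query):
        return max(range(len(self.tfs)), key=lambda i: self.score(query, i))


def summarize(query_code, train_pairs, tokenize=split_identifiers):
    """Return the summary attached to the most similar training snippet."""
    index = BM25Index([tokenize(code) for code, _ in train_pairs])
    return train_pairs[index.best(tokenize(query_code))][1]


# Hypothetical toy database of (code, summary) pairs.
TRAIN = [
    ("public int getMaxValue(int[] values) { int m = values[0]; return m; }",
     "Returns the maximum value in the array."),
    ("public void close_connection() { conn.close(); }",
     "Closes the database connection."),
]

if __name__ == "__main__":
    print(summarize("int findMaxValue(int[] xs)", TRAIN))
```

Passing `tokenize=alpha_only` to `summarize` switches the sketch to the BM25-alpha-style input while leaving the retrieval step unchanged, which mirrors how the three BM25 variants differ only in their input representation.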

Original language: English
Title of host publication: Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering
Subtitle of host publication: Companion Proceedings (ICSE-Companion)
Publisher: IEEE Computer Society
Pages: 328-329
Number of pages: 2
ISBN (Electronic): 9798350322637
Publication status: Published - May 2023
Event: 45th IEEE/ACM International Conference on Software Engineering: Companion, ICSE-Companion 2023 - Melbourne, Australia
Duration: 14 May 2023 to 20 May 2023

Publication series

Name: Proceedings - International Conference on Software Engineering
ISSN (Print): 0270-5257

Conference

Conference: 45th IEEE/ACM International Conference on Software Engineering: Companion, ICSE-Companion 2023
Country/Territory: Australia
City: Melbourne
Period: 14/05/23 to 20/05/23

Keywords

  • Code summarization
  • deep learning
  • empirical study
  • information retrieval

ASJC Scopus subject areas

  • Software
