Java Code Clone Detection by Exploiting Semantic and Syntax Information from Intermediate Code Based Graph

Dawei Yuan, Sen Fang, Tao Zhang, Zhou Xu, Xiapu Luo

Research output: Journal article publication › Journal article › Academic research › peer-review

Abstract

Code clone detection plays a critical role in software engineering. Finding "functional" clone code by hand requires rich development experience, which makes the task unfriendly to novice developers. Although many approaches have been proposed to detect code clones automatically, their results are not satisfactory. A major reason is that it is difficult to extract syntactic and semantic information from source code. To resolve this problem, in this article, we develop a novel graph representation approach for detecting functional code clones. The graph representation is built from intermediate code compiled from the source code. With it, we can readily apply graph embedding techniques to extract syntactic and semantic features from the abstract syntax tree (AST), control flow graph (CFG), and data flow graph (DFG) generated from the intermediate code. We then use a Softmax classifier to detect functional code clone pairs. We evaluate the proposed approach on the code clone detection task using the BigCloneBench dataset. To improve performance, the embedded representation of intermediate code is initialized with pretrained vectors learned in advance from a collected LLVM IR dataset. The experimental results show that our proposed intermediate code-based graph approach performs better than existing functional code clone detection approaches. For type-4 code clone detection in particular, our approach outperforms the baseline approaches by an average of 33.49% in terms of F1 score.
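The final classification step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each code fragment has already been reduced to a fixed-length embedding vector, and it combines the two vectors of a pair with element-wise absolute difference and product before a linear layer and softmax. The function names, the feature combination, and the weights are hypothetical choices made for illustration; they are not taken from the article and do not reproduce the authors' model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pair_features(u, v):
    """Combine two embeddings (hypothetical choice): |u - v| followed by u * v."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return diff + prod


def classify_pair(u, v, weights, bias):
    """Return [P(non-clone), P(clone)] for one pair of embeddings.

    `weights` holds one row per class and `bias` one value per class.
    In practice these would be learned; here they are placeholders.
    """
    feats = pair_features(u, v)
    logits = [
        sum(w * f for w, f in zip(row, feats)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Placeholder parameters: a large embedding difference favours "non-clone".
WEIGHTS = [[1.0, 1.0, 0.0, 0.0], [-1.0, -1.0, 0.0, 0.0]]
BIAS = [0.0, 0.0]

probs = classify_pair([1.0, 2.0], [3.0, 0.0], WEIGHTS, BIAS)
print(probs)
```

In a trained system the weights would come from optimisation on labelled clone pairs, and the predicted class would be the index with the highest probability.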

Original language: English
Pages (from-to): 1-16
Number of pages: 16
Journal: IEEE Transactions on Reliability
Publication status: Published - 9 Jun 2022

Keywords

  • Abstract syntax tree (AST)
  • Cloning
  • Codes
  • Data mining
  • Feature extraction
  • Semantics
  • Syntactics
  • Task analysis
  • clone code detection
  • control flow graph (CFG)
  • data flow graph (DFG)
  • graph embedding
  • intermediate code

ASJC Scopus subject areas

  • Safety, Risk, Reliability and Quality
  • Electrical and Electronic Engineering
