A natural language processing approach to automatic plagiarism detection

Chi Hong Leung, Yuen Yan Chan

Research output: Chapter in book / Conference proceeding > Conference article published in proceeding or book > Academic research > peer-review

12 Citations (Scopus)

Abstract

The problem of plagiarism has existed for a long time, but advances in information technology have made it worse, because electronic versions of published materials are now available to everyone. The Web is an important and common source for plagiarism. Plagiarism detection programs (such as Turnitin) have been developed to address this problem. To determine whether an article has been copied from the Web or other electronic sources, a plagiarism detection program must calculate the similarity between two articles. However, plagiarism is often difficult to detect accurately once the copied content has been modified: for example, a word can simply be replaced with a synonym (e.g. "program" with "software"), or the entire sentence structure can be changed. Most plagiarism detection programs can only compare whether two words are lexically identical and count the number of matched words in a paper, so deliberately modified copies become difficult to detect. Natural language processing can help resolve this kind of problem: the underlying syntactic structure and semantic meaning of two sentences can be compared to reveal their similarity. The matching procedure has several steps. First, a thesaurus (a lexical hierarchical structure) is consulted to find the synonyms, broader terms, and narrower terms used in the paper being checked; WordNet is a typical thesaurus that can be used for this purpose. The paper is then compared with the documents in the database. If the paper is suspected of containing content from the database, its sentences may be parsed to construct parse trees and semantic representations for further detailed comparison. In the system, a context-free grammar represents the syntactic structure of sentences, and a case grammar represents their semantic meaning.
It is found that plagiarism which cannot be detected by traditional methods can be identified by this new approach.
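The synonym-substitution weakness the abstract describes (e.g. replacing "program" with "software") can be countered by a thesaurus-aware matcher. The sketch below illustrates that first matching step; the toy `SYNONYMS` table is a hypothetical stand-in for a full lexical resource such as WordNet, and all names and entries are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of thesaurus-aware word matching, in the spirit of the
# paper's first step. SYNONYMS is a toy stand-in for WordNet (assumption).

# Toy synonym table (hypothetical entries, for illustration only).
SYNONYMS = {
    "program": {"software", "application"},
    "software": {"program", "application"},
    "paper": {"article", "document"},
    "article": {"paper", "document"},
}

def words_match(a: str, b: str) -> bool:
    """True if the words are lexically identical or listed as synonyms."""
    return a == b or b in SYNONYMS.get(a, set())

def similarity(sentence1: str, sentence2: str) -> float:
    """Fraction of words in sentence1 with a (synonym-aware) match in sentence2."""
    words1 = sentence1.lower().split()
    words2 = sentence2.lower().split()
    if not words1:
        return 0.0
    matched = sum(1 for w1 in words1 if any(words_match(w1, w2) for w2 in words2))
    return matched / len(words1)

original = "the program detects copied text"
rewritten = "the software detects copied text"
print(similarity(original, rewritten))  # prints 1.0: "program"/"software" match
```

A purely lexical comparison of the same pair would score only 4 of 5 words, whereas the synonym-aware matcher recognizes the substitution; a real system would draw these relations from WordNet's synsets, broader terms, and narrower terms rather than a hand-made table.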
Original language: English
Title of host publication: SIGITE'07 - Proceedings of the 2007 ACM Information Technology Education Conference
Pages: 213-218
Number of pages: 6
DOIs
Publication status: Published - 1 Dec 2007
Externally published: Yes
Event: 8th ACM SIG-Information Technology Education Conference, SIGITE 2007 - Destin, FL, United States
Duration: 18 Oct 2007 - 20 Oct 2007

Conference

Conference: 8th ACM SIG-Information Technology Education Conference, SIGITE 2007
Country: United States
City: Destin, FL
Period: 18/10/07 - 20/10/07

Keywords

  • Natural language processing
  • Plagiarism detection
  • Syntactic and semantic analysis

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Education
