Improving plagiarism detection in text document using hybrid weighted similarity

Cited by: 9
Authors
Arabi, Hamed [1]
Akbari, Mehdi [1, 2]
Affiliations
[1] Islamic Azad Univ, Fac Comp Engn, Najafabad Branch, Najafabad, Iran
[2] Islamic Azad Univ, Big Data Res Ctr, Najafabad Branch, Najafabad, Iran
Keywords
Extrinsic plagiarism; Word Embedding Technique; Bag of Word Technique; Structural Similarity; FastText; VECTOR-SPACE MODEL;
DOI
10.1016/j.eswa.2022.118034
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Plagiarism is a form of misconduct: the use of scientific or literary content from other sources without citing them. The rise of plagiarism has become a serious problem for publishers and researchers. Many researchers have addressed this problem and tried to identify types of plagiarism; however, most existing methods focus on direct copying and are not effective at detecting intelligent plagiarism. This study therefore proposes two methods for identifying extrinsic plagiarism. In both methods, the search space is limited by two stages of filtering based on the bag-of-words (BoW) technique, one at the document level and one at the sentence level, and plagiarism is investigated only in the outputs of these two stages. To detect similarities between suspicious documents and sentences, the first method combines pretrained FastText word embeddings with TF-IDF weighting to form two matrices, one structural and one semantic; the second method forms the two matrices using the WordNet ontology and TF-IDF weighting. Once the matrices are formed, two similarity values are calculated for each pair of sentences from their matrices, using Dice similarity and the structural similarity of the weighted combination. The similarity of each suspicious sentence is then compared with a minimum threshold, and the document containing that sentence is labeled as plagiarism or non-plagiarism accordingly. Experimental results on the PAN-PC-11 database show that the first method achieved 95.1% precision and the second method 93.8% precision, which suggests that word embeddings can be more successful than the WordNet ontology in detecting extrinsic plagiarism.
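The sentence-level scoring and thresholding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: exact token matching stands in for the FastText and WordNet semantic matrices, the IDF-weighted form of the Dice coefficient is an assumed variant, and the function names and the 0.6 threshold are hypothetical values chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def dice_similarity(tokens_a, tokens_b):
    """Plain Dice coefficient on token sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def idf_weights(sentences):
    """Smoothed IDF per token, treating each sentence as one document."""
    n = len(sentences)
    df = Counter(tok for s in sentences for tok in set(tokenize(s)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def weighted_dice(tokens_a, tokens_b, idf):
    """Dice coefficient with each token counted by its IDF weight (assumed form)."""
    a, b = set(tokens_a), set(tokens_b)

    def weight(tokens):
        # Tokens missing from the IDF table get a neutral weight of 1.0.
        return sum(idf.get(t, 1.0) for t in tokens)

    total = weight(a) + weight(b)
    return 2 * weight(a & b) / total if total else 0.0


def label_document(suspicious, sources, threshold=0.6):
    """Label a document 'plagiarism' if any sentence pair reaches the threshold.

    The threshold value is a placeholder, not a value taken from the paper.
    """
    idf = idf_weights(suspicious + sources)
    for s in suspicious:
        for src in sources:
            if weighted_dice(tokenize(s), tokenize(src), idf) >= threshold:
                return "plagiarism"
    return "non-plagiarism"
```

In this sketch, rare tokens shared by two sentences raise the score more than common ones, which is the general motivation for combining TF-IDF weighting with a set-overlap measure. A paraphrase that replaces words with synonyms would score low here, which is the gap that the embedding-based and ontology-based matrices in the abstract are meant to close.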
Pages: 15