Improving plagiarism detection in text document using hybrid weighted similarity

Cited by: 9

Authors:

Arabi, Hamed [1]

Akbari, Mehdi [1,2]

Affiliations:

[1] Islamic Azad Univ, Fac Comp Engn, Najafabad Branch, Najafabad, Iran

[2] Islamic Azad Univ, Big Data Res Ctr, Najafabad Branch, Najafabad, Iran

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2022 / Volume 207

Keywords:

Extrinsic plagiarism; Word Embedding Technique; Bag of Word Technique; Structural Similarity; FastText; VECTOR-SPACE MODEL;

DOI:

10.1016/j.eswa.2022.118034

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Plagiarism is a form of misconduct: the use of scientific and literary content from other sources without citing them. Today, the rise of plagiarism has become a serious problem for publishers and researchers. Many researchers have addressed this problem and tried to identify types of plagiarism; however, most existing methods focus on direct copying and are not effective at detecting intelligent plagiarism. Therefore, this study proposes two methods for identifying extrinsic plagiarism. In both methods, to limit the search space, two stages of filtering based on the bag-of-words (BoW) technique are applied, one at the document level and one at the sentence level, and plagiarism is investigated only in the outputs of these two stages. To detect similarities between suspicious documents and sentences, the first method combines the pretrained FastText word embedding network with TF-IDF weighting to form two matrices, one structural and one semantic; the second method forms the two matrices using the WordNet ontology and TF-IDF weighting. After the matrices are formed and the similarity between the pairs of matrices of each sentence is calculated, two similarity values are computed using the Dice similarity and the structural similarity of the weighted composition. The similarity of each suspicious sentence is then compared with a minimum threshold, and the document containing the suspicious sentence is labeled as plagiarism or non-plagiarism. Experimental results on the PAN-PC-11 corpus show that the first method achieves 95.1% precision and the second method 93.8% precision, which suggests that a word embedding network can be more successful than the WordNet ontology in detecting extrinsic plagiarism.
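
The sentence-level decision step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the set-based Dice coefficient, the linear weighting parameter `alpha`, the default `threshold` value, and all function names below are assumptions chosen for illustration only.

```python
def dice_similarity(tokens_a, tokens_b):
    """Dice coefficient on token sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # two empty sentences: treat as no evidence of overlap
    return 2 * len(a & b) / (len(a) + len(b))


def hybrid_similarity(structural, semantic, alpha=0.5):
    """Weighted combination of two similarity scores.

    alpha is a hypothetical weight in [0, 1]; the paper's actual
    weighting scheme is not stated in the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * structural + (1.0 - alpha) * semantic


def label_sentence(score, threshold=0.6):
    """Label a sentence pair by comparing its score with a minimum threshold.

    The threshold value 0.6 is an illustrative assumption.
    """
    return "plagiarism" if score >= threshold else "non-plagiarism"


if __name__ == "__main__":
    suspicious = "the cat sat on the mat".split()
    source = "the cat lay on the mat".split()

    structural = dice_similarity(suspicious, source)
    # Placeholder semantic score; in the paper this would come from
    # FastText embeddings or WordNet, which are not modeled here.
    semantic = 0.6
    score = hybrid_similarity(structural, semantic, alpha=0.5)
    print(structural, score, label_sentence(score))
```

In this sketch the semantic score is supplied by hand; a fuller version would derive it from TF-IDF-weighted word embeddings or ontology-based word similarity, as the abstract describes.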


Pages: 15

50 items in total

[1] Psquad: Plagiarism detection and document similarity of Hindi text
Mittal, Shashank
Mishra, Atul
Khatter, Kiran
MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (06) : 17299 - 17326
[2] Psquad: Plagiarism detection and document similarity of Hindi text
Shashank Mittal
Atul Mishra
Kiran Khatter
Multimedia Tools and Applications, 2024, 83 : 17299 - 17326
[3] Plagiarism detection using document similarity based on distributed representation
Baba, Kensuke
Nakatoh, Tetsuya
Minami, Toshiro
8TH INTERNATIONAL CONFERENCE ON ADVANCES IN INFORMATION TECHNOLOGY, 2017, 111 : 382 - 387
[4] Multi-level text document similarity estimation and its application for plagiarism detection
Hadi Veisi
Mahboobeh Golchinpour
Mostafa Salehi
Erfaneh Gharavi
Iran Journal of Computer Science, 2022, 5 (2) : 143 - 155
[5] Fast Plagiarism Detection Based on Simple Document Similarity
Baba, Kensuke
2017 TWELFTH INTERNATIONAL CONFERENCE ON DIGITAL INFORMATION MANAGEMENT (ICDIM), 2017, : 54 - 58
[6] Document plagiarism detection using a new concept similarity in formal concept analysis
Muangprathub, Jirapond
Kajornkasirat, Siriwan
Wanichsombat, Apirat
Journal of Applied Mathematics, 2021, 2021
[7] Document Plagiarism Detection Using a New Concept Similarity in Formal Concept Analysis
Muangprathub, Jirapond
Kajornkasirat, Siriwan
Wanichsombat, Apirat
JOURNAL OF APPLIED MATHEMATICS, 2021, 2021
[8] Plagiarism Detection of Paraphrases in Text Documents with Document Retrieval
Sandhya, S.
Chitrakala, S.
ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY, 2011, 198 : 330 - 338
[9] Efficient document similarity detection using weighted phrase indexing
Niyigena P.
Zuping Z.
Khuhro M.A.
Hanyurwimfura D.
1600, Science and Engineering Research Support Society (11): : 231 - 244
[10] HYPLAG: Hybrid Arabic Text Plagiarism Detection System
Ghanem, Bilal
Arafeh, Labib
Rosso, Paolo
Sanchez-Vega, Fernando
NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2018), 2018, 10859 : 315 - 323
