A new text feature extraction model and its application in document copy detection

Cited by: 5
Authors
Bao, JP [1]
Shen, JY [1]
Liu, XD [1]
Song, QB [1]
Affiliations
[1] Xi An Jiao Tong Univ, Dept Comp Sci & Engn, Xian 710049, Peoples R China
Keywords
text feature; similarity measure; information retrieval; copy detection
DOI
10.1109/ICMLC.2003.1264447
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text feature extraction is a common problem in information retrieval, text mining, web mining, text classification/clustering, and document copy detection. The most popular approach is the word-frequency-based scheme, which represents a document as a word frequency vector. The cosine function, dot product, and proportion function are the usual similarity measures for such vectors. However, a word frequency vector captures only the global semantic features of a document and loses local features and structural information, which makes it hard to distinguish texts well, especially in copy detection. In this paper we present a new text feature extraction model, the Semantic Sequence Model (SSM), which is based on the concepts of word distance, word density, and semantic sequence. The semantic sequences of a document contain not only local semantic features but also global features and structural information, and with them we obtain excellent accuracy in text copy detection. At the end of the paper, we compare SSM with VSM and RFM; the experimental results show that SSM is a superior model.
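The limitation the abstract attributes to word frequency vectors can be illustrated with a minimal sketch. This is not the paper's code: the tokenizer, the function names, and the sample sentences are assumptions chosen for illustration, and only the dot product and cosine measures are shown because the abstract does not define the proportion function or the SSM quantities.

```python
import math
import re
from collections import Counter


def word_frequency_vector(text):
    """Map each lowercase word to its count; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def dot_product(u, v):
    """Sum of count products over the words the two vectors share."""
    return sum(u[w] * v[w] for w in u.keys() & v.keys())


def cosine_similarity(u, v):
    """Dot product normalized by the vector lengths; 0.0 for an empty vector."""
    norm = math.sqrt(dot_product(u, u) * dot_product(v, v))
    return dot_product(u, v) / norm if norm else 0.0


# Hypothetical sample texts (not from the paper).
original = "the dog bit the man"
reordered = "the man bit the dog"          # same words, different meaning
unrelated = "stock prices fell sharply"    # no words in common

a, b, c = (word_frequency_vector(t) for t in (original, reordered, unrelated))
sim_reordered = cosine_similarity(a, b)
sim_unrelated = cosine_similarity(a, c)
```

In this sketch the reordered sentence produces the same frequency vector as the original, so the two are indistinguishable under cosine similarity even though their structure differs, while the unrelated sentence shares no words and scores zero. That loss of order and locality is the gap the abstract says SSM is meant to address.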
Pages: 82-87
Page count: 6