Duplicate document detection

Cited by: 0
Author
Spitz, AL
Affiliation
Source
DOCUMENT RECOGNITION IV | 1997 / Vol. 3027
Keywords
duplicate documents; included documents; document databases; document handles; character shape coding;
DOI
Not available
Chinese Library Classification number
O43 [Optics];
Discipline classification code
070207 ; 0803 ;
Abstract
In document image filing applications it is important to be able to recognize whether a particular document has already been entered into the system, either as an individual document or as an inclusion in another document. Document images could be matched on the basis of layout or contents. However, matching of layout may not be effective when style is strictly controlled. We develop a document "handle" which is stored along with the document image. The handle is simply a character shape coded representation of the image after the figures and tables have been removed. Character shape coding is a method of identifying individual character images as members of one of a small number of classes. This process is computationally inexpensive and tolerant of differing generations of photocopying, skew, and scanner characteristics. When a new document is entered into the system, its handle is computed and compared against all of the extant handles using a normalized Levenshtein metric. We demonstrate the ability to detect duplicate documents comprising single and multiple pages.
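
The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes handles are already available as plain strings of shape-class symbols, it assumes the Levenshtein distance is normalized by the length of the longer handle (the abstract does not state the normalization), and the function names, the `threshold` value, and the sample handles are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def find_duplicates(new_handle: str, stored: dict, threshold: float = 0.1) -> list:
    """Return ids of stored handles within `threshold` of the new handle.

    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    return [doc_id for doc_id, handle in stored.items()
            if normalized_distance(new_handle, handle) <= threshold]


if __name__ == "__main__":
    # Hypothetical handles made of shape-class symbols.
    stored = {"doc1": "AxxgxA", "doc2": "iiiiii"}
    print(find_duplicates("AxxgxA", stored))
```

A whole-string distance like this only addresses full duplicates; detecting a document included inside a larger one would need some form of substring or local alignment, which this sketch does not attempt.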
Pages: 88-94
Page count: 3