Duplicate document detection

Author
Spitz, AL
Source
DOCUMENT RECOGNITION IV | 1997 / Vol. 3027
Keywords
duplicate documents; included documents; document databases; document handles; character shape coding;
DOI
Not available
Chinese Library Classification number
O43 [Optics];
Discipline classification code
070207 ; 0803 ;
Abstract
In document image filing applications it is important to be able to recognize whether a particular document has already been entered into the system, either as an individual document or as an inclusion in another document. Document images could be matched on the basis of layout or contents. However, matching of layout may not be effective when style is strictly controlled. We develop a document "handle" which is stored along with the document image. The handle is simply a character shape coded representation of the image after the figures and tables have been removed. Character shape coding is a method of identifying individual character images as members of one of a small number of classes. This process is computationally inexpensive and tolerant of differing generations of photocopying, skew, and scanner characteristics. When a new document is entered into the system, its handle is computed and compared against all of the extant handles using a normalized Levenshtein metric. We demonstrate the ability to detect duplicate documents comprising single and multiple pages.
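
The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation, and several parts are assumptions made for illustration only: `toy_shape_code` is a hypothetical stand-in that maps plain text characters to three coarse classes (the method in the abstract classifies character images, and its actual class set is not given here); the Levenshtein distance is normalized by the length of the longer handle, which is one common choice and may differ from the normalization used in the paper; and the `threshold` of 0.1 is an arbitrary value.

```python
# Hypothetical, text-based stand-in for character shape coding.
# The method in the abstract works on character images; these classes are assumed.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def toy_shape_code(text: str) -> str:
    """Map each character to a coarse shape class ('A', 'g', 'x'); keep spaces."""
    out = []
    for ch in text:
        if ch.isspace():
            out.append(" ")
        elif ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            out.append("A")  # tall characters
        elif ch in DESCENDERS:
            out.append("g")  # characters that drop below the baseline
        elif ch.isalpha():
            out.append("x")  # remaining lowercase letters
        # punctuation and other symbols are dropped in this toy version
    return "".join(out)


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using two rows of the dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def find_duplicates(new_handle: str, stored: dict, threshold: float = 0.1) -> list:
    """Return ids of stored handles within the (arbitrary) distance threshold."""
    return [doc_id for doc_id, handle in stored.items()
            if normalized_distance(new_handle, handle) <= threshold]


if __name__ == "__main__":
    stored = {
        "doc-a": toy_shape_code("the quick brown fox"),
        "doc-b": toy_shape_code("completely different words here"),
    }
    print(find_duplicates(toy_shape_code("the quick brown fax"), stored))
```

Because many characters collapse into the same class, a handle tolerates small recognition differences between scans, while the edit distance absorbs occasional inserted or dropped characters. Comparing a new handle against every stored handle, as in `find_duplicates`, mirrors the exhaustive comparison mentioned in the abstract.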
Pages: 88-94
Number of pages: 3