Duplicate document detection

Cited by: 0
Author
Spitz, AL
Affiliation
Source
DOCUMENT RECOGNITION IV | 1997, Vol. 3027
Keywords
duplicate documents; included documents; document databases; document handles; character shape coding
DOI
Not available
Chinese Library Classification number
O43 [Optics]
Discipline classification code
070207; 0803
Abstract
In document image filing applications, it is important to be able to recognize whether a particular document has already been entered into the system, either as an individual document or as an inclusion in another document. Document images could be matched on the basis of layout or contents. However, matching of layout may not be effective when style is strictly controlled. We develop a document "handle" which is stored along with the document image. The handle is simply a character shape coded representation of the image after the figures and tables have been removed. Character shape coding is a method of identifying individual character images as members of one of a small number of classes. This process is computationally inexpensive and tolerant of differing generations of photocopying, skew, and scanner characteristics. When a new document is entered into the system, its handle is computed and compared against all of the extant handles using a normalized Levenshtein metric. We demonstrate the ability to detect duplicate documents comprising single and multiple pages.
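The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. Several parts are assumptions made for illustration: the shape classes (`A`, `x`, `g`, `i`, `j`) are a simplified, hypothetical stand-in for the paper's character shape codes; the coding here is applied to plain text, whereas the paper classifies character images; normalizing the Levenshtein distance by the longer handle's length is one plausible choice that the abstract does not specify; and the function names and the `threshold` value are invented for the example.

```python
# Hypothetical, simplified shape classes (assumption; not the paper's code set).
ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | set("0123456789")
DESCENDERS = set("gpqy")


def shape_code(text: str) -> str:
    """Map each character to a coarse shape class; drop punctuation."""
    out = []
    for ch in text:
        if ch in ASCENDERS:
            out.append("A")      # tall characters: ascenders, capitals, digits
        elif ch in DESCENDERS:
            out.append("g")      # characters that extend below the baseline
        elif ch in ("i", "j"):
            out.append(ch)       # dotted characters keep their own classes
        elif ch.isalpha():
            out.append("x")      # remaining x-height letters
        elif ch.isspace():
            out.append(" ")      # keep word boundaries
    # Collapse runs of whitespace so line breaks do not affect the handle.
    return " ".join("".join(out).split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer length (an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def find_duplicates(new_handle: str, stored: dict, threshold: float = 0.1) -> list:
    """Return ids of stored handles within the (hypothetical) threshold."""
    return [doc_id for doc_id, handle in stored.items()
            if normalized_distance(new_handle, handle) <= threshold]


if __name__ == "__main__":
    stored = {
        "a": shape_code("The quick brown fox"),
        "b": shape_code("Completely different text here"),
    }
    print(find_duplicates(shape_code("The quick brown fax"), stored))
```

Because many characters share a shape class, a misread such as "fax" for "fox" leaves the handle unchanged, which gives a rough sense of why a coarse coding can tolerate degraded scans. Comparing a new handle against every stored handle is linear in the number of documents, as the abstract describes.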
Pages: 88-94
Number of pages: 7