Duplicate document detection

Cited by: 0

Author:

Spitz, AL

Affiliation:

Source:

DOCUMENT RECOGNITION IV | 1997 / Vol. 3027

Keywords:

duplicate documents; included documents; document databases; document handles; character shape coding;

DOI:

Not available

Chinese Library Classification number:

O43 [Optics];

Discipline classification code:

070207 ; 0803 ;

Abstract:

In document image filing applications it is important to be able to recognize whether a particular document has already been entered into the system either as an individual document or as an inclusion in another document. Document images could be matched on the basis of layout or contents. However, matching of layout may not be effective when style is strictly controlled. We develop a document "handle" which is stored along with the document image. The handle is simply a character shape coded representation of the image after the figures and tables have been removed. Character shape coding is a method of identifying individual character images as members of one of a small number of classes. This process is computationally inexpensive and tolerant of differing generations of photocopying, skew and scanner characteristics. When a new document is entered into the system, its handle is computed and compared against all of the extant handles using a normalized Levenshtein metric. We demonstrate the ability to detect duplicate documents comprising single and multiple pages.
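
The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes that a handle is available as a plain string of shape-code symbols, and that the Levenshtein distance is normalized by dividing by the length of the longer handle; the abstract does not state which normalization the paper uses. The function names, the `threshold` value, and the sample handles are hypothetical. The sketch covers whole-handle comparison only, not detection of a document included inside another.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Assumed normalization: edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def find_duplicates(new_handle: str,
                    stored_handles: Dict[str, str],
                    threshold: float = 0.1) -> List[str]:
    """Return ids of stored handles within the (hypothetical) threshold."""
    return [doc_id for doc_id, handle in stored_handles.items()
            if normalized_distance(new_handle, handle) <= threshold]


if __name__ == "__main__":
    # Hypothetical handles; real shape codes would come from the image.
    stored = {"doc1": "AxxAxgxxA", "doc2": "xxxxgggAA"}
    print(find_duplicates("AxxAxgxxA", stored))
```

Comparing a new handle against every stored handle, as the abstract describes, costs one quadratic-time edit-distance computation per stored document, so a large collection would likely need some pre-filtering in practice.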


Pages: 88 - 94

Page count: 3

50 items in total

[1] Duplicate document detection in DocBrowse
Chalana, V
Bruce, A
Nguyen, T
DOCUMENT RECOGNITION V, 1998, 3305 : 169 - 178
[2] Duplicate document detection by template matching
Caprari, RS
IMAGE AND VISION COMPUTING, 2000, 18 (08) : 633 - 643
[3] A document comparison scheme for secure duplicate detection
Mandreoli F.
Martoglia R.
Tiberio P.
International Journal on Digital Libraries, 2004, 4 (3) : 223 - 244
[4] Collection statistics for fast duplicate document detection
Chowdhury, A
Frieder, O
Grossman, D
McCabe, MC
ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2002, 20 (02) : 171 - 191
[5] Web Document Duplicate Detection Using Fuzzy Hashing
Figuerola, Carlos G.
Gomez Diaz, Raquel
Alonso Berrocal, Jose L.
Zazo Rodriguez, Angel F.
TRENDS IN PRACTICAL APPLICATIONS OF AGENTS AND MULTI-AGENTS SYSTEMS, 2011, 90 : 117 - 125
[6] Efficient Near-Duplicate Document Detection using FPGAs
Luo, Xi
Najjar, Walid
Hristidis, Vagelis
2013 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, 2013
[7] Near-duplicate document detection with improved similarity measurement
袁鑫攀
龙军
张祖平
桂卫华
Journal of Central South University, 2012, 19 (08) : 2231 - 2237
[8] Deep Learning in the Domain of Near-Duplicate Document Detection
Roul, Rajendra Kumar
BIG DATA ANALYTICS (BDA 2019), 2019, 11932 : 439 - 459
[9] Near-duplicate document detection with improved similarity measurement
Yuan Xin-pan
Long Jun
Zhang Zu-ping
Gui Wei-hua
JOURNAL OF CENTRAL SOUTH UNIVERSITY, 2012, 19 (08) : 2231 - 2237
[10] Near-duplicate document detection with improved similarity measurement
Xin-pan Yuan
Jun Long
Zu-ping Zhang
Wei-hua Gui
Journal of Central South University, 2012, 19 : 2231 - 2237
