Duplicate document detection

Cited by: 0
Authors
Spitz, AL
Affiliation
Source
DOCUMENT RECOGNITION IV | 1997 / Vol. 3027
Keywords
duplicate documents; included documents; document databases; document handles; character shape coding;
DOI
Not available
Chinese Library Classification number
O43 [Optics];
Discipline classification code
070207 ; 0803 ;
Abstract
In document image filing applications it is important to be able to recognize whether a particular document has already been entered into the system, either as an individual document or as an inclusion in another document. Document images could be matched on the basis of layout or contents. However, matching of layout may not be effective when style is strictly controlled. We develop a document "handle" which is stored along with the document image. The handle is simply a character shape coded representation of the image after the figures and tables have been removed. Character shape coding is a method of identifying individual character images as members of one of a small number of classes. This process is computationally inexpensive and tolerant of differing generations of photocopying, skew and scanner characteristics. When a new document is entered into the system, its handle is computed and compared against all of the extant handles using a normalized Levenshtein metric. We demonstrate the ability to detect duplicate documents comprising single and multiple pages.
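The matching step described in the abstract can be illustrated with a minimal sketch. Several parts are assumptions and are not taken from the paper: the shape classes in `shape_code` are a hypothetical grouping applied to plain text (the paper codes character images, not text), the distance is normalized by the length of the longer handle, and the `threshold` value and the function names are illustrative only. The sketch handles whole-document duplicates; detecting a document included in another would need a substring-style comparison that is not shown here.

```python
from typing import Dict, List, Tuple

# Hypothetical shape classes, simulated on text instead of character images.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(text: str) -> str:
    """Map each character to a coarse shape class (assumed classes)."""
    out = []
    for ch in text:
        if ch.isspace():
            out.append(" ")   # keep word boundaries
        elif ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            out.append("A")   # tall characters
        elif ch in DESCENDERS:
            out.append("g")   # characters that drop below the baseline
        elif ch in "ij":
            out.append("i")   # dotted characters
        elif ch.isalpha():
            out.append("x")   # x-height characters
        else:
            out.append(".")   # punctuation and everything else
    return "".join(out)


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def find_duplicates(new_handle: str,
                    stored: Dict[str, str],
                    threshold: float = 0.1) -> List[Tuple[str, float]]:
    """Return (doc_id, distance) pairs at or below the assumed threshold."""
    scored = [(doc_id, normalized_distance(new_handle, handle))
              for doc_id, handle in stored.items()]
    return sorted((p for p in scored if p[1] <= threshold), key=lambda p: p[1])
```

With these assumed classes, `shape_code("Dog")` yields `"Axg"`. Because many characters share a class, two handles can match even when the underlying characters were recognized differently, which is consistent with the tolerance to scanning and photocopying noise that the abstract claims for shape coding.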
Pages: 88-94
Page count: 3