PaperDiff: A Script Independent Automatic Method for Finding The Text Differences Between Two Document Images

Cited by: 2
Authors
Ramachandrula, Sitaram [1]
Joshi, Gopal Datt [1]
Noushath, S. [1]
Parikh, Pulkit [1]
Gupta, Vishal [1]
Affiliation
[1] Hewlett Packard Labs India, Bangalore, Karnataka, India
DOI
10.1109/DAS.2008.69
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we introduce a novel concept called PaperDiff and propose an algorithm to implement it. PaperDiff compares two printed (paper) documents using their images and determines the differences between them in terms of text inserted, deleted, and substituted. This lets an end user compare two documents that are both already printed, or one printed document with one in electronic form (such as an MS Word *.doc file). The proposed algorithm is based on word image comparison, so it also works for symbol strings and for any script or language (including documents with multiple scripts), cases in which even mature optical character recognition (OCR) technology has had very little success. PaperDiff helps end users such as lawyers and novelists compare new versions of a document with older ones. The method remains suitable when the two input documents are formatted differently, so that the structures of the document images differ (e.g., different page widths or page structure). In an experiment on single-column text documents, PaperDiff achieved 99.2% accuracy in detecting 135 induced differences across 10 pairs of documents.
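The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of word-level differencing without OCR, not the authors' method. It assumes each word image has already been segmented and reduced to a numeric descriptor; the `Word` type, the `words_match` tolerance test, and the `diff_words` function are all hypothetical names introduced for illustration. The two word sequences are aligned with a standard edit-distance dynamic program, and the backtrace reports insertions, deletions, and substitutions. Because the alignment runs over the word sequence rather than page positions, differences in line breaks or page width do not affect it.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a word-image descriptor (e.g. a shape profile).
Word = Tuple[float, ...]
Op = Tuple[str, int, int]  # (operation, index in old, index in new); -1 = none


def words_match(a: Word, b: Word, tol: float = 0.1) -> bool:
    """Assumed matcher: descriptors of equal length within a tolerance."""
    if len(a) != len(b):
        return False
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0) <= tol


def diff_words(
    old: Sequence[Word],
    new: Sequence[Word],
    match: Callable[[Word, Word], bool] = words_match,
) -> List[Op]:
    """Align two word sequences by edit distance and list the differences."""
    n, m = len(old), len(new)
    # cost[i][j] = minimum edits turning old[:i] into new[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if match(old[i - 1], new[j - 1]) else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # keep or substitute
                cost[i - 1][j] + 1,        # delete from old
                cost[i][j - 1] + 1,        # insert from new
            )

    # Walk back from the end to recover one optimal list of operations.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if match(old[i - 1], new[j - 1]) else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                if sub:
                    ops.append(("substitute", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", i - 1, -1))
            i -= 1
        else:
            ops.append(("insert", -1, j - 1))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    old_doc = [(1.0,), (2.0,), (3.0,)]
    new_doc = [(1.0,), (2.5,), (3.0,), (4.0,)]
    for op in diff_words(old_doc, new_doc):
        print(op)
```

In a real system the descriptor and the matching rule would carry most of the difficulty (noise, skew, font changes); the one-dimensional tuples here are placeholders chosen only to keep the sketch runnable.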
Pages: 585-590
Page count: 6