A probabilistic approach to printed document understanding

Cited by: 24
Authors
Medvet, Eric [1]
Bartoli, Alberto [1]
Davanzo, Giorgio [1]
Affiliations
[1] Univ Trieste, DEEI, I-34127 Trieste, Italy
Keywords
Document understanding; Automatic model upgrading; Invoice analysis; Maximum likelihood
DOI
10.1007/s10032-010-0137-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., groups of documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We evaluated our proposal experimentally using 807 multi-page printed documents from different domains (invoices, patents, data-sheets), obtaining very good results: for example, a success rate often greater than 90% even for classes with just two samples.
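The two steps named in the abstract (maximum likelihood estimation from a few labeled samples, then selecting the most probable block) can be illustrated with a deliberately simplified sketch. It is not the authors' model: the abstract does not give the form of the probability, so the code below assumes, purely for illustration, a single field located by one block, with the block's normalized x and y positions modeled as independent Gaussians. The names `Block`, `fit_gaussian`, `fit_field`, and `best_block`, as well as the variance floor, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """An OCR-generated block; x and y are normalized page coordinates (assumed)."""
    x: float
    y: float
    text: str


def fit_gaussian(values):
    """Maximum likelihood estimate of a Gaussian: sample mean and 1/n variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # Variance floor (an assumption) so that classes with very few samples
    # do not collapse to a degenerate zero-variance distribution.
    return mean, max(var, 1e-4)


def log_pdf(value, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)


def fit_field(clicked_blocks):
    """Estimate per-class parameters from the blocks the operator clicked."""
    return {
        "x": fit_gaussian([b.x for b in clicked_blocks]),
        "y": fit_gaussian([b.y for b in clicked_blocks]),
    }


def best_block(blocks, params):
    """Return the block with the highest log-probability under the fitted model."""
    def score(b):
        return log_pdf(b.x, *params["x"]) + log_pdf(b.y, *params["y"])
    return max(blocks, key=score)


# Two samples of a hypothetical invoice class: the date sits near the top right.
samples = [Block(0.80, 0.10, "2009-11-02"), Block(0.82, 0.12, "2010-01-15")]
params = fit_field(samples)

# A new document of the same class, as a list of OCR blocks.
document = [
    Block(0.10, 0.10, "Invoice"),
    Block(0.79, 0.11, "2010-03-14"),
    Block(0.80, 0.90, "Total"),
]
chosen = best_block(document, params)
```

In this toy setting, `chosen` is the block closest to where the sample dates appeared. Extending the sketch to sequences of blocks, several fields, or multiple pages would mean scoring tuples of blocks rather than single blocks, which the abstract describes but does not specify in detail.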
Pages: 335-347
Number of pages: 13