A probabilistic approach to printed document understanding

Cited by: 24
Authors
Medvet, Eric [1]
Bartoli, Alberto [1]
Davanzo, Giorgio [1]
Affiliation
[1] Univ Trieste, DEEI, I-34127 Trieste, Italy
Keywords
Document understanding; Automatic model upgrading; Invoice analysis; Maximum likelihood
DOI
10.1007/s10032-010-0137-1
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., groups of documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document that contain the information to be extracted. Our approach is based on probability: we derive a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator's actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal using 807 multi-page printed documents from different domains (invoices, patents, data-sheets) and obtained very good results: for example, the success rate was often greater than 90% even for classes with just two samples.
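The estimate-then-maximize procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the abstract does not give the form of the probability, so the sketch assumes a single field, a block described only by a normalized (x, y) position, and independent Gaussian distributions for the two coordinates. The names `fit_class`, `best_block`, and the variance floor `min_var` are hypothetical.

```python
import math
from statistics import fmean


def fit_gaussian(values, min_var=1e-4):
    """Maximum likelihood mean and variance (variance divides by n).

    min_var is an assumed floor that keeps the variance positive
    when a class has very few samples.
    """
    mu = fmean(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, max(var, min_var)


def log_gauss(x, mu, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def fit_class(clicked_positions):
    """Estimate class parameters from the (x, y) of the blocks the operator clicked."""
    xs, ys = zip(*clicked_positions)
    return fit_gaussian(xs), fit_gaussian(ys)


def best_block(blocks, params):
    """Return the candidate block with the highest log-likelihood."""
    (mx, vx), (my, vy) = params
    return max(
        blocks,
        key=lambda b: log_gauss(b[0], mx, vx) + log_gauss(b[1], my, vy),
    )


# Two samples of a class: positions of the clicked block (assumed data).
clicked = [(0.80, 0.10), (0.82, 0.12)]
params = fit_class(clicked)

# OCR blocks of a new document of the same class (assumed data).
candidates = [(0.10, 0.10), (0.81, 0.11), (0.80, 0.90)]
chosen = best_block(candidates, params)
```

With several fields to extract, the same idea would score a whole sequence of blocks, one per field, rather than a single block; how the paper combines the per-block terms is not stated in the abstract.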
Pages: 335-347
Page count: 13