A probabilistic approach to printed document understanding

Cited by: 24
Authors
Medvet, Eric [1]
Bartoli, Alberto [1]
Davanzo, Giorgio [1]
Affiliations
[1] Univ Trieste, DEEI, I-34127 Trieste, Italy
Keywords
Document understanding; Automatic model upgrading; Invoice analysis; Maximum likelihood
DOI
10.1007/s10032-010-0137-1
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., groups of documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal using 807 multi-page printed documents from different domains (invoices, patents, data-sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.
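The two steps named in the abstract (maximum likelihood estimation from a few samples, then choosing the blocks that maximize the probability) can be illustrated with a minimal sketch. It is not the authors' model: the abstract does not give the form of the probability, so the sketch assumes a single searched element, a hypothetical block described by `(x, y, width, height)`, and independent Gaussian attributes. The names `fit_gaussians`, `log_likelihood`, `best_block`, and the `MIN_SIGMA` floor are all invented for illustration.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# Hypothetical block description: (x, y, width, height) of an OCR block.
Block = Tuple[float, float, float, float]

# Assumed floor on the standard deviation, so that attributes that are
# identical across the few samples do not produce a zero variance.
MIN_SIGMA = 1.0


def fit_gaussians(samples: Sequence[Block]) -> List[Tuple[float, float]]:
    """Maximum likelihood (mean, std) per attribute, assuming Gaussians."""
    params = []
    for values in zip(*samples):
        mu = mean(values)
        # pstdev is the population (biased) estimator, i.e. the ML estimate.
        sigma = max(pstdev(values), MIN_SIGMA)
        params.append((mu, sigma))
    return params


def log_likelihood(block: Block, params: Sequence[Tuple[float, float]]) -> float:
    """Sum of per-attribute Gaussian log densities (independence assumed)."""
    total = 0.0
    for value, (mu, sigma) in zip(block, params):
        z = (value - mu) / sigma
        total += -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    return total


def best_block(blocks: Sequence[Block], params: Sequence[Tuple[float, float]]) -> int:
    """Index of the candidate block with the highest log-likelihood."""
    return max(range(len(blocks)), key=lambda i: log_likelihood(blocks[i], params))


# Two operator-labelled samples of the same class (made-up coordinates).
samples = [(100.0, 700.0, 80.0, 12.0), (104.0, 696.0, 84.0, 12.0)]
params = fit_gaussians(samples)

# OCR blocks of a new document of that class (made-up coordinates).
candidates = [
    (20.0, 50.0, 200.0, 14.0),
    (101.0, 699.0, 82.0, 12.0),
    (400.0, 700.0, 80.0, 12.0),
]
chosen = best_block(candidates, params)
```

With these made-up values the second candidate is selected, because it lies closest to the positions seen in the two samples. A sequence of blocks, as in the abstract, would need a joint score over several searched elements rather than this per-block maximum.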
Pages: 335-347
Number of pages: 13
Related papers
50 items in total
  • [1] A probabilistic approach to printed document understanding
    Eric Medvet
    Alberto Bartoli
    Giorgio Davanzo
    International Journal on Document Analysis and Recognition (IJDAR), 2011, 14 : 335 - 347
  • [2] An Approach for Printed Document Labeling
    Adak, Chandranath
    2014 FIRST INTERNATIONAL CONFERENCE ON AUTOMATION, CONTROL, ENERGY & SYSTEMS (ACES-14), 2014, : 23 - 26
  • [3] A PROBABILISTIC LEARNING APPROACH FOR DOCUMENT INDEXING
    FUHR, N
    BUCKLEY, C
    ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1991, 9 (03) : 223 - 248
  • [4] NMF-based approach to font classification to printed English alphabets for document image understanding
    Lee, CW
    Jung, KC
    MODELING DECISIONS FOR ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2005, 3558 : 354 - 364
  • [5] An approach for reconstructing mathematical formulas in printed document
    Tian Xudong
    Xu Lijuan
    Li Na
    ICCSE'2006: Proceedings of the First International Conference on Computer Science & Education: ADVANCED COMPUTER TECHNOLOGY, NEW EDUCATION, 2006, : 31 - 34
  • [6] An approach for processing mathematical expressions in printed document
    Chaudhuri, BB
    Garain, U
    DOCUMENT ANALYSIS SYSTEMS: THEORY AND PRACTICE, 1999, 1655 : 310 - 321
  • [7] The organisation and visualisation of document corpora: A probabilistic approach
    Girolami, M
    Vinokourov, A
    Kabán, A
    11TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATION, PROCEEDINGS, 2000, : 558 - 564
  • [8] A probabilistic relational approach for web document clustering
    Fersini, E.
    Messina, E.
    Archetti, F.
    INFORMATION PROCESSING & MANAGEMENT, 2010, 46 (02) : 117 - 130
  • [9] Understanding probabilistic expectations - a behavioral approach
    Xiao, Wei
    JOURNAL OF ECONOMIC DYNAMICS & CONTROL, 2022, 139
  • [10] A Hybrid Probabilistic Approach for Table Understanding
    Sun, Kexuan
    Rayudu, Harsha
    Pujara, Jay
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 4366 - 4374