Recognition techniques for extracting information from semi-structured documents

Cited by: 0
Authors
Della Ventura, A [1]
Gagliardi, I [1]
Zonta, B [1]
Affiliation
[1] CNR, ITIM, I-20131 Milan, Italy
Source
Keywords
OCR; automatic indexing; information retrieval; document analysis; image analysis; pattern matching; linguistic analysis
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Archives of optical documents are in increasingly widespread use, with demand driven in part by new regulations that recognize the legal value of digital documents, provided they are stored on physically unalterable media. On the supply side there is now a vast and technologically advanced market, in which optical memories have solved the problem of data durability and permanence at costs comparable to those of magnetic memories. The remaining bottleneck in these systems is indexing. The indexing of documents with a variable structure, while still not completely automated, can be machine-supported to a large degree, with evident advantages both in the organization of the work and in the extraction of information, since it provides data that is much more detailed and potentially more significant for the user. We present here a system for the automatic registration of correspondence to and from a public office. The system is based on a general methodology for the extraction, indexing, archiving, and retrieval of significant information from semi-structured documents. In our prototype application, this information is distributed among the database fields of sender, addressee, subject, date, and body of the document.
Pages: 130-137
Number of pages: 8