Recognition techniques for extracting information from semi-structured documents

Cited by: 0
Authors
Della Ventura, A [1]
Gagliardi, I [1]
Zonta, B [1]
Affiliation
[1] CNR, ITIM, I-20131 Milan, Italy
Source
Keywords
OCR; automatic indexing; information retrieval; document analysis; image analysis; pattern matching; linguistic analysis;
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Archives of optical documents are increasingly widely used, with demand also driven by new norms that sanction the legal value of digital documents, provided they are stored on physically unalterable media. On the supply side there is now a vast and technologically advanced market, in which optical memories have solved the problem of data durability and permanence at costs comparable to those of magnetic memories. The remaining bottleneck in these systems is indexing. The indexing of documents with a variable structure, while still not completely automated, can be machine-supported to a large degree, with evident advantages both in the organization of the work and in the extraction of information, yielding data that is much more detailed and potentially significant for the user. We present here a system for the automatic registration of correspondence to and from a public office. The system is based on a general methodology for the extraction, indexing, archiving, and retrieval of significant information from semi-structured documents. In our prototype application, this information is distributed among the database fields of sender, addressee, subject, date, and body of the document.
Pages: 130-137
Number of pages: 8
Related papers
  • [1] RETRACTED: Extracting Information from Semi-structured Web Documents: A Framework (Retracted Article)
    Memon, Nasrullah
    Qureshi, Abdul Rasool
    Hicks, David
    Harkiolakis, Nicholas
    ADVANCED WEB AND NETWORK TECHNOLOGIES, AND APPLICATIONS, 2008, 4977 : 54 - +
  • [2] Extracting information from semi-structured Internet sources
    Jeong, JS
    Oh, DI
    ISIE 2001: IEEE INTERNATIONAL SYMPOSIUM ON INDUSTRIAL ELECTRONICS PROCEEDINGS, VOLS I-III, 2001, : 1378 - 1381
  • [3] Extracting information from semi-structured internet sources
    Div. of Info. Tech. Eng., College of Engineering, SoonChunHyang University, Asan, Korea, Republic of
    IEEE Int Symp Ind Electron, (1378-1381):
  • [4] Retracted: Extracting information from semi-structured web documents: A framework
    Department of Computer Science and Engineering, Aalborg University, Niels Bohrs Vej 8, Esbjerg
    DK-6700, Denmark
    Unknown
    Unknown
    Lect. Notes Comput. Sci., 2008, (54-64):
  • [5] Information extraction from semi-structured web documents
    Yun, Bo-Hyun
    Seo, Chang-Ho
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, 2006, 4092 : 586 - 598
  • [6] A strategy for extracting information from semi-structured web pages
    Shaker, Mahmoud
    Ibrahim, Hamidah
    Mustapha, Aida
    Abdullah, Lili Nurliyana
    INTERNATIONAL JOURNAL OF WEB INFORMATION SYSTEMS, 2010, 6 (04) : 304 - 318
  • [7] A Framework for Extracting Information from Semi-Structured Web Data Sources
    Shaker, Mahmoud
    Ibrahim, Hamidah
    Mustapha, Aida
    Abdullah, Lili Nurliyana
    THIRD 2008 INTERNATIONAL CONFERENCE ON CONVERGENCE AND HYBRID INFORMATION TECHNOLOGY, VOL 1, PROCEEDINGS, 2008, : 27 - 31
  • [8] Supplementing domain knowledge to BERT with semi-structured information of documents
    Chen, Jing
    Wei, Zhihua
    Wang, Jiaqi
    Wang, Rui
    Gong, Chuanyang
    Zhang, Hongyun
    Miao, Duoqian
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 235
  • [9] An approach to semantic information retrieval in heterogeneous semi-structured documents
    Mrabet, Yassine
    Bennacer, Nacéra
    Pernelle, Nathalie
    Thiam, Mouhamadou
    CORIA 2010: Actes de la COnference en Recherche d'Information et Applications - Proceedings of the Conference on Information Retrieval and Applications, 2010, : 195 - 210
  • [10] Adding Structure to Semi-Structured Documents
    Moens, Marie-Francine
    LEGAL KNOWLEDGE AND INFORMATION SYSTEMS: JURIX 2009: THE TWENTY-SECOND ANNUAL CONFERENCE, 2009, 205 : IX - IX