SEMI-STRUCTURED DOCUMENT EXTRACTION BASED ON DOCUMENT ELEMENT BLOCK MODEL

Cited by: 0
Authors
Lv, Tao [1 ,4 ]
Liu, Jiang [1 ,4 ]
Lu, Fan [2 ]
Zhang, Peng [2 ]
Wang, Xinyan [3 ]
Wang, Cong [1 ,4 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Software Engn, Beijing 100876, Peoples R China
[2] Minist Sci & Technol, Beijing 100862, Peoples R China
[3] Air Force Gen Hosp, Beijing 100142, Peoples R China
[4] Beijing Univ Posts & Telecommun, Key Lab Trustworthy Distributed Comp & Serv, Beijing 100876, Peoples R China
Keywords
Semi-structured document; Document extraction; Regular expression
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Enterprises and institutions continually produce large numbers of documents related to their specific business in their daily work. To obtain useful information from these semi-structured documents, we propose the document element block model (DEBM) and apply it to semi-structured document extraction. The model makes full use of the information contained in a document: not only its structural information but also its content. DEBM extracts document element blocks from template documents and target documents, then generates corresponding regular expression rules from the characteristics of the template document's element blocks. Each set of element blocks in a target document is then processed by regular expression matching, and the document elements are extracted according to the position of the corresponding element block. Experiments show that DEBM-based extraction achieves good results and performs better than traditional regular expressions and template matching. The results indicate that DEBM is a simple and efficient model for extracting semi-structured documents.
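The abstract gives no implementation details, so the following is only a minimal sketch of the general idea (derive regular expression rules from a template document's blocks, then match them against a target document's blocks). It is not the authors' DEBM. It assumes, purely for illustration, that an element block is a single non-empty line and that a template block has the form `Label: {field}`; the function names `extract_blocks`, `rules_from_template`, and `extract` are hypothetical.

```python
import re
from typing import Dict, List


def extract_blocks(text: str) -> List[str]:
    """Split a document into element blocks.

    Assumption for this sketch: one non-empty line is one block.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def rules_from_template(template: str) -> Dict[str, re.Pattern]:
    """Build one regex rule per template block shaped like 'Label: {field}'."""
    rules: Dict[str, re.Pattern] = {}
    for block in extract_blocks(template):
        m = re.fullmatch(r"(?P<label>[^:]+):\s*\{(?P<field>\w+)\}", block)
        if m:
            # Escape the label so it is matched literally in target documents.
            label = re.escape(m.group("label").strip())
            rules[m.group("field")] = re.compile(rf"^{label}\s*:\s*(?P<value>.+)$")
    return rules


def extract(target: str, rules: Dict[str, re.Pattern]) -> Dict[str, str]:
    """Apply the template-derived rules to each block of a target document."""
    result: Dict[str, str] = {}
    for block in extract_blocks(target):
        for field, pattern in rules.items():
            m = pattern.match(block)
            if m and field not in result:
                # Keep the first match for each field.
                result[field] = m.group("value").strip()
    return result


# Hypothetical template and target documents, invented for this example.
template = "Applicant: {applicant}\nProject title: {title}\nBudget: {budget}"
target = (
    "Applicant: Zhang Wei\n\n"
    "Project title: Trusted computing survey\n"
    "Budget: 120000 CNY\n"
    "Notes: none"
)
print(extract(target, rules_from_template(template)))
```

Blocks in the target with no matching template rule (here, the `Notes` line) are simply ignored. A fuller model would presumably also use block position and structure, as the abstract describes.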
Pages: 461-465
Number of pages: 5