SEMI-STRUCTURED DOCUMENT EXTRACTION BASED ON DOCUMENT ELEMENT BLOCK MODEL

Authors
Lv, Tao [1 ,4 ]
Liu, Jiang [1 ,4 ]
Lu, Fan [2 ]
Zhang, Peng [2 ]
Wang, Xinyan [3 ]
Wang, Cong [1 ,4 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Software Engn, Beijing 100876, Peoples R China
[2] Minist Sci & Technol, Beijing 100862, Peoples R China
[3] Air Force Gen Hosp, Beijing 100142, Peoples R China
[4] Beijing Univ Posts & Telecommun, Key Lab Trustworthy Distributed Comp & Serv, Beijing 100876, Peoples R China
Keywords
Semi-structured document; Document extraction; Regular expression;
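DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract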
Enterprises and institutions continually produce large numbers of documents related to their specific business in their daily work. To extract useful information from these semi-structured documents, we propose the document element block model (DEBM) and apply it to semi-structured document extraction. The model makes full use of the information contained in a document: not only its structure, but also its content. DEBM first extracts document element blocks from the template documents and the target documents, then generates regular expression rules from the characteristics of the template document's element blocks. Finally, for each type of document element, it extracts the elements from the set of target blocks at the corresponding element block positions by regular expression matching. Experiments show that DEBM-based extraction achieves good results and performs better than traditional regular expression and template matching approaches. The results indicate that DEBM is a simple and efficient model for extracting semi-structured documents.
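The pipeline outlined in the abstract (split documents into element blocks, derive regular expression rules from the template's blocks, then match the target's blocks by position) can be illustrated with a minimal sketch. This is not the authors' implementation. The block definition (runs of non-blank lines), the `Label: value` block shape, and the names `split_blocks`, `rule_from_template_block`, and `extract` are all assumptions made for illustration.

```python
import re


def split_blocks(text):
    """Split a document into element blocks.

    Assumption: a block is a run of non-blank lines separated by blank lines.
    """
    return [b.strip() for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]


def rule_from_template_block(block):
    """Derive a (label, regex) rule from a template block shaped 'Label: value'.

    The label is matched literally; the value becomes a capture group.
    Returns None when the block has no 'Label:' prefix.
    """
    label, sep, _ = block.partition(":")
    if not sep:
        return None
    label = label.strip()
    pattern = re.compile(
        r"^\s*" + re.escape(label) + r"\s*:\s*(.+?)\s*$", re.DOTALL
    )
    return label, pattern


def extract(template, target):
    """Apply each template rule to the target block at the same position."""
    target_blocks = split_blocks(target)
    fields = {}
    for pos, template_block in enumerate(split_blocks(template)):
        rule = rule_from_template_block(template_block)
        if rule is None or pos >= len(target_blocks):
            continue
        label, pattern = rule
        match = pattern.match(target_blocks[pos])
        if match:
            fields[label] = match.group(1)
    return fields


# Hypothetical template and target documents.
template = "Title: <title>\n\nAuthor: <name>\n\nDate: <date>"
target = "Title: Quarterly Report\n\nAuthor: Li Wei\n\nDate: 2014-03-01"
print(extract(template, target))
```

Matching by block position keeps each rule narrow, so a label that happens to reappear in another block's content is not picked up by mistake; a real system would likely need a more tolerant alignment between template and target blocks.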
Pages: 461-465
Page count: 5