Enhanced information retrieval by using HTML tags

Cited by: 0
Authors
Werner, L [1]
Böttcher, S [1]
Beckmann, R [1]
Affiliations
[1] Univ Gesamthsch Paderborn, C LAB, D-4790 Paderborn, Germany
Keywords
typographical information; text classification; HTML tags
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Whenever digital libraries or knowledge management systems are to be automatically filled with web pages from the internet, document classification of the web pages is one of the major challenges. We present an approach which uses HTML tags in order to improve the quality of the hypertext document classification. Our approach uses weighting of HTML tags for separating relevant information in hypertext documents from the noise. We have evaluated our approach on the basis of a document classification algorithm. The results show that our weighting approach yields a classification which is approximately 35% better than a classification without the use of the HTML tagging information.
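
The abstract does not state which tags are weighted or by how much, so the following is only a minimal sketch of the general idea of tag-based term weighting, not the authors' method. The weight table `TAG_WEIGHTS`, the ignored tags, and the names `TagWeightedCounter` and `weighted_terms` are all hypothetical, chosen for illustration. Each term is counted with the highest weight among its enclosing tags, and text inside `script` or `style` is discarded as noise.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights for illustration only; not values from the paper.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "b": 2.0, "strong": 2.0}
IGNORED_TAGS = {"script", "style"}  # text inside these tags is treated as noise
DEFAULT_WEIGHT = 1.0                # weight for text outside any weighted tag


class TagWeightedCounter(HTMLParser):
    """Accumulates per-term weights based on the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []           # stack of open weighted or ignored tags
        self.term_weights = Counter()

    def handle_starttag(self, tag, attrs):
        # Track only tags that influence weighting; this also avoids
        # stacking void elements such as <br> that never get closed.
        if tag in TAG_WEIGHTS or tag in IGNORED_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy nesting.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in IGNORED_TAGS for t in self.open_tags):
            return
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.term_weights[term] += weight


def weighted_terms(html):
    """Return a Counter mapping each term to its accumulated tag weight."""
    parser = TagWeightedCounter()
    parser.feed(html)
    parser.close()
    return parser.term_weights
```

The resulting weighted term counts could replace plain term frequencies as input features for whatever document classifier is used; the choice of classifier is left open here, as the abstract does not name one.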
Pages: 24-29
Page count: 6