THESUS: Organizing Web document collections based on link semantics

Cited by: 0
Authors
Maria Halkidi
Benjamin Nguyen
Iraklis Varlamis
Michalis Vazirgiannis
Affiliations
[1] Athens University of Economics and Business, 76 Patision Street
[2] INRIA, Domaine de Voluceau
Source
The VLDB Journal | 2003 / Volume 12
Keywords
World Wide Web; Link analysis; Similarity measure; Document clustering; Link management; Semantics;
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
The requirements for effective search and management of the WWW are stronger than ever. Currently, Web documents are classified based on their content, without taking into account the fact that these documents are connected to each other by links. We claim that a page’s classification is enriched by the detection of its incoming links’ semantics. This would enable effective browsing and enhance the validity of search results in the WWW context. Another aspect that is underaddressed and closely related to the tasks of browsing and searching is the similarity of documents at the semantic level. The above observations lead us to adopt a hierarchy of concepts (ontology) and a thesaurus to exploit links and provide a better characterization of Web documents. The enhancement of document characterization makes operations such as clustering and labeling very interesting. To this end, we devised a system called THESUS. The system deals with an initial set of Web documents, extracts keywords from all pages’ incoming links, and converts them to semantics by mapping them to a domain’s ontology. Then a clustering algorithm is applied to discover groups of Web documents. The effectiveness of the clustering process is based on the use of a novel similarity measure between documents characterized by sets of terms. Web documents are organized into thematic subsets based on their semantics. The subsets are then labeled, thereby enabling easier management (browsing, searching, querying) of the Web. In this article, we detail the process of this system and give an experimental analysis of its results.
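The pipeline outlined in the abstract (link keywords, mapping to ontology concepts, set-based similarity, clustering) can be illustrated with a toy sketch. This is not the similarity measure or clustering algorithm defined in the paper, which the abstract does not specify. The ontology, the thesaurus, the document names, the Wu-Palmer style term similarity, the averaged best-match set similarity, and the threshold-based grouping are all assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy ontology: concept -> parent concept (the root has no parent).
PARENT = {
    "entity": None,
    "sport": "entity",
    "football": "sport",
    "tennis": "sport",
    "science": "entity",
    "physics": "science",
}

# Hypothetical thesaurus: keyword found in an incoming link -> ontology concept.
THESAURUS = {
    "soccer": "football",
    "football": "football",
    "tennis": "tennis",
    "quantum": "physics",
    "physics": "physics",
}


def path_to_root(concept):
    """Return the concept and all its ancestors, concept first, root last."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(path_to_root(concept))


def term_similarity(a, b):
    """Wu-Palmer style similarity (an assumed choice): 2*depth(lcs) / (depth(a)+depth(b))."""
    ancestors_a = set(path_to_root(a))
    # The first ancestor of b that is also an ancestor of a is the deepest common one.
    lcs = next(c for c in path_to_root(b) if c in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def set_similarity(s1, s2):
    """Average, in both directions, of each concept's best match in the other set."""
    if not s1 or not s2:
        return 0.0
    forward = sum(max(term_similarity(a, b) for b in s2) for a in s1) / len(s1)
    backward = sum(max(term_similarity(a, b) for a in s1) for b in s2) / len(s2)
    return (forward + backward) / 2


def to_concepts(keywords):
    """Map link keywords to ontology concepts, dropping unknown keywords."""
    return {THESAURUS[k] for k in keywords if k in THESAURUS}


def cluster(doc_concepts, threshold=0.5):
    """Group documents transitively whenever a pair reaches the similarity threshold."""
    names = sorted(doc_concepts)
    group = {n: n for n in names}

    def find(n):
        while group[n] != n:
            n = group[n]
        return n

    for x, y in combinations(names, 2):
        if set_similarity(doc_concepts[x], doc_concepts[y]) >= threshold:
            group[find(y)] = find(x)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())


# Hypothetical documents with keywords taken from their incoming links.
incoming_link_keywords = {
    "doc_a": ["soccer", "football"],
    "doc_b": ["tennis"],
    "doc_c": ["quantum", "physics"],
}
doc_concepts = {d: to_concepts(k) for d, k in incoming_link_keywords.items()}
clusters = cluster(doc_concepts, threshold=0.5)
print(clusters)
```

With this toy data the two sport pages share the ancestor "sport" and end up in one group, while the physics page stays on its own. A real system would need a domain ontology, a proper thesaurus, and a clustering algorithm suited to the collection size.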
Pages: 320 - 332
Page count: 12
Related papers
50 in total
  • [1] THESUS: Organizing Web document collections based on link semantics
    Halkidi, M
    Nguyen, B
    Varlamis, I
    Vazirgiannis, M
    VLDB JOURNAL, 2003, 12 (04) : 320 - 332
  • [2] THESUS, a closer view on Web content management enhanced with link semantics
    Varlamis, I
    Vazirgiannis, M
    Halkidi, M
    Nguyen, B
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2004, 16 (06) : 685 - 700
  • [3] The research of Self-Organizing Maps based on Document Collections
    Ding, Yi
    Fu, Xian
    FRONTIERS OF ADVANCED MATERIALS AND ENGINEERING TECHNOLOGY, PTS 1-3, 2012, 430-432 : 1232 - 1235
  • [4] Document similarity arithmetic based on Web structure semantics
    Huang, Jian-cai
    ADVANCING SCIENCE THROUGH COMPUTATION, 2008, : 496 - 499
  • [5] Managing very large document collections using semantics
    GuoRen Wang
    HongJun Lu
    Ge Yu
    YuBin Bao
    Journal of Computer Science and Technology, 2003, 18 : 403 - 406
  • [6] Managing very large document collections using semantics
    Wang, GR
    Lu, HJ
    Yu, G
    Bao, YB
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2003, 18 (03) : 403 - 406
  • [7] Statistical aspects of the WEBSOM system in organizing document collections
    Kaski, S
    Lagus, K
    Honkela, T
    Kohonen, T
    MINING AND MODELING MASSIVE DATA SETS IN SCIENCE, ENGINEERING, AND BUSINESS WITH A SUBTHEME IN ENVIRONMENTAL STATISTICS, 1997, 29 (01) : 281 - 290
  • [8] WEBSOM - Self-organizing maps of document collections
    Kaski, S
    Honkela, T
    Lagus, K
    Kohonen, T
    NEUROCOMPUTING, 1998, 21 (1-3) : 101 - 117
  • [9] Self-organizing maps of massive document collections
    Kohonen, T
    IJCNN 2000: PROCEEDINGS OF THE IEEE-INNS-ENNS INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOL II, 2000, : 3 - 9
  • [10] Formulation of complex queries over web-based document collections
    van Zwol, R
    Apers, PMG
    6TH WORLD MULTICONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL XVIII, PROCEEDINGS: INFORMATION SYSTEMS, CONCEPTS AND APPLICATIONS OF SYSTEMICS, CYBERNETICS AND INFORMATICS, 2002, : 200 - 207