THESUS: Organizing Web document collections based on link semantics

Cited by: 0
Authors
Maria Halkidi
Benjamin Nguyen
Iraklis Varlamis
Michalis Vazirgiannis
Affiliations
[1] Athens University of Economics and Business, 76 Patision Street
[2] INRIA, Domaine de Voluceau
Source
The VLDB Journal | 2003 / Vol. 12
Keywords
World Wide Web; Link analysis; Similarity measure; Document clustering; Link management; Semantics
DOI
Not available
Abstract
The requirements for effective search and management of the WWW are stronger than ever. Currently, Web documents are classified based on their content, without taking into account the fact that these documents are connected to each other by links. We claim that a page’s classification is enriched by the detection of its incoming links’ semantics. This would enable effective browsing and enhance the validity of search results in the WWW context. Another aspect that is underaddressed and closely related to the tasks of browsing and searching is the similarity of documents at the semantic level. These observations lead us to adopt a hierarchy of concepts (ontology) and a thesaurus to exploit links and provide a better characterization of Web documents. This enhanced characterization makes operations such as clustering and labeling particularly interesting. To this end, we devised a system called THESUS. The system takes an initial set of Web documents, extracts keywords from all pages’ incoming links, and converts them to semantics by mapping them to a domain’s ontology. A clustering algorithm is then applied to discover groups of Web documents. The effectiveness of the clustering process relies on a novel similarity measure between documents characterized by sets of terms. Web documents are organized into thematic subsets based on their semantics. The subsets are then labeled, thereby enabling easier management (browsing, searching, querying) of the Web. In this article, we detail the process of this system and give an experimental analysis of its results.
Pages: 320-332
Page count: 12