Organizing hidden-Web databases by clustering visible Web documents

Cited by: 0
Authors
Barbosa, Luciano [1]
Freire, Juliana [1]
Silva, Altigran [2]
Affiliations
[1] Univ Utah, Salt Lake City, UT 84112 USA
[2] Univ Fed Amazonas, Manaus, Amazonas, Brazil
Funding
National Science Foundation (USA)
Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context, both within and in the neighborhood of forms, as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured both in terms of entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
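The abstract reports cluster quality in terms of entropy and F-measure but does not define them here. As a minimal sketch, assuming the definitions commonly used in clustering evaluation (size-weighted per-cluster entropy, and a class-weighted best-match F-measure), the two scores could be computed as below. The function names and the domain labels in the usage note are hypothetical, and the paper's exact formulation may differ.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(labels, clusters):
    """Size-weighted average entropy (base 2) of the true labels in each cluster.

    Lower is better; 0.0 means every cluster holds a single domain.
    Assumed definition, not necessarily the one used in the paper.
    """
    n = len(labels)
    members_by_cluster = defaultdict(list)
    for label, cluster in zip(labels, clusters):
        members_by_cluster[cluster].append(label)

    total = 0.0
    for members in members_by_cluster.values():
        size = len(members)
        entropy = -sum(
            (count / size) * math.log2(count / size)
            for count in Counter(members).values()
        )
        total += (size / n) * entropy
    return total


def overall_f_measure(labels, clusters):
    """Class-weighted F-measure: each true class is scored by its best-matching cluster.

    Higher is better; 1.0 means every domain is recovered exactly.
    Assumed definition, not necessarily the one used in the paper.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = joint[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total
```

For example, with hypothetical gold labels `["airfare", "airfare", "hotel", "hotel"]`, the assignment `[0, 0, 1, 1]` separates the two domains perfectly, while `[0, 1, 0, 1]` mixes them evenly in both clusters; the first should score better on both measures.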
Pages: 301 / +
Page count: 2