Using clustering for web information extraction

Cited by: 0
Authors
Phong, Le [1]
Vuong, Bao [1]
Gao, Xiaoying [1]
Affiliation
[1] Victoria Univ Wellington, Sch Math Stat & Comp Sci, POB 600, Wellington, New Zealand
Keywords
information extraction; clustering; Smith-Waterman algorithm
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper introduces an approach that automates data extraction from semi-structured Web pages by clustering. Both HTML tags and the textual features of text tokens are considered when comparing similarity. The first clustering process groups similar text tokens into text clusters, and the second clustering process groups similar data tuples into tuple clusters. A tuple cluster is a strong candidate for a repetitive data region.
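The abstract does not say how similarity is computed; the keywords only name the Smith-Waterman algorithm. The following is therefore a minimal sketch under stated assumptions, not the authors' method: it scores two token sequences by Smith-Waterman local alignment and then groups sequences with a greedy threshold clustering. The scoring values, the normalization, the threshold, the function names, and the sample tag sequences are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def smith_waterman_score(a: Sequence[str], b: Sequence[str],
                         match: int = 2, mismatch: int = -1,
                         gap: int = -1) -> int:
    """Return the best local-alignment score between two token sequences.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a: Sequence[str], b: Sequence[str], match: int = 2) -> float:
    """Normalize the score by the best score the shorter sequence could reach."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return smith_waterman_score(a, b, match=match) / (match * shortest)


def cluster(items: List[Sequence[str]], threshold: float = 0.7) -> List[List[Sequence[str]]]:
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters: List[List[Sequence[str]]] = []
    for item in items:
        for group in clusters:
            if similarity(item, group[0]) >= threshold:
                group.append(item)
                break
        else:
            clusters.append([item])
    return clusters


# Hypothetical tuples: HTML tags plus coarse text-token types.
row_a = ["tr", "td", "TEXT", "td", "TEXT"]
row_b = ["tr", "td", "TEXT", "td", "NUM"]
other = ["div", "p", "TEXT"]
groups = cluster([row_a, row_b, other])
```

Here `row_a` and `row_b` share a long common prefix and land in the same group, while `other` forms its own group. Normalizing by the shorter sequence means a short sequence fully contained in a longer one scores 1.0, which may or may not be desirable for a given page layout.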
Pages: 415 / +
Number of pages: 2