Using clustering for web information extraction

Authors
Phong, Le [1]
Vuong, Bao [1]
Gao, Xiaoying [1]
Affiliation
[1] Victoria Univ Wellington, Sch Math Stat & Comp Sci, POB 600, Wellington, New Zealand
Keywords
information extraction; clustering; Smith-Waterman algorithm
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper introduces an approach that automates data extraction from semi-structured Web pages by clustering. Similarity comparison considers both the HTML tags and the textual features of text tokens. The first clustering process groups similar text tokens into text clusters, and the second groups similar data tuples into tuple clusters. A tuple cluster is a strong candidate for a repetitive data region.
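The abstract does not say how the similarity comparison is computed or where the Smith-Waterman algorithm from the keywords fits in. The following is only a minimal sketch under stated assumptions: each text token is represented by its HTML tag path, similarity is a normalized Smith-Waterman local alignment score over those paths, and clustering is a simple greedy threshold pass. The names `smith_waterman`, `similarity`, `cluster`, the scoring constants, and the threshold are all hypothetical and are not taken from the paper. Textual features, which the paper also uses, are left out here.

```python
from typing import List, Sequence

# Hypothetical scoring constants; the paper does not state its parameters.
MATCH, MISMATCH, GAP = 2, -1, -1


def smith_waterman(a: Sequence[str], b: Sequence[str]) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + GAP, curr[j - 1] + GAP)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Scale the score to [0, 1] by the best score the shorter sequence allows."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return smith_waterman(a, b) / (MATCH * shortest)


def cluster(items: List[Sequence[str]], threshold: float = 0.8) -> List[list]:
    """Greedy pass: join the first cluster whose first member is similar enough."""
    clusters: List[list] = []
    for item in items:
        for members in clusters:
            if similarity(item, members[0]) >= threshold:
                members.append(item)
                break
        else:
            clusters.append([item])
    return clusters


# Assumed input: one HTML tag path per text token.
tag_paths = [
    ["html", "body", "table", "tr", "td", "b"],
    ["html", "body", "table", "tr", "td", "b"],
    ["html", "body", "div", "p"],
    ["html", "body", "table", "tr", "td", "i"],
]
groups = cluster(tag_paths)
print([len(g) for g in groups])
```

In this sketch the three table-cell paths fall into one cluster and the paragraph path forms its own. A second pass of the same kind over data tuples, rather than single tokens, would correspond to the tuple-clustering step the abstract describes.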
Pages: 415 / +
Page count: 2