Automatically Extracting Web Data Records

Cited by: 0
Authors
Mundluru, Dheerendranath [1]
Raghavan, Vijay V. [1]
Wu, Zonghuan [1]
Affiliation
[1] IMshopping Inc, Santa Clara, CA USA
Source
ACTIVE MEDIA TECHNOLOGY | 2010 / Vol. 6335
Keywords
Structured data extraction; Web content mining;
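DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract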
It is essential for Web applications such as e-commerce portals to enrich their existing content offerings by aggregating relevant structured data (e.g., product reviews) from external Web resources. To meet this goal, we present an algorithm for automatically extracting data records from Web pages. The algorithm uses a robust string matching technique to accurately identify the records in a Web page. Our experiments on diverse datasets (including datasets from third-party research projects) show that the proposed algorithm is highly effective and performs considerably better than two other state-of-the-art automatic data extraction systems. We have made the proposed system publicly accessible so that readers can evaluate it.
Pages: 510 / +
Page count: 2
Related papers
50 in total
  • [31] Hidden Web Query Technique for Extracting the Data From Deep Web Data Base
    Das, Nripendra Narayan
    Kumar, Ela
    WORLD CONGRESS ON ENGINEERING AND COMPUTER SCIENCE, WCECS 2012, VOL I, 2012, : 410 - 414
  • [32] Mining web pages for data records
    Liu, B
    Grossman, R
    Zhai, YH
    IEEE INTELLIGENT SYSTEMS, 2004, 19 (06) : 49 - 55
  • [33] Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets
    Gao, Yihan
    Huang, Silu
    Parameswaran, Aditya
    SIGMOD'18: PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2018, : 943 - 958
  • [34] Extracting smoking data from GP electronic health records
    Masters, Nigel J.
    BRITISH JOURNAL OF GENERAL PRACTICE, 2021, 71 (703) : 58 - 59
  • [35] Extracting users' interests from web log data
    Murata, Tsuyoshi
    Saito, Kota
    2006 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, (WI 2006 MAIN CONFERENCE PROCEEDINGS), 2006, : 343 - +
  • [36] Extracting a statistical data matrix from electronic patient records
    Gall, W
    Heinzl, H
    Sachs, P
    COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2001, 66 (2-3) : 153 - 166
  • [37] Adaptively extracting structured data from Web pages
    Guo, Yingnan
    Zhang, Jiajun
    Chen, Xing
    2019 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING (ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM 2019), 2019, : 1524 - 1525
  • [38] Automatically Extracting Axioms in Classical Planning
    Miura, Shuwa
    Fukunaga, Alex
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 4973 - 4974
  • [39] NEIL: Extracting Visual Knowledge from Web Data
    Chen, Xinlei
    Shrivastava, Abhinav
    Gupta, Abhinav
    2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2013, : 1409 - 1416
  • [40] UNIVERSALEXTRACT - EXTRACTING DEEP WEB DATA USING ONTOLOGY
    Hong, Jer Lang
    Yin, Brian Ho Hoe
    UNCERTAINTY MODELLING IN KNOWLEDGE ENGINEERING AND DECISION MAKING, 2016, 10 : 377 - 383