Extracting Records from the Web Using a Signal Processing Approach

Cited by: 4
Authors
Velloso, Roberto Panerai [1]
Dorneles, Carina F. [1]
Affiliations
[1] Univ Fed Santa Catarina, Florianopolis, SC, Brazil
Keywords
web mining; record extraction; structure detection; information retrieval; record alignment; ALGORITHM;
DOI
10.1145/3132847.3132875
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Extracting records from web pages enables a number of important applications and has immense value due to the amount and diversity of available information that can be extracted. Although widely studied, this problem remains open because it is not trivial. Given the scale of the data, a feasible approach must be automatic and efficient, as well as effective. We present a novel approach, fully automatic and computationally efficient, that uses signal processing techniques to detect regularities and patterns in the structure of web pages. Our approach segments the web page, detects the data regions within it, identifies the record boundaries, and aligns the records. Results show a high F-score and linearithmic time complexity.
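
The general idea of treating page structure as a signal can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that a page's opening tags can be encoded as an integer sequence and that a repeated record layout appears as a dominant lag in a simple match-based autocorrelation. The names `TagSequence`, `tag_signal`, and `dominant_period` are hypothetical and introduced only for this example.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_signal(html):
    """Encode each opening tag as an integer code, numbered by first appearance."""
    parser = TagSequence()
    parser.feed(html)
    codes = {}
    return [codes.setdefault(tag, len(codes)) for tag in parser.tags]


def dominant_period(signal):
    """Return the smallest lag whose shifted copy best matches the signal.

    The score of a lag is the fraction of positions i where
    signal[i] == signal[i + lag]. Returns 0 if the signal is too short.
    """
    n = len(signal)
    best_lag, best_score = 0, -1.0
    for lag in range(1, n // 2 + 1):
        matches = sum(1 for i in range(n - lag) if signal[i] == signal[i + lag])
        score = matches / (n - lag)
        if score > best_score:  # strict '>' keeps the smallest lag on ties
            best_lag, best_score = lag, score
    return best_lag


# Assumed toy page: four records, each opening three tags (li, b, i).
sample_html = "<ul>" + "<li><b>x</b><i>y</i></li>" * 4 + "</ul>"
signal = tag_signal(sample_html)
print(signal)                   # [0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
print(dominant_period(signal))  # 3
```

The brute-force lag scan above is quadratic in the sequence length and is only meant to show the intuition; it does not reproduce the linearithmic behaviour reported in the abstract, nor the segmentation, data-region detection, and alignment steps.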
Pages: 197-206
Page count: 10