Extracting Records from the Web Using a Signal Processing Approach

Cited by: 4
Authors
Velloso, Roberto Panerai [1]
Dorneles, Carina F. [1]
Affiliations
[1] Univ Fed Santa Catarina, Florianopolis, SC, Brazil
Keywords
web mining; record extraction; structure detection; information retrieval; record alignment; ALGORITHM;
DOI
10.1145/3132847.3132875
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Extracting records from web pages enables a number of important applications and has immense value because of the amount and diversity of information that can be extracted. Although extensively studied, the problem remains open because it is not trivial. Given the scale of the data, a feasible approach must be automatic, efficient and, of course, effective. We present a novel approach, fully automatic and computationally efficient, that uses signal processing techniques to detect regularities and patterns in the structure of web pages. Our approach segments the web page, detects the data regions within it, identifies the record boundaries and aligns the records. Results show a high F-score and linearithmic time complexity.
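The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea of treating page structure as a signal. It is not the authors' method. It assumes, hypothetically, that the page is encoded as a sequence of integer tag codes and that a repeating record shows up as a peak in the autocorrelation of that sequence. The names `TagSequence`, `tag_signal` and `dominant_period` are invented for this example.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_signal(html):
    """Encode the page as integers, one code per distinct tag (assumed encoding)."""
    parser = TagSequence()
    parser.feed(html)
    codes = {}
    return [codes.setdefault(tag, len(codes) + 1) for tag in parser.tags]


def dominant_period(signal, max_lag=None):
    """Return the lag with the highest positive normalized autocorrelation.

    Returns 0 when the signal is too short or constant, or when no lag
    correlates positively.
    """
    n = len(signal)
    if n < 4:
        return 0
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    energy = sum(x * x for x in centered)
    if energy == 0:
        return 0
    max_lag = max_lag or n // 2
    best_lag, best_score = 0, 0.0
    for lag in range(1, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag`.
        overlap = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        score = overlap / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


page = "<ul>" + "<li><b>name</b><i>price</i></li>" * 6 + "</ul>"
signal = tag_signal(page)
period = dominant_period(signal)  # estimated number of tags per record
```

This direct autocorrelation takes quadratic time in the length of the signal. Reaching the linearithmic behaviour reported in the abstract would require something like an FFT-based computation, which this sketch omits for brevity.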
Pages: 197-206
Page count: 10