OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Authors
Tim Furche
Georg Gottlob
Giovanni Grasso
Christian Schallhart
Andrew Sellers
Affiliations
[1] Oxford University, Department of Computer Science
Source
The VLDB Journal | 2013 / Volume 22
Keywords
Web extraction; Crawling; Data extraction; Automation; XPath; DOM; AJAX; Web applications;
DOI
Not available
Abstract
The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
Pages: 47-72
Page count: 25
Related papers
50 in total
  • [41] Scalable information extraction for web queries
    Hsu, Meichun
    Xiong, Yuhong
    INTERNATIONAL JOURNAL OF COMPUTATIONAL SCIENCE AND ENGINEERING, 2010, 5 (3-4) : 176 - 184
  • [42] Naive Bayes based Language-Specific Web Crawling
    Srisukha, Ekkasit
    Jinarat, Supakpong
    Haruechaiyasak, Choochart
    Rungsawang, Arnon
    ECTI-CON 2008: PROCEEDINGS OF THE 2008 5TH INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING/ELECTRONICS, COMPUTER, TELECOMMUNICATIONS AND INFORMATION TECHNOLOGY, VOLS 1 AND 2, 2008, : 113 - +
  • [43] Design and implementation of crawling algorithm to collect deep web information for web archiving
    Oh, Hyo-Jung
    Won, Dong-Hyun
    Kim, Chonghyuck
    Park, Sung-Hee
    Kim, Yong
    DATA TECHNOLOGIES AND APPLICATIONS, 2018, 52 (02) : 266 - 277
  • [44] Deep web data extraction based on visual information processing
    Liu J.
    Lin L.
    Cai Z.
    Wang J.
    Kim H.-J.
    Journal of Ambient Intelligence and Humanized Computing, 2024, 15 (02) : 1481 - 1491
  • [45] Deep Web Data Extraction Using Query String Formation
    Sharma, Meenskashi
    Supriya
    PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON RELIABILTY, OPTIMIZATION, & INFORMATION TECHNOLOGY (ICROIT 2014), 2014, : 166 - 169
  • [46] Accuracy Crawler: An Accurate Crawler for Deep Web Data Extraction
    Mishra, Prafful
    Khurana, Anshul
    2018 INTERNATIONAL CONFERENCE ON CONTROL, POWER, COMMUNICATION AND COMPUTING TECHNOLOGIES (ICCPCCT), 2018, : 25 - 29
  • [47] A Visual Based Page Segmentation for Deep Web Data Extraction
    Palekar, Vikas R.
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON SOFT COMPUTING FOR PROBLEM SOLVING (SOCPROS 2011), VOL 2, 2012, 131 : 791 - 804
  • [48] Crawling Deep Web Using a New Set Covering Algorithm
    Wang, Yan
    Lu, Jianguo
    Chen, Jessica
    ADVANCED DATA MINING AND APPLICATIONS, PROCEEDINGS, 2009, 5678 : 326 - 337
  • [49] Ontology-based focused crawling of Deep Web sources
    Fang, Wei
    Cui, Zhiming
    Zhao, Pengpeng
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, 2007, 4798 : 514 - 519
  • [50] Focused Deep Web Entrance Crawling by Form Feature Classification
    Wang, Lin
    Hawbani, Ammar
    Wang, Xingfu
    BIG DATA COMPUTING AND COMMUNICATIONS, 2015, 9196 : 79 - 87