OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Cited by: 0
Authors
Tim Furche
Georg Gottlob
Giovanni Grasso
Christian Schallhart
Andrew Sellers
Affiliation
[1] Oxford University, Department of Computer Science
Source
The VLDB Journal | 2013, Vol. 22
Keywords
Web extraction; Crawling; Data extraction; Automation; XPath; DOM; AJAX; Web applications
DOI
Not available
Abstract
The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
Pages: 47-72
Page count: 25