OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Cited by: 0
Authors
Tim Furche
Georg Gottlob
Giovanni Grasso
Christian Schallhart
Andrew Sellers
Affiliation
[1] Oxford University, Department of Computer Science
Source
The VLDB Journal | 2013 / Vol. 22
Keywords
Web extraction; Crawling; Data extraction; Automation; XPath; DOM; AJAX; Web applications;
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
Pages: 47 - 72
Page count: 25
Related papers
50 in total
  • [1] OXPATH: A language for scalable data extraction, automation, and crawling on the deep web
    Furche, Tim
    Gottlob, Georg
    Grasso, Giovanni
    Schallhart, Christian
    Sellers, Andrew
    VLDB JOURNAL, 2013, 22 (01): : 47 - 72
  • [2] OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications
    Furche, Tim
    Gottlob, Georg
    Grasso, Giovanni
    Schallhart, Christian
    Sellers, Andrew
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2011, 4 (11): : 1016 - 1027
  • [3] LANGUAGE BASED WEB CRAWLING ON BIG DATA
    Girgin, Canan
    Gonultas, Hayati
    Pembe Muhtaroglu, F. Canan
    Demir, Seniz
    Akin, Ahmet A.
    Obali, Murat
    2014 22ND SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2014, : 1528 - 1531
  • [4] Crawling ranked deep Web data sources
    Yan Wang
    Jianguo Lu
    Jessica Chen
    Yaxin Li
    World Wide Web, 2017, 20 : 89 - 110
  • [5] Crawling ranked deep Web data sources
    Wang, Yan
    Lu, Jianguo
    Chen, Jessica
    Li, Yaxin
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2017, 20 (01): : 89 - 110
  • [6] DEEP WEB CRAWLING FOR INSIGHTS FROM POLAR DATA
    Khalsa, Siri Jodha S.
    Mattmann, Chris A.
    Duerr, Ruth
    2017 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS), 2017, : 376 - 379
  • [7] Deep Web Data Extraction
    Hong, Jer Lang
    IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC 2010), 2010, : 3420 - 3427
  • [8] Deep Web crawling: a survey
    Hernandez, Inma
    Rivero, Carlos R.
    Ruiz, David
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2019, 22 (04): : 1577 - 1610
  • [9] Deep Web crawling: a survey
    Inma Hernández
    Carlos R. Rivero
    David Ruiz
    World Wide Web, 2019, 22 : 1577 - 1610
  • [10] Deep Web navigation in Web data extraction
    Baumgartner, Robert
    Ceresna, Michal
    Ledermueller, Gerald
    INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE FOR MODELLING, CONTROL & AUTOMATION JOINTLY WITH INTERNATIONAL CONFERENCE ON INTELLIGENT AGENTS, WEB TECHNOLOGIES & INTERNET COMMERCE, VOL 2, PROCEEDINGS, 2006, : 698 - +