OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Authors
Tim Furche
Georg Gottlob
Giovanni Grasso
Christian Schallhart
Andrew Sellers
Affiliation
[1] Oxford University, Department of Computer Science
Source
The VLDB Journal | 2013 / Vol. 22
Keywords
Web extraction; Crawling; Data extraction; Automation; XPath; DOM; AJAX; Web applications;
DOI
Not available
Abstract
The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
Pages: 47 - 72
Page count: 25
Related papers
50 in total
  • [21] An algorithm of deep web crawler's crawling
    Xiang Peisu
    Tian Ke
    Huang Qinzhen
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE INFORMATION COMPUTING AND AUTOMATION, VOLS 1-3, 2008, : 1259 - +
  • [22] Learning Deep Web Crawling with Diverse Features
    Jiang, Lu
    Wu, Zhaohui
    Zheng, Qinghua
    Liu, Jun
    2009 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCES ON WEB INTELLIGENCE (WI) AND INTELLIGENT AGENT TECHNOLOGIES (IAT), VOL 1, 2009, : 572 - 575
  • [23] CHALLENGES IN WEB CRAWLING FOR DATA COLLECTION
    Cholakov, Georgi
    Doychev, Emil
    Koeva, Svetla
    MATHEMATICS AND INFORMATICS, 2024, 67 (01): : 7 - 17
  • [24] A workflow language for web automation
    Montoto, Paula
    Pan, Alberto
    Raposo, Juan
    Losada, Jose
    Bellas, Fernando
    Carneiro, Victor
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2008, 14 (11) : 1838 - 1856
  • [25] Focused Web Crawling: A Framework for Crawling of Country Based Financial Data
    Dey, Manas Kanti
    Chowdhury, Hasan Md Suhag
    Shamanta, Debakar
    Ahmed, Khandakar Entenam Unayes
    2010 2ND IEEE INTERNATIONAL CONFERENCE ON INFORMATION AND FINANCIAL ENGINEERING (ICIFE), 2010, : 409 - 412
  • [26] The Data Extraction Technology in Deep Web Data Integration System
    Xu, Jianchao
    Peng, Yuanyuan
    2011 AASRI CONFERENCE ON APPLIED INFORMATION TECHNOLOGY (AASRI-AIT 2011), VOL 1, 2011, : 31 - 34
  • [27] Semantic Deep Web: Automatic Attribute Extraction from the Deep Web Data Sources
    An, Yoo Jung
    Geller, James
    Wu, Yi-Ta
    Chun, Soon Ae
    APPLIED COMPUTING 2007, VOL 1 AND 2, 2007, : 1667 - 1672
  • [28] GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources
    Huang, Chih-Yuan
    Chang, Hao
    ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION, 2016, 5 (08)
  • [29] Synonyms extraction using Web content focused crawling
    Chen, Chien-Hsing
    Hsu, Chung-Chian
    INFORMATION RETRIEVAL TECHNOLOGY, 2008, 4993 : 286 - 297
  • [30] A web data extraction description language and its implementation
    Wu, IC
    Su, JY
    Chen, TB
    Proceedings of the 29th Annual International Computer Software and Applications Conference, 2005, : 293 - 298