OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Cited by: 0

Authors:

Tim Furche

Georg Gottlob

Giovanni Grasso

Christian Schallhart

Andrew Sellers

Affiliation:

[1] Oxford University, Department of Computer Science

Source:

The VLDB Journal | 2013 / Volume 22

Keywords:

Web extraction; Crawling; Data extraction; Automation; XPath; DOM; AJAX; Web applications

Abstract:

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.

Pages: 47-72

Page count: 25

