OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Cited by: 0

Authors:

Tim Furche

Georg Gottlob

Giovanni Grasso

Christian Schallhart

Andrew Sellers

Affiliations:

[1] Oxford University, Department of Computer Science

Source:

The VLDB Journal | 2013 / Vol. 22

Keywords:

Web extraction; Crawling; Data extraction; Automation; XPath; DOM; AJAX; Web applications

DOI:

Not available

Abstract:

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.

Pages: 47 - 72

Page count: 25

