OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Cited by: 0

Authors:

Tim Furche

Georg Gottlob

Giovanni Grasso

Christian Schallhart

Andrew Sellers

Affiliation:

[1] Oxford University, Department of Computer Science

Source:

The VLDB Journal | 2013 / Vol. 22

Keywords:

Web extraction; Crawling; Data extraction; Automation; XPath; DOM; AJAX; Web applications;

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.

Pages: 47-72

Page count: 25
