Deep Reinforcement Learning for Web Crawling

Cited by: 4

Authors:

Avrachenkov, Konstantin [1]

Borkar, Vivek [2]

Patil, Kishor [1]

Affiliations:

[1] Inria Sophia Antipolis, F-06902 Valbonne, France

[2] Indian Inst Technol, Mumbai 400076, Maharashtra, India

Source:

2021 SEVENTH INDIAN CONTROL CONFERENCE (ICC) | 2021

Keywords:

Reinforcement Learning; Adaptive Web Crawling; Thompson Sampling; Multi-armed Restless Bandits

DOI:

10.1109/ICC54714.2021.9703160

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

A search engine uses a web crawler to crawl pages from the World Wide Web (WWW) and aims to keep its local cache as fresh as possible. Unfortunately, the rates at which different pages change on the WWW are highly non-uniform and, in many real-life scenarios, unknown. In addition, the finite available bandwidth and possible server restrictions on crawling frequency make it very difficult for the crawler to find the optimal scheduling policy that maximises the freshness of the local cache. We model this problem in a multi-armed restless bandits framework, where each arm represents a web page or an aggregate of statistically identical web pages. The objective is to find the scheduling policy that gives the exact indices of the pages to be crawled at a particular instant. We provide an online learning scheme using a deep reinforcement learning (DRL) framework, which learns the unknown page change dynamics on the fly along with the optimal crawling policy. Finally, we run numerical simulations to compare our approach with state-of-the-art algorithms such as static optimisation and Thompson sampling. We observe better performance for DRL.
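
The scheduling problem described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's DRL scheme, static optimisation, or Thompson sampling baseline. It assumes a discrete-time model in which each page changes independently with a known per-step probability, the crawler may refresh a fixed number of pages per step (the bandwidth budget), and freshness is the fraction of cached pages that match the live page. All names (`simulate_freshness`, `round_robin`, `stalest_first`) are hypothetical and chosen for illustration only.

```python
import random


def simulate_freshness(change_probs, budget, policy, steps=2000, seed=0):
    """Return the time-averaged fraction of fresh pages in the cache.

    Assumed model: page i changes with probability change_probs[i] per step,
    and the crawler refreshes the pages returned by `policy` each step.
    """
    rng = random.Random(seed)
    n = len(change_probs)
    fresh = [True] * n   # True if the cached copy matches the live page
    age = [0] * n        # steps elapsed since each page was last crawled
    total = 0.0
    for t in range(steps):
        # Crawl the pages picked by the policy: they become fresh again.
        for i in policy(t, age, change_probs, budget):
            fresh[i] = True
            age[i] = 0
        # Each page then changes independently, making its cached copy stale.
        for i in range(n):
            age[i] += 1
            if rng.random() < change_probs[i]:
                fresh[i] = False
        total += sum(fresh) / n
    return total / steps


def round_robin(t, age, change_probs, budget):
    """Cycle through the pages in a fixed order, ignoring change rates."""
    n = len(age)
    return [(t * budget + k) % n for k in range(budget)]


def stalest_first(t, age, change_probs, budget):
    """Myopic heuristic: crawl the pages most likely to be stale.

    Assumes the change probabilities are known, which the abstract says
    is often not the case in practice.
    """
    def p_stale(i):
        return 1.0 - (1.0 - change_probs[i]) ** age[i]
    return sorted(range(len(age)), key=p_stale, reverse=True)[:budget]


if __name__ == "__main__":
    probs = [0.5, 0.2, 0.1, 0.05, 0.01, 0.01]
    for policy in (round_robin, stalest_first):
        print(policy.__name__, simulate_freshness(probs, budget=2, policy=policy))
```

Both policies here use fixed rules. The setting in the abstract is harder because the change rates must be learned online while the crawl schedule is being chosen.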

Pages: 201-206

Page count: 6
