Deep Reinforcement Learning for Web Crawling

Cited by: 4
Authors
Avrachenkov, Konstantin [1]
Borkar, Vivek [2]
Patil, Kishor [1]
Affiliations
[1] Inria Sophia Antipolis, F-06902 Valbonne, France
[2] Indian Inst Technol, Mumbai 400076, Maharashtra, India
Keywords
Reinforcement Learning; Adaptive Web Crawling; Thompson Sampling; Multi-armed Restless Bandits
DOI
10.1109/ICC54714.2021.9703160
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
A search engine uses a web crawler to fetch pages from the World Wide Web (WWW) and aims to keep its local cache as fresh as possible. Unfortunately, the rates at which different pages change on the WWW are highly non-uniform and, in many real-life scenarios, unknown. In addition, the finite available bandwidth and possible server restrictions on crawling frequency make it very difficult for the crawler to find the optimal scheduling policy that maximises the freshness of the local cache. We model this problem in a multi-armed restless bandits framework, where each arm represents a web page or an aggregate of statistically identical web pages. The objective is to find the scheduling policy that gives the exact indices of the pages to be crawled at a particular instant. We provide an online learning scheme using a deep reinforcement learning (DRL) framework, which learns the unknown page change dynamics on the fly along with the optimal crawling policy. Finally, we run numerical simulations to compare our approach with state-of-the-art algorithms such as static optimisation and Thompson sampling. We observe better performance for DRL.
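The scheduling problem described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method or their DRL scheme; it only assumes a simple discrete-time model in which each page changes independently with a fixed per-step probability and the crawler may refresh a limited number of pages per step. The names `simulate_freshness`, `round_robin`, `oldest_first`, `change_probs`, and `budget` are hypothetical and introduced here for illustration only.

```python
import random


def simulate_freshness(change_probs, budget, policy, steps=1000, seed=0):
    """Return the time-averaged fraction of cached pages that are fresh.

    Assumed model (not taken from the paper): page i changes at each step
    with probability change_probs[i]; the policy never sees these values.
    """
    rng = random.Random(seed)
    n = len(change_probs)
    fresh = [True] * n   # the cache starts in sync with the web
    age = [0] * n        # steps since each page was last crawled
    total = 0.0
    for t in range(steps):
        # A change on the web makes the cached copy stale.
        for i, p in enumerate(change_probs):
            if rng.random() < p:
                fresh[i] = False
            age[i] += 1
        # The policy returns the indices of the pages to crawl at this step.
        for i in policy(t, age, budget):
            fresh[i] = True
            age[i] = 0
        total += sum(fresh) / n
    return total / steps


def round_robin(t, age, budget):
    """Hypothetical baseline: cycle through the pages in a fixed order."""
    n = len(age)
    return [(t * budget + k) % n for k in range(budget)]


def oldest_first(t, age, budget):
    """Hypothetical baseline: crawl the pages not visited for the longest."""
    order = sorted(range(len(age)), key=lambda i: age[i], reverse=True)
    return order[:budget]


if __name__ == "__main__":
    probs = [0.5, 0.2, 0.05, 0.01]  # illustrative, non-uniform change rates
    for policy in (round_robin, oldest_first):
        print(policy.__name__, simulate_freshness(probs, 1, policy))
```

Any learned policy with the same signature, returning the indices of the pages to crawl, could be passed in place of the two baselines, which is the role the abstract assigns to the scheduling policy.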
Pages: 201-206
Number of pages: 6