Deep Reinforcement Learning for Web Crawling

Cited by: 4
Authors
Avrachenkov, Konstantin [1]
Borkar, Vivek [2]
Patil, Kishor [1]
Affiliations
[1] Inria Sophia Antipolis, F-06902 Valbonne, France
[2] Indian Inst Technol, Mumbai 400076, Maharashtra, India
Keywords
Reinforcement Learning; Adaptive Web Crawling; Thompson Sampling; Multi-armed Restless Bandits
DOI
10.1109/ICC54714.2021.9703160
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
A search engine uses a web crawler to fetch pages from the World Wide Web (WWW) and aims to keep its local cache as fresh as possible. Unfortunately, the rates at which different pages change on the WWW are highly non-uniform and, in many real-life scenarios, unknown. In addition, finite bandwidth and possible server restrictions on crawling frequency make it very difficult for the crawler to find the optimal scheduling policy that maximises the freshness of the local cache. We model this problem in a multi-armed restless bandits framework, where each arm represents a web page or an aggregate of statistically identical web pages. The objective is to find the scheduling policy that gives the exact indices of the pages to be crawled at a particular instant. We provide an online learning scheme using a deep reinforcement learning (DRL) framework, which learns the unknown page change dynamics on the fly along with the optimal crawling policy. Finally, we run numerical simulations to compare our approach with state-of-the-art algorithms such as static optimisation and Thompson sampling. We observe better performance for DRL.
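The scheduling problem described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' model or their DRL scheme: time is slotted, each page changes independently in a slot with a fixed probability, and a policy returns the indices of the pages to crawl within a per-slot budget. The names `simulate_freshness` and `round_robin`, and the Bernoulli change model, are hypothetical choices made for illustration only.

```python
import random


def simulate_freshness(change_probs, budget, policy, steps=1000, seed=0):
    """Return the average fraction of fresh cached pages over `steps` slots.

    Assumption: page i changes at the source in each slot with
    probability change_probs[i], independently of all other pages.
    """
    rng = random.Random(seed)
    n = len(change_probs)
    fresh = [True] * n  # the cache starts fully synchronised
    total = 0.0
    for t in range(steps):
        # Pages change at the source, making the cached copies stale.
        for i, p in enumerate(change_probs):
            if rng.random() < p:
                fresh[i] = False
        # The policy gives the exact indices of the pages to crawl now.
        for i in policy(t, n, budget):
            fresh[i] = True
        # Freshness is measured after crawling in this slot.
        total += sum(fresh) / n
    return total / steps


def round_robin(t, n, budget):
    """Baseline policy: crawl `budget` pages per slot, cycling in order."""
    start = (t * budget) % n
    return [(start + k) % n for k in range(budget)]


if __name__ == "__main__":
    # Three pages with very different (hypothetical) change probabilities.
    probs = [0.05, 0.5, 0.9]
    print(simulate_freshness(probs, budget=1, policy=round_robin))
```

Any other scheduling rule can be compared against this baseline by passing a different `policy` callable with the same `(t, n, budget)` signature. A learning-based policy would additionally have to estimate the change probabilities online, which this sketch does not attempt.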
Pages: 201-206
Number of pages: 6