Deep Reinforcement Learning for Web Crawling

Cited by: 4
Authors
Avrachenkov, Konstantin [1]
Borkar, Vivek [2]
Patil, Kishor [1]
Affiliations
[1] Inria Sophia Antipolis, F-06902 Valbonne, France
[2] Indian Inst Technol, Mumbai 400076, Maharashtra, India
Keywords
Reinforcement Learning; Adaptive Web Crawling; Thompson Sampling; Multi-armed Restless Bandits
DOI
10.1109/ICC54714.2021.9703160
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
A search engine uses a web crawler to crawl pages from the World Wide Web (WWW) and aims to keep its local cache as fresh as possible. Unfortunately, the rates at which different pages change are highly non-uniform and, in many real-life scenarios, unknown. In addition, finite bandwidth and possible server restrictions on crawling frequency make it very difficult for the crawler to find the optimal scheduling policy that maximises the freshness of the local cache. We model this problem in a multi-armed restless bandits framework, where each arm represents a web page or an aggregate of statistically identical web pages. The objective is to find a scheduling policy that gives the exact indices of the pages to be crawled at a particular instant. We provide an online learning scheme using a deep reinforcement learning (DRL) framework, which learns the unknown page change dynamics on the fly along with the optimal crawling policy. Finally, we run numerical simulations to compare our approach with state-of-the-art algorithms such as static optimisation and Thompson sampling. We observe better performance for DRL.
Pages: 201-206
Page count: 6
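
The scheduling problem described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' DRL scheme: it assumes discrete time steps, independent Bernoulli page changes, a fixed crawl budget per step, and a simple hand-written baseline policy. The names `simulate_freshness` and `oldest_first` are hypothetical and chosen only for this illustration.

```python
import random


def simulate_freshness(change_probs, budget, steps, choose, seed=0):
    """Return the average fraction of fresh cached pages over `steps` steps.

    change_probs: assumed per-step change probability of each page.
    budget: maximum number of pages the crawler may fetch per step.
    choose: policy mapping (ages, budget) to a list of page indices.
    """
    rng = random.Random(seed)
    n = len(change_probs)
    fresh = [True] * n   # whether the cached copy matches the remote page
    age = [0] * n        # steps elapsed since each page was last crawled
    total = 0.0
    for _ in range(steps):
        # Remote pages change independently; a change makes the cache stale.
        for i, p in enumerate(change_probs):
            if rng.random() < p:
                fresh[i] = False
        for i in range(n):
            age[i] += 1
        # The policy sees only the ages, not the hidden fresh/stale state.
        for i in list(choose(age, budget))[:budget]:
            fresh[i] = True
            age[i] = 0
        total += sum(fresh) / n
    return total / steps


def oldest_first(age, budget):
    """Baseline policy: crawl the pages that were crawled longest ago."""
    return sorted(range(len(age)), key=lambda i: -age[i])[:budget]


if __name__ == "__main__":
    probs = [0.9, 0.5, 0.1, 0.05]
    print(simulate_freshness(probs, budget=2, steps=1000, choose=oldest_first))
```

A learned policy would replace `oldest_first` and would have to estimate the change probabilities online, since the abstract notes that they are unknown to the crawler.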