Web crawling

Cited by: 151
Authors
Olston C. [1]
Najork M. [2]
Affiliations
[1] Yahoo Research, Sunnyvale, CA, 94089
[2] Microsoft Research, Mountain View, CA, 94043
DOI
10.1561/1500000017
Abstract
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work. © 2010 C. Olston and M. Najork.
Pages: 175-246
Page count: 71
Related papers
50 in total
  • [21] Using Web Pages Dynamicity to Prioritise Web Crawling
    Alderratia, Nisreen
    Elsheh, Mohammed
    PROCEEDINGS OF THE 2019 2ND INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND MACHINE INTELLIGENCE (MLMI 2019), 2019, : 40 - 44
  • [22] Utilizing RSS feeds for crawling the Web
    Adam, George
    Bouras, Christos
    Poulopoulos, Vassilis
    2009 FOURTH INTERNATIONAL CONFERENCE ON INTERNET AND WEB APPLICATIONS AND SERVICES, 2009, : 211 - 216
  • [23] NewNet-Crawling Deep Web
    Rai, Pradeep
    Singh, Shubha
    Yadav, Abhishek Singh
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2010, 10 (05): : 129 - 132
  • [24] Predictive Crawling for Commercial Web Content
    Han, Shuguang
    Brodowsky, Bernhard
    Gajda, Przemek
    Novikov, Sergey
    Bendersky, Michael
    Najork, Marc
    Dua, Robin
    Popescul, Alexandrin
    WEB CONFERENCE 2019: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2019), 2019, : 627 - 637
  • [25] Parallel web crawling for customer analytics
    Zhou, Jinfeng
    Wei, Jinliang
    Ratnam, Malini
    Xu, Bugao
    TEXTILE RESEARCH JOURNAL, 2025,
  • [26] Information Retrieval in Web Crawling: A Survey
    Saini, Chandni
    Arora, Vinay
    2016 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2016, : 2635 - 2643
  • [27] Deep Reinforcement Learning for Web Crawling
    Avrachenkov, Konstantin
    Borkar, Vivek
    Patil, Kishor
    2021 SEVENTH INDIAN CONTROL CONFERENCE (ICC), 2021, : 201 - 206
  • [28] A Novel Crawling Algorithm for Web Pages
    Golshani, Mohammad Amin
    Derhami, Vali
    ZarehBidoki, AliMohammad
    INFORMATION RETRIEVAL TECHNOLOGY, 2011, 7097 : 263 - 272
  • [29] An Efficient Focused Web Crawling Approach
    Aggarwal, Kompal
    SOFTWARE ENGINEERING (CSI 2015), 2019, 731 : 131 - 138
  • [30] Crawling images with web browser support
    Vagac, Michal
    Melichercik, Miroslav
    Marko, Matus
    Trhan, Peter
    Michalikova, Alzbeta
    Kliment, Rene
    Drapka, Radoslav
    2015 IEEE 13TH INTERNATIONAL SCIENTIFIC CONFERENCE ON INFORMATICS, 2015, : 286 - 289