Web crawling

Cited by: 151
Authors
Olston C. [1]
Najork M. [2]
Affiliations
[1] Yahoo Research, Sunnyvale, CA, 94089
[2] Microsoft Research, Mountain View, CA, 94043
DOI
10.1561/1500000017
Abstract
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, there are in fact many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work. © 2010 C. Olston and M. Najork.
Pages: 175-246
Page count: 71