Web crawling

Cited by: 151
Authors
Olston C. [1]
Najork M. [2]
Affiliations
[1] Yahoo Research, Sunnyvale, CA, 94089
[2] Microsoft Research, Mountain View, CA, 94043
DOI
10.1561/1500000017
Abstract
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work. © 2010 C. Olston and M. Najork.
Pages: 175 - 246
Number of pages: 71
Related papers
50 in total
  • [31] A New Framework for Focused Web Crawling
    PENG Tao
    Wuhan University Journal of Natural Sciences, 2006, (05) : 1394 - 1397
  • [32] A New Hidden Web Crawling Approach
    Saoudi, L.
    Boukerram, A.
    Mhamedi, S.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2015, 6 (10) : 293 - 297
  • [33] PROBABILISTIC MODELS FOR FOCUSED WEB CRAWLING
    Liu, Hongyu
    Milios, Evangelos
    COMPUTATIONAL INTELLIGENCE, 2012, 28 (03) : 289 - 328
  • [34] Sentiment-Focused Web Crawling
    Vural, A. Gural
    Cambazoglu, B. Barla
    Karagoz, Pinar
    ACM TRANSACTIONS ON THE WEB, 2014, 8 (04)
  • [35] CHALLENGES IN WEB CRAWLING FOR DATA COLLECTION
    Cholakov, Georgi
    Doychev, Emil
    Koeva, Svetla
    MATHEMATICS AND INFORMATICS, 2024, 67 (01) : 7 - 17
  • [36] Web-collaborative filtering: recommending music by crawling the Web
    Cohen, WW
    Fan, W
    COMPUTER NETWORKS-THE INTERNATIONAL JOURNAL OF COMPUTER AND TELECOMMUNICATIONS NETWORKING, 2000, 33 (1-6) : 685 - +
  • [37] The Open Web Index: Crawling and Indexing the Web for Public Use
    Hendriksen, Gijs
    Dinzinger, Michael
    Farzana, Sheikh Mastura
    Fathima, Noor Afshan
    Froebe, Maik
    Schmidt, Sebastian
    Zerhoudi, Saber
    Granitzer, Michael
    Hagen, Matthias
    Hiemstra, Djoerd
    Potthast, Martin
    Stein, Benno
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT V, 2024, 14612 : 130 - 143
  • [38] Web Page Segmentation and its Application for Web Information Crawling
    Feng, Hanyang
    Zhang, Wenzhe
    Wu, Hesheng
    Wang, Chong-Jun
    2016 IEEE 28TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2016), 2016 : 598 - 605
  • [39] A Collaborative Environment for Web Crawling and Web Data Analysis in ENEAGRID
    Santomauro, Giuseppe
    Ponti, Giovanni
    Ambrosino, Fiorenzo
    Bracco, Giovanni
    Colavincenzo, Antonio
    De Rosa, Matteo
    Funel, Agostino
    Giammattei, Dante
    Guarnieri, Guido
    Migliori, Silvio
    PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON DATA SCIENCE, TECHNOLOGY AND APPLICATIONS (DATA), 2017 : 287 - 295
  • [40] Web Page Download Scheduling Policies for Green Web Crawling
    Hatzi, Vassiliki
    Cambazoglu, B. Barla
    Koutsopoulos, Iordanis
    2014 22ND INTERNATIONAL CONFERENCE ON SOFTWARE, TELECOMMUNICATIONS AND COMPUTER NETWORKS (SOFTCOM), 2014