Efficient Multi-threaded Crawling Using In Memory Data Structures

Cited by: 0
Author
Abdeen, Mohammad A. R. [1]
Affiliation
[1] Islamic Univ Madinah, Fac Comp & Informat Syst, Madinah, Saudi Arabia
Keywords
Web Crawlers; Distributed Applications; Multi-threading; In-memory Data Structures; Performance Evaluation;
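DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract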
Crawling the internet is an essential task for any search engine. A crawler is a software program that sends HTTP requests to web servers across the internet and downloads their content. As the internet has grown explosively over the last decade, efficient parallel crawlers have become a necessity. One factor that degrades crawler performance is the disk access incurred every time a file is written. Because crawling the web requires downloading tens or hundreds of millions of webpages, much time is consumed by disk writes, largely due to seek times. This work presents an efficient multi-threaded crawler that incorporates an in-memory data structure to reduce the overall disk write time. The results show that, at selected sizes of the in-memory data structure, the proposed technique can increase throughput by about 50% over a conventional multi-threaded crawler with no in-memory data structure. In addition, the results show that this design can achieve an average crawl speed of 22 pages/sec, which exceeds previously reported work.
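The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: worker threads hand downloaded pages to a shared in-memory buffer, and the buffer is written to disk in one batch once it reaches a chosen size, instead of one disk write per page. Every name here (`BufferedPageStore`, `fake_fetch`, `crawl`, `run_demo`), the batch file layout, and the `capacity` parameter are assumptions made for illustration and are not taken from the paper. The download step is replaced by a stub so the sketch runs offline.

```python
import os
import queue
import tempfile
import threading


class BufferedPageStore:
    """Hypothetical in-memory buffer that writes pages to disk in batches."""

    def __init__(self, out_dir, capacity):
        self.out_dir = out_dir
        self.capacity = capacity   # pages held in memory before a flush (assumed knob)
        self.buffer = []           # list of (url, content) pairs
        self.lock = threading.Lock()
        self.flush_count = 0       # number of batched disk writes so far
        self.pages_written = 0

    def add(self, url, content):
        # Called by worker threads; the lock keeps the buffer consistent.
        with self.lock:
            self.buffer.append((url, content))
            if len(self.buffer) >= self.capacity:
                self._flush_locked()

    def flush(self):
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        # Caller must hold self.lock. Writes the whole buffer as one file.
        if not self.buffer:
            return
        path = os.path.join(self.out_dir, f"batch_{self.flush_count:05d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            for url, content in self.buffer:
                f.write(f"{url}\t{len(content)}\n{content}\n")
        self.pages_written += len(self.buffer)
        self.flush_count += 1
        self.buffer.clear()


def fake_fetch(url):
    """Stand-in for an HTTP download (assumption: no real network access)."""
    return f"<html>{url}</html>"


def crawl(urls, store, fetch=fake_fetch, num_threads=4):
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return
            store.add(url, fetch(url))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    store.flush()  # write whatever is still held in memory


def run_demo(num_pages=25, capacity=10):
    with tempfile.TemporaryDirectory() as out_dir:
        store = BufferedPageStore(out_dir, capacity)
        crawl([f"http://example.com/{i}" for i in range(num_pages)], store)
        return store.pages_written, store.flush_count, len(os.listdir(out_dir))
```

With the defaults, `run_demo()` stores 25 pages using 3 disk writes rather than 25. This sketch holds the lock during the write for simplicity, which blocks other threads while a batch is flushed; a real crawler would more likely hand the full buffer to a dedicated writer thread.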
Pages: 88-92
Page count: 5