Using Web Pages Dynamicity to Prioritise Web Crawling

Cited by: 1
Authors
Alderratia, Nisreen [1]
Elsheh, Mohammed [1]
Affiliations
[1] Libyan Acad Misurata, Third Ring Rd, Misurata, Libya
Keywords
Web crawler; importance metric; dynamicity
DOI
10.1145/3366750.3366757
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Web crawling is the process of collecting pages from the web so that they can be indexed and used to return search results that match users' requirements. Web crawlers must also continually revisit pages to keep the search engine database up to date. Because crawling is constrained by time limits and network issues, it is essential to determine which pages are most important and should be recrawled first. This research therefore introduces a method that tells the crawler in what order to recrawl pages it has already crawled, so that more important and valuable pages are acquired earlier than others. The researchers propose a crawling strategy based on topic similarity combined with the dynamicity of web pages: the crawler downloads relevant pages and recrawls them recursively, and every time a change is detected in a page, that page's counter is incremented. A page that is relevant and changes frequently is therefore considered important and is given high priority in the crawling process. The results indicate that web pages' dynamicity is an effective basis for prioritising pages in the crawling process: the most dynamic pages, whose content is most likely to have changed, are obtained before the least dynamic ones.
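The prioritisation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how relevance is scored, how changes are detected, or how ties are broken. The names `PageRecord`, `record_visit`, `recrawl_order` and `RELEVANCE_THRESHOLD`, the use of a SHA-256 content hash for change detection, and the 0.5 cutoff are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical cutoff; the abstract gives no threshold value.
RELEVANCE_THRESHOLD = 0.5


@dataclass
class PageRecord:
    url: str
    relevance: float        # topic-similarity score in [0, 1], assumed computed elsewhere
    content_hash: str = ""  # digest of the last downloaded copy ("" = never crawled)
    change_count: int = 0   # incremented each time a recrawl finds different content


def record_visit(page: PageRecord, content: str) -> bool:
    """Update the page after a (re)crawl; return True if its content changed."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # The first download is not counted as a change (an assumption).
    changed = page.content_hash != "" and digest != page.content_hash
    if changed:
        page.change_count += 1
    page.content_hash = digest
    return changed


def recrawl_order(pages: list[PageRecord]) -> list[PageRecord]:
    """Keep relevant pages only, most frequently changed first."""
    relevant = [p for p in pages if p.relevance >= RELEVANCE_THRESHOLD]
    # Ties on change_count are broken by relevance (also an assumption).
    return sorted(relevant, key=lambda p: (-p.change_count, -p.relevance))


if __name__ == "__main__":
    news = PageRecord("https://example.org/news", relevance=0.9)
    about = PageRecord("https://example.org/about", relevance=0.8)
    for snapshot in ("v1", "v2", "v3"):
        record_visit(news, snapshot)
    for snapshot in ("static", "static", "static"):
        record_visit(about, snapshot)
    for page in recrawl_order([about, news]):
        print(page.url, page.change_count)
```

In this sketch a page that is both relevant and frequently changing rises to the front of the recrawl queue, while pages below the relevance cutoff are dropped regardless of how often they change.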
Pages: 40-44
Page count: 5