Using Web Pages Dynamicity to Prioritise Web Crawling

Cited by: 1
Authors
Alderratia, Nisreen [1]
Elsheh, Mohammed [1]
Affiliations
[1] Libyan Acad Misurata, Third Ring Rd, Misurata, Libya
Keywords
Web crawler; importance metric; dynamicity
DOI
10.1145/3366750.3366757
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Web crawling collects pages from the web so that they can be indexed and used to return search results that match users' requirements. Web crawlers must also continually revisit pages to keep the search engine database up to date. Because the crawling process is constrained by time limits and network issues, it is essential to determine which pages are the most important and should be recrawled first. This research introduces a method that tells the crawler in what order to recrawl pages it has crawled before, so that the more important and valuable pages are acquired earlier than others. The researchers propose a crawling strategy based on topic similarity combined with the dynamicity of web pages: the crawler downloads relevant pages and recrawls them recursively, and every time a change is detected in a page, that page's counter is incremented. A page that is relevant and changes frequently is therefore considered important and is given high priority in the crawling process. The results indicate that web page dynamicity is an effective basis for prioritising pages in the crawling process: the most dynamic pages, whose content is most likely to have changed, are obtained before the least dynamic ones.
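The prioritisation idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `PageRecord`, `observe`, `recrawl_order`, and the `min_relevance` threshold are hypothetical, the topic-similarity score is assumed to be computed elsewhere, and comparing content hashes is only one possible way to detect a change.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Per-page state kept between recrawls (hypothetical structure)."""
    url: str
    relevance: float        # assumed topic-similarity score in [0, 1]
    last_hash: str = ""     # digest of the content seen on the last visit
    change_count: int = 0   # the per-page change counter from the abstract
    visits: int = 0

    def observe(self, content: str) -> bool:
        """Record one crawl; bump the counter if the content differs."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        # The first visit has nothing to compare against, so it is not a change.
        changed = self.visits > 0 and digest != self.last_hash
        if changed:
            self.change_count += 1
        self.last_hash = digest
        self.visits += 1
        return changed


def recrawl_order(pages, min_relevance=0.5):
    """Keep relevant pages and order them from most to least dynamic.

    Ties on change count are broken by relevance, then by URL; this
    tie-breaking rule is an assumption made for the sketch.
    """
    relevant = [p for p in pages if p.relevance >= min_relevance]
    return sorted(relevant, key=lambda p: (-p.change_count, -p.relevance, p.url))
```

In this sketch a page is recrawled early only if it passes the relevance threshold and has a high change count, which mirrors the abstract's statement that relevant, frequently changing pages receive high priority.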
Pages: 40-44
Number of pages: 5