Using Web Pages Dynamicity to Prioritise Web Crawling

Cited by: 1
Authors
Alderratia, Nisreen [1]
Elsheh, Mohammed [1]
Affiliations
[1] Libyan Acad Misurata, Third Ring Rd, Misurata, Libya
Keywords
Web crawler; importance metric; dynamicity
DOI
10.1145/3366750.3366757
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Web crawling is the process of collecting pages from the web so that they can be indexed and used to display search results that match users' requirements. Web crawlers must also continually revisit pages to keep the search engine database up to date. Because crawling is constrained by time and network limitations, it is essential to determine which pages are most important and should be recrawled first. This research therefore introduces a method that tells the crawler in what order to recrawl previously crawled pages, so that more important and valuable pages are acquired earlier than others. The researchers propose a crawling strategy based on topic similarity combined with the dynamicity of web pages: the crawler downloads relevant pages and recrawls them recursively, and each time a change appears in a page, that page's counter is incremented. A page that is relevant and changes frequently is therefore considered important and is given high priority in the crawling process. The results indicate that web page dynamicity is an effective basis for prioritising pages in the crawling process, so that the most dynamic pages, whose content is most likely to have changed, are obtained before the least dynamic ones.
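The prioritisation idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `PageRecord` class, the `record_crawl` and `recrawl_order` functions, the hash-based change detection, the 0.5 relevance threshold, and the example URLs are all assumptions made for illustration. The abstract states only that relevant pages are recrawled, that a counter is incremented on each detected change, and that relevant, frequently changing pages receive higher priority.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    relevance: float          # topic-similarity score in [0, 1]; assumed precomputed
    last_digest: str = ""     # hash of the content seen on the previous crawl
    change_count: int = 0     # incremented each time a recrawl sees new content


def record_crawl(page: PageRecord, content: str) -> bool:
    """Update the page's change counter; return True if a change was detected."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # The first crawl has nothing to compare against, so it never counts as a change.
    changed = page.last_digest != "" and digest != page.last_digest
    if changed:
        page.change_count += 1
    page.last_digest = digest
    return changed


def recrawl_order(pages, relevance_threshold=0.5):
    """Return relevant pages, most frequently changed first (ties: higher relevance)."""
    relevant = [p for p in pages if p.relevance >= relevance_threshold]
    return sorted(relevant, key=lambda p: (-p.change_count, -p.relevance))


if __name__ == "__main__":
    news = PageRecord("https://example.org/news", relevance=0.9)
    about = PageRecord("https://example.org/about", relevance=0.8)
    off_topic = PageRecord("https://example.org/ads", relevance=0.1)

    # Three simulated crawl rounds: only the news page changes between rounds.
    for round_no in range(3):
        record_crawl(news, f"headline {round_no}")
        record_crawl(about, "static text")
        record_crawl(off_topic, "static ad")

    print([p.url for p in recrawl_order([about, off_topic, news])])
```

In this sketch the off-topic page is filtered out by the assumed relevance threshold, and the frequently changing news page is scheduled ahead of the static page. A real crawler would likely need a change test that is more robust than an exact content hash, for example one that ignores timestamps or advertisements.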
Pages: 40-44
Number of pages: 5