Using Web Pages Dynamicity to Prioritise Web Crawling

Cited by: 1

Authors:

Alderratia, Nisreen [1]

Elsheh, Mohammed [1]

Affiliations:

[1] Libyan Acad Misurata, Third Ring Rd, Misurata, Libya

Source:

PROCEEDINGS OF THE 2019 2ND INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND MACHINE INTELLIGENCE (MLMI 2019) | 2019

Keywords:

Web crawler; importance metric; dynamicity;

DOI:

10.1145/3366750.3366757

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Web crawling is the process of collecting pages from the web so that they can be indexed and used to display search results that match users' requirements. Web crawlers must also continually revisit pages to keep the search engine database up to date. Because crawling is constrained by limited time and by network issues, it is essential to determine which pages are most important and should be recrawled first. This research therefore introduces a method that tells the crawler in what order to recrawl previously crawled pages, so that more important and valuable pages are acquired earlier than others. The researchers propose a web crawling strategy based on topic similarity combined with the dynamicity of web pages: the crawler downloads relevant pages and recrawls them recursively, and each time a change appears in a page, that page's counter is incremented. A page that is relevant and changes frequently is thus considered important and is given high priority in the crawling process. The results indicate that web page dynamicity is an effective way to prioritise pages during crawling, so that the most dynamic pages, whose content is most likely to have changed, are obtained before the least dynamic ones.
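
The prioritisation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `PageRecord` type, the `record_visit` and `recrawl_order` functions, the content-hash change test, the relevance threshold of 0.5, and the tie-breaking rule are all assumptions introduced here for illustration. The abstract only states that relevant pages are recrawled, that a per-page counter increases whenever a change appears, and that relevant, frequently changing pages receive high priority.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Hypothetical per-page bookkeeping for a recrawl scheduler."""
    url: str
    relevance: float        # topic-similarity score in [0, 1], assumed precomputed
    change_count: int = 0   # number of revisits on which the content had changed
    last_digest: str = ""   # hash of the content seen on the previous visit


def record_visit(page: PageRecord, content: str) -> bool:
    """Update the change counter after a (re)crawl; return True if a change was seen.

    Change detection by comparing SHA-256 digests is an assumption; the
    source does not say how changes are detected.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # The first visit has nothing to compare against, so it never counts as a change.
    changed = bool(page.last_digest) and digest != page.last_digest
    if changed:
        page.change_count += 1
    page.last_digest = digest
    return changed


def recrawl_order(pages, relevance_threshold=0.5):
    """Return relevant pages, most dynamic first.

    Pages below the (assumed) relevance threshold are dropped. Ties on the
    change counter are broken by relevance, then by URL for determinism.
    """
    relevant = [p for p in pages if p.relevance >= relevance_threshold]
    return sorted(relevant, key=lambda p: (-p.change_count, -p.relevance, p.url))
```

Under these assumptions, a relevant page whose content differed on two revisits is scheduled ahead of a relevant page that changed once, while an off-topic page is never scheduled, however often it changes.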


Pages: 40-44

Page count: 5
