An Improved Focused Crawler Based on Text Keyword Extraction

Cited by: 0

Authors:

Zheng, Zhang [1]

Qian, Du [2]

Affiliations:

[1] Wuhan Univ Technol, Dept Informat Technol, Wuhan, Hubei, Peoples R China

[2] Wuhan Univ Technol, Affiliat Dept Informat Technol, Wuhan, Hubei, Peoples R China

Source:

PROCEEDINGS OF 2016 5TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT) | 2016

Keywords:

focused crawler; keyword extract; TF-IDF; syntactic dependency analysis;

DOI:

Not available

Chinese Library Classification number:

TP301 [Theory and methods];

Subject classification code:

081202 ;

Abstract:

To address the shortcomings of traditional focused crawlers, this paper proposes an improved focused crawling method based on syntactic dependency analysis. The method first generates a word collection from the input text with the TF-IDF algorithm and a phrase collection through syntactic dependency analysis. It then evaluates the words and phrases to select the keyword set of the text. Next, the keyword set is submitted to a general-purpose search engine, and part of the search results are used as seed links for the focused crawler. The crawler follows a best-first search policy that ranks each link by the similarity between the keywords and the link's anchor text. Because the keyword extraction method combines TF-IDF with syntactic dependency analysis, its output contains both phrases and words, and including phrases improves the relevance of the seeds and links. Link priority is evaluated by combining the link's anchor text with its surrounding context. Experimental results show that the similarity between the crawled pages and the input text is 14.3 percent higher with this method than with manually chosen keywords. The method performs well for focused crawlers that take text as input, for vertical search engines, and in other application fields.
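The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it covers only the TF-IDF word ranking and the best-first link ordering, and it omits the syntactic dependency analysis that produces phrases. The names `tokenize`, `tfidf_keywords`, `cosine`, and `Frontier`, the smoothed IDF formula, and the use of cosine similarity over token counts are all assumptions made for illustration.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(doc, corpus, top_k=5):
    """Rank the words of `doc` by TF-IDF against a background corpus.

    Assumption: IDF is log(N / df) + 1, where N counts the background
    documents plus the input text itself.
    """
    tokens = tokenize(doc)
    tf = Counter(tokens)
    n_docs = len(corpus) + 1
    doc_sets = [set(tokenize(d)) for d in corpus]
    scores = {}
    for word, count in tf.items():
        # The input text always contains the word, hence the leading 1.
        df = 1 + sum(1 for s in doc_sets if word in s)
        idf = math.log(n_docs / df) + 1.0
        scores[word] = (count / len(tokens)) * idf
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def cosine(a, b):
    """Cosine similarity between two token lists, using raw counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Frontier:
    """Best-first frontier: the link most similar to the keywords pops first."""

    def __init__(self, keywords):
        self.keywords = list(keywords)
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that preserves insertion order

    def push(self, url, anchor_text, context=""):
        """Score a link by its anchor text plus surrounding context."""
        if url in self._seen:
            return
        self._seen.add(url)
        score = cosine(self.keywords, tokenize(anchor_text + " " + context))
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        """Return the highest-priority (url, score) pair."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score
```

In this sketch the extracted keywords would seed a `Frontier`, and each link found on a fetched page would be pushed with its anchor text and nearby context so that the most relevant link is crawled next.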


Pages: 386-390

Page count: 5
