An Improved Focused Crawler Based on Text Keyword Extraction

Cited by: 0
Authors
Zheng, Zhang [1]
Qian, Du [2]
Affiliations
[1] Wuhan Univ Technol, Dept Informat Technol, Wuhan, Hubei, Peoples R China
[2] Wuhan Univ Technol, Affiliat Dept Informat Technol, Wuhan, Hubei, Peoples R China
Keywords
focused crawler; keyword extraction; TF-IDF; syntactic dependency analysis
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
To address the shortcomings of traditional focused crawlers, this paper proposes an improved focused crawling method based on syntactic dependency analysis. The method first generates a set of candidate words from the input text with the TF-IDF algorithm and a set of candidate phrases through syntactic dependency analysis. It then evaluates the candidate words and phrases together to select the keyword set for the text. Next, the keyword set is submitted to a general-purpose search engine, and part of the search results are used as seed links for the focused crawler. The crawler follows a best-first search policy in which a link's priority is determined by the similarity between the keywords and the link's anchor text combined with its surrounding context. Because the extracted keywords include phrases as well as single words, the relevance of the seeds and of the crawled links is improved. Experimental results show that the similarity between the crawled pages and the input text is 14.3 percent higher with this method than with manually chosen keywords. The method performs well for focused crawlers that take text as input, for vertical search engines, and for related applications.
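
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the syntactic dependency step that produces phrases is omitted, the smoothed IDF formula and cosine similarity are assumed choices, and all names (`tfidf_keywords`, `BestFirstFrontier`, the example URLs) are hypothetical.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(doc, corpus, k=3):
    """Return the k highest-scoring TF-IDF terms of `doc` against `corpus`."""
    tf = Counter(tokenize(doc))
    total = sum(tf.values())
    doc_sets = [set(tokenize(d)) for d in corpus]
    n = len(doc_sets)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for s in doc_sets if term in s)
        idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        scores[term] = (count / total) * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bags of words."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class BestFirstFrontier:
    """Priority queue of links, ordered by similarity to the keyword set."""

    def __init__(self, keywords):
        self.keyword_tokens = tokenize(" ".join(keywords))
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, anchor_text, context=""):
        if url in self._seen:
            return
        self._seen.add(url)
        # Anchor text and surrounding context are scored together.
        score = cosine(self.keyword_tokens, tokenize(anchor_text + " " + context))
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In use, the keywords returned by `tfidf_keywords` would be sent to a search engine to obtain seed links, and each link found on a fetched page would be passed to `push` with its anchor text and nearby text; `pop` then yields the most promising link to fetch next.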
Pages: 386-390
Page count: 5
Related papers
50 in total
  • [31] A Focused Crawler Based on Naive Bayes Classifier
    Wang, Wenxian
    Chen, Xingshu
    Zou, Yongbin
    Wang, Haizhou
    Dai, Zongkun
    2010 THIRD INTERNATIONAL SYMPOSIUM ON INTELLIGENT INFORMATION TECHNOLOGY AND SECURITY INFORMATICS (IITSI 2010), 2010, : 517 - 521
  • [32] An intelligent focused crawler based on genetic algorithm
    Yu, Chun
    Du, Yajun
    Liu, Wenjun
    Journal of Computational Information Systems, 2014, 10 (18): : 8059 - 8066
  • [33] The Research of Ontology-Based Focused Crawler
    Wu, Cong-Cong
    Zhao, Jian-li
    Ma, Hui-lin
    2012 7TH INTERNATIONAL CONFERENCE ON SYSTEM OF SYSTEMS ENGINEERING (SOSE), 2012, : 736 - 738
  • [34] Uyghur-Kazakh-Kirghiz Text Keyword Extraction Based on Morpheme Segmentation
    Parhat, Sardar
    Sattar, Mutallip
    Hamdulla, Askar
    Kadir, Abdurahman
    INFORMATION, 2023, 14 (05)
  • [35] Variance-based features for keyword extraction in Persian and English text documents
    Veisi, H.
    Aflaki, N.
    Parsafard, P.
    SCIENTIA IRANICA, 2020, 27 (03) : 1301 - 1315
  • [36] Research on Cross Language Text Keyword Extraction Based on Information Entropy and TextRank
    Zhang, Xiaoyu
    Wang, Yongbin
    Wu, Lin
    PROCEEDINGS OF 2019 IEEE 3RD INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC 2019), 2019, : 16 - 19
  • [37] Variance-based features for keyword extraction in Persian and English text documents
    Veisi H.
    Aflaki N.
    Parsafard P.
    Scientia Iranica, 2020, 27 (3 D) : 1301 - 1315
  • [38] Chinese Text Keyword Extraction Based on Doc2vec And TextRank
    Wang, Wei
    Li, Xiangshun
    Yu, Sheng
    PROCEEDINGS OF THE 32ND 2020 CHINESE CONTROL AND DECISION CONFERENCE (CCDC 2020), 2020, : 369 - 373
  • [39] LEARNING-based Focused WEB Crawler
    Kumar, Naresh
    Aggarwal, Dhruv
    IETE JOURNAL OF RESEARCH, 2023, 69 (04) : 2037 - 2045
  • [40] Focused crawler for events
    Farag, Mohamed M. G.
    Lee, Sunshin
    Fox, Edward A.
    INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, 2018, 19 (01) : 3 - 19