Fixing the Threshold for Effective Detection of Near Duplicate Web Documents in Web Crawling

Cited by: 0
Authors
Narayana, V. A. [1]
Premchand, P. [1]
Govardhan, A. [1]
Affiliation
[1] CMR Coll Engn & Technol, Dept Comp Sci & Engn, Hyderabad, Andhra Pradesh, India
Keywords
Fingerprint; Similarity score; Near-duplicate; Web crawling; Threshold
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The rapid growth of the World Wide Web in recent years has made web crawling increasingly important. The sheer volume of web documents poses major challenges for web search engines and makes their results less relevant to users. The abundance of duplicate and near-duplicate web documents adds further overhead for search engines and critically degrades their performance and quality; such documents must be removed so that users receive relevant results for their queries. In this paper, we present a novel and efficient approach for detecting near-duplicate web pages during web crawling: keywords are extracted from the crawled pages, and a similarity score between two pages is calculated. Documents whose similarity score exceeds a threshold value are considered near duplicates. In this paper, we fix that threshold value.
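The pipeline described in the abstract (extract keywords, score a pair of pages, compare against a threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not define the paper's similarity score or give the fixed threshold value, so the sketch substitutes Jaccard similarity over keyword sets and leaves the threshold as a caller-supplied parameter. The names `extract_keywords`, `similarity_score`, `is_near_duplicate`, and `STOPWORDS` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "from"}


def extract_keywords(text, top_k=20):
    """Return up to top_k of the most frequent non-stopword tokens as a set."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return {word for word, _ in counts.most_common(top_k)}


def similarity_score(keywords_a, keywords_b):
    """Jaccard similarity of two keyword sets (a stand-in for the paper's score)."""
    if not keywords_a and not keywords_b:
        return 1.0
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)


def is_near_duplicate(page_a, page_b, threshold):
    """Flag two pages as near duplicates when their score exceeds the threshold."""
    score = similarity_score(extract_keywords(page_a), extract_keywords(page_b))
    return score > threshold


page_a = "Cheap flights to Hyderabad: book cheap flights online today"
page_b = "Cheap flights to Hyderabad: book cheap flights online now"
score = similarity_score(extract_keywords(page_a), extract_keywords(page_b))
```

The two sample pages share five of seven distinct keywords, so whether they count as near duplicates depends entirely on the chosen threshold, which is the quantity the paper sets out to fix.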
Pages: 169-180
Page count: 12