A Web Page De-duplication Algorithm Based On Data Cleaning

Cited by: 0
Authors
Lin, Jian-ming [1, 2]
Liu, Dong-sheng [3]
Gao, Shi-wen [4, 5]
Chen, Wei [3]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Business Adm, Hangzhou, Zhejiang, Peoples R China
[2] Zhejiang Gongshang Univ, Dept Finance Informat Ctr, Hangzhou, Zhejiang, Peoples R China
[3] Zhejiang Gongshang Univ, Coll Comp Sci & Informat Engn, Hangzhou, Zhejiang, Peoples R China
[4] Nanjing Univ Aeronaut & Astronaut, Coll Mech & Elect Engn, Nanjing 210016, Peoples R China
[5] Aerospa Sci & Technol Corp, Beijing, Peoples R China
Funding
National Science Foundation (USA);
Keywords
web page de-duplication; reshipment statement; data cleaning; feature codes;
DOI
10.1109/JCAI.2009.181
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Duplicated web pages returned by search engines not only waste valuable storage but also add to the burden on users as they browse. Web page de-duplication can effectively improve information retrieval. Following the principle of data cleaning, this paper proposes preprocessing web pages to improve the effectiveness and efficiency of web page de-duplication based on feature codes. Its distinguishing feature is that it ranks the feature codes, which reduces the number of comparisons the system performs as well as its space and time complexity. Experiments show that the method is promising for eliminating duplicated web pages at large scale.
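The abstract does not specify how the feature codes are built or compared, so the following is only a minimal sketch of the general idea under stated assumptions: pages are cleaned first, a feature code is derived from each cleaned page, and the codes are sorted so that duplicates end up adjacent and need no pairwise comparison. The functions `clean_page`, `feature_code`, and `deduplicate`, the `width` parameter, and the choice of an MD5 hash over sentence prefixes are all hypothetical illustrations, not the authors' method.

```python
import hashlib
import re
from itertools import groupby
from typing import Dict, List


def clean_page(html: str) -> str:
    """Assumed cleaning step: drop scripts, styles, tags, and extra whitespace."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def feature_code(text: str, width: int = 5) -> str:
    """Hypothetical feature code: hash of the first `width` characters of each sentence."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？]", text) if s.strip()]
    signature = "|".join(s[:width] for s in sentences)
    return hashlib.md5(signature.encode("utf-8")).hexdigest()


def deduplicate(pages: Dict[str, str]) -> List[List[str]]:
    """Sort pages by feature code so equal codes are adjacent, then group them.

    Sorting costs O(n log n) and replaces the O(n^2) pairwise comparison.
    """
    coded = sorted((feature_code(clean_page(html)), url) for url, html in pages.items())
    return [[url for _, url in group] for _, group in groupby(coded, key=lambda p: p[0])]


if __name__ == "__main__":
    pages = {
        "a": "<html><body><p>Hello world. This is a test.</p></body></html>",
        "b": "<div>Hello   world.</div><script>var x=1;</script><div>This is a test.</div>",
        "c": "<p>Something else entirely.</p>",
    }
    for group in deduplicate(pages):
        print(group)
```

Each returned group holds the URLs that share a feature code; keeping one page per group removes the duplicates. An exact hash match only catches pages whose cleaned text agrees on every sentence prefix, so a real system would likely need a more tolerant code.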
Pages: 544 / +
Page count: 3
Related papers
50 in total
  • [21] An Undirected Graph Traversal based Grouping Prediction Method for Data De-duplication
    Wang, Longxiang
    Zhang, Xingjun
    Zhu, Guofeng
    Zhu, Yueguang
    Dong, Xiaoshe
    2013 14TH ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD 2013), 2013, : 3 - 8
  • [22] Object-based data de-duplication method for OpenXML compound files
    School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100086, China
    Unknown, 101149, China
    Jisuanji Yanjiu yu Fazhan, 7 (1546-1557):
  • [23] A strategy of de-duplication based on the similarity of adjacent chunks
    Zhou B.
    Tan J.-H.
    2017, Taru Publications (20) : 1577 - 1580
  • [24] Introspection-based Memory De-duplication and Migration
    Chiang, Jui-Hao
    Li, Han-Lin
    Chiueh, Tzi-cker
    ACM SIGPLAN NOTICES, 2013, 48 (07) : 51 - 61
  • [25] DATA DE-DUPLICATION WITH ADAPTIVE CHUNKING AND ACCELERATED MODIFICATION IDENTIFYING
    Zhang, Xingjun
    Zhu, Guofeng
    Wang, Endong
    Fowler, Scott
    Dong, Xiaoshe
    COMPUTING AND INFORMATICS, 2016, 35 (03) : 586 - 614
  • [26] Data De-duplication Using Cuckoo Hashing in Cloud Storage
    Sridharan, J.
    Valliyammai, C.
    Karthika, R. N.
    Kulasekaran, L. Nihil
    SOFT COMPUTING IN DATA ANALYTICS, SCDA 2018, 2019, 758 : 707 - 715
  • [27] FBBM: A new backup method with data de-duplication capability
    Yang, Tianming
    Feng, Dan
    Liu, Jingning
    Wan, Yaping
    MUE: 2008 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND UBIQUITOUS ENGINEERING, PROCEEDINGS, 2008, : 30 - +
  • [28] A data de-duplication access framework for solid state drives
    Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, 106, Taiwan
    J. Inf. Sci. Eng., 2012, 5 (941-954):
  • [29] A method for organizing metadata of storage nodes with data de-duplication
    Wang, Guohua
    Zhao, Yuelong
    Li, Tianxiang
    Liao, Jinggui
    Journal of Computational Information Systems, 2014, 10 (09) : 3845 - 3854
  • [30] Semantic Analysis of Big Data by Applying De-duplication techniques
    Garg, Sanjeev
    Bala, Anju
    2016 INTERNATIONAL CONFERENCE ON INVENTIVE COMPUTATION TECHNOLOGIES (ICICT), VOL 3, 2015, : 660 - 665