A Web Page De-duplication Algorithm Based On Data Cleaning

Cited by: 0
Authors
Lin, Jian-ming [1 ,2 ]
Liu, Dong-sheng [3 ]
Gao, Shi-wen [4 ,5 ]
Chen, Wei [3 ]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Business Adm, Hangzhou, Zhejiang, Peoples R China
[2] Zhejiang Gongshang Univ, Dept Finance Informat Ctr, Hangzhou, Zhejiang, Peoples R China
[3] Zhejiang Gongshang Univ, Coll Comp Sci & Informat Engn, Hangzhou, Zhejiang, Peoples R China
[4] Nanjing Univ Aeronaut & Astronaut, Coll Mech & Elect Engn, Nanjing 210016, Peoples R China
[5] Aerospa Sci & Technol Corp, Beijing, Peoples R China
Funding
National Science Foundation (USA);
Keywords
web page de-duplication; reshipment statement; data cleaning; feature codes;
DOI
10.1109/JCAI.2009.181
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Duplicate web pages returned by search engines not only waste valuable storage but also add to users' browsing burden. Web page de-duplication can effectively improve information retrieval. This paper proposes preprocessing web pages, following the principles of data cleaning, to improve the effectiveness and efficiency of feature-code-based web page de-duplication. Its main feature is that the feature codes are ranked, which reduces the number of comparisons the system performs as well as its space and time complexity. Experiments show that the method is promising for eliminating duplicate web pages at large scale.
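The abstract does not say how a feature code is built or how the codes are ranked, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes that "data cleaning" means stripping markup and normalizing whitespace and case, that a feature code is made from the last few characters of each sentence, and that "ranking" can be read as sorting the codes so that identical ones become adjacent. The names `clean`, `feature_code`, and `find_duplicates` are hypothetical.

```python
import re
from itertools import groupby
from typing import Dict, List


def clean(html: str) -> str:
    """Assumed cleaning step: drop script/style blocks and tags, then
    collapse whitespace and lowercase the remaining text."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def feature_code(text: str, width: int = 3) -> str:
    """Hypothetical feature code: the last `width` characters of every
    sentence, joined with '|'. The paper's real construction may differ."""
    sentences = [s.strip() for s in re.split(r"[.!?。!?]", text) if s.strip()]
    return "|".join(s[-width:] for s in sentences)


def find_duplicates(pages: Dict[str, str]) -> List[List[str]]:
    """Group page identifiers whose feature codes are identical.

    Sorting the (code, id) pairs places equal codes next to each other,
    so one linear pass replaces comparing every page with every other page.
    """
    coded = sorted((feature_code(clean(html)), page_id)
                   for page_id, html in pages.items())
    groups = []
    for code, items in groupby(coded, key=lambda pair: pair[0]):
        ids = [page_id for _, page_id in items]
        if code and len(ids) > 1:  # skip pages with no usable text
            groups.append(ids)
    return groups
```

With n pages, the sort costs O(n log n) code comparisons and the grouping pass is linear, whereas comparing all pairs directly would take O(n^2) comparisons. This sketch only catches pages whose codes match exactly; tolerating small differences between near-duplicates would need a similarity measure that the abstract does not describe.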
Pages: 544 / +
Page count: 3
Related papers
50 in total
  • [1] The Research of Web Page De-duplication Based on Web Pages Reshipment Statement
    Wang, Min-yan
    Liu, Dong-sheng
    FIRST INTERNATIONAL WORKSHOP ON DATABASE TECHNOLOGY AND APPLICATIONS, PROCEEDINGS, 2009, : 271 - 274
  • [2] Optimization for data de-duplication algorithm based on file content
    Xuejun NIE
    Leihua QIN
    Jingli ZHOU
    Ke LIU
    Jianfeng ZHU
    Yu WANG
    Frontiers of Optoelectronics in China, 2010, 3 (03) : 308 - 316
  • [3] Optimization for data de-duplication algorithm based on file content
    Nie, Xuejun
    Qin, Leihua
    Zhou, Jingli
    Liu, Ke
    Zhu, Jianfeng
    Wang, Yu
    FRONTIERS OF OPTOELECTRONICS, 2010, 3 (03) : 308 - 316
  • [4] Application for data de-duplication algorithm based on mobile devices
    Xingchen, Ge
    Ning, Deng
    Jian, Yin
    Journal of Networks, 2013, 8 (11) : 2498 - 2505
  • [5] Secure Static Data De-duplication
    Pawar, Rohit
    Zanwar, Payal
    Bora, Shruti
    Kullkarni, Shweta
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2016, 16 (03): : 69 - 73
  • [6] User-aware de-duplication algorithm
    School of Computer, Wuhan University, Wuhan
    430072, China
    Unknown
    518219, China
    Unknown
    410000, China
    Ruan Jian Xue Bao, 10 (2581-2595):
  • [7] Research on Chunking Algorithms of Data De-duplication
    Bo, Cai
    Li, Zhang Feng
    Can, Wang
    PROCEEDINGS OF THE 2012 INTERNATIONAL CONFERENCE ON COMMUNICATION, ELECTRONICS AND AUTOMATION ENGINEERING, 2013, 181 : 1019 - 1025
  • [8] Data De-duplication on Similar File Detection
    Zhu, Yueguang
    Zhang, Xingjun
    Zhao, Runting
    Dong, Xiaoshe
    2014 EIGHTH INTERNATIONAL CONFERENCE ON INNOVATIVE MOBILE AND INTERNET SERVICES IN UBIQUITOUS COMPUTING (IMIS), 2014, : 66 - 73
  • [9] An incremental clustering scheme for data de-duplication
    Gianni Costa
    Giuseppe Manco
    Riccardo Ortale
    Data Mining and Knowledge Discovery, 2010, 20 : 152 - 187
  • [10] An incremental clustering scheme for data de-duplication
    Costa, Gianni
    Manco, Giuseppe
    Ortale, Riccardo
    DATA MINING AND KNOWLEDGE DISCOVERY, 2010, 20 (01) : 152 - 187