A Web Page De-duplication Algorithm Based On Data Cleaning

Cited by: 0
Authors
Lin, Jian-ming [1 ,2 ]
Liu, Dong-sheng [3 ]
Gao, Shi-wen [4 ,5 ]
Chen, Wei [3 ]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Business Adm, Hangzhou, Zhejiang, Peoples R China
[2] Zhejiang Gongshang Univ, Dept Finance Informat Ctr, Hangzhou, Zhejiang, Peoples R China
[3] Zhejiang Gongshang Univ, Coll Comp Sci & Informat Engn, Hangzhou, Zhejiang, Peoples R China
[4] Nanjing Univ Aeronaut & Astronaut, Coll Mech & Elect Engn, Nanjing 210016, Peoples R China
[5] Aerospa Sci & Technol Corp, Beijing, Peoples R China
Funding
National Science Foundation (USA);
Keywords
web page de-duplication; reshipment statement; data cleaning; feature codes;
DOI
10.1109/JCAI.2009.181
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Duplicate web pages returned by search engines not only waste valuable storage but also add to the burden of browsing for users. Web page de-duplication can effectively improve information retrieval. Following the principle of data cleaning, this paper proposes preprocessing web pages to improve the effectiveness and efficiency of feature-code-based web page de-duplication. Its distinguishing feature is ranking the feature codes, which reduces the number of comparisons the system performs and, with it, the space and time complexity. Experiments show that the method is promising for eliminating duplicate web pages at large scale.
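The abstract does not specify how pages are cleaned or how feature codes are computed, so the following is only a minimal sketch of the general idea under stated assumptions: cleaning is taken to mean stripping markup and normalising whitespace and case, and the feature code is taken to be an MD5 digest of the cleaned text. The names `clean_page`, `feature_code`, and `deduplicate` are hypothetical and do not come from the paper. Sorting the codes places duplicates next to each other, so one linear pass replaces pairwise comparison of all pages.

```python
import hashlib
import re


def clean_page(html: str) -> str:
    """Assumed pretreatment: drop scripts/styles and tags, normalise text."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip().lower()


def feature_code(text: str) -> str:
    """Assumed feature code: an MD5 digest of the cleaned text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> list[str]:
    """Return one URL per distinct feature code.

    Sorting the (code, url) pairs makes equal codes adjacent, so each
    page is compared only with its predecessor in the sorted order.
    """
    coded = sorted((feature_code(clean_page(html)), url)
                   for url, html in pages.items())
    kept = []
    previous_code = None
    for code, url in coded:
        if code != previous_code:
            kept.append(url)
            previous_code = code
    return kept
```

With this sketch, two pages whose visible text is identical after cleaning collapse to a single entry regardless of their markup. An exact digest only catches pages that are identical after cleaning; near-duplicate detection would need a different feature code.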
Pages: 544 / +
Page count: 3