A Web Page De-duplication Algorithm Based On Data Cleaning

Cited by: 0
Authors
Lin, Jian-ming [1, 2]
Liu, Dong-sheng [3]
Gao, Shi-wen [4, 5]
Chen, Wei [3 ]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Business Adm, Hangzhou, Zhejiang, Peoples R China
[2] Zhejiang Gongshang Univ, Dept Finance Informat Ctr, Hangzhou, Zhejiang, Peoples R China
[3] Zhejiang Gongshang Univ, Coll Comp Sci & Informat Engn, Hangzhou, Zhejiang, Peoples R China
[4] Nanjing Univ Aeronaut & Astronaut, Coll Mech & Elect Engn, Nanjing 210016, Peoples R China
[5] Aerospa Sci & Technol Corp, Beijing, Peoples R China
Funding
National Science Foundation (United States);
Keywords
web page de-duplication; reshipment statement; data cleaning; feature codes;
DOI
10.1109/JCAI.2009.181
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Duplicate web pages returned by search engines not only waste valuable storage but also add to the burden on users as they browse. Web page de-duplication can effectively improve information retrieval. This paper proposes pretreating web pages, following the principles of data cleaning, to improve the effectiveness and efficiency of feature-code-based web page de-duplication. Its distinguishing feature is that the feature codes are ranked, which reduces the number of comparisons the system performs as well as its space and time complexity. Experiments show that the method is promising for eliminating duplicate web pages at large scale.
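The pipeline outlined in the abstract (clean the page, derive a feature code, rank the codes, compare) can be illustrated with a minimal sketch. This is not the authors' algorithm: the record does not say how feature codes are built, so the sketch assumes a SHA-1 digest of the cleaned text, and the names `clean_page`, `feature_code`, and `deduplicate` are hypothetical. It only shows why ranking helps: once the codes are sorted, duplicates sit next to each other, so a single pass over neighbours replaces pairwise comparison of all pages.

```python
import hashlib
import re
from itertools import groupby


def clean_page(html):
    """Pretreatment (assumed form of data cleaning): drop script/style
    blocks and markup, then normalise whitespace and case."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def feature_code(text):
    """Assumption: the feature code is a SHA-1 digest of the cleaned text.
    The paper's actual feature-code construction is not given here."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Group URLs whose cleaned content yields the same feature code.

    `pages` maps URL -> raw HTML. Sorting the (code, url) pairs places
    equal codes side by side, so grouping needs one linear pass after an
    O(n log n) sort instead of O(n^2) pairwise comparisons.
    """
    coded = sorted((feature_code(clean_page(html)), url)
                   for url, html in pages.items())
    return [[url for _, url in group]
            for _, group in groupby(coded, key=lambda pair: pair[0])]
```

Each returned group lists the URLs treated as duplicates of one another; keeping the first URL of every group gives a de-duplicated set. An exact digest only catches pages that are identical after cleaning, so near-duplicate detection would need a different feature code.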
Pages: 544 / +
Page count: 3