A Web Page De-duplication Algorithm Based On Data Cleaning

Cited by: 0

Authors:

Lin, Jian-ming [1,2]

Liu, Dong-sheng [3]

Gao, Shi-wen [4,5]

Chen, Wei [3]

Affiliations:

[1] Zhejiang Gongshang Univ, Sch Business Adm, Hangzhou, Zhejiang, Peoples R China

[2] Zhejiang Gongshang Univ, Dept Finance Informat Ctr, Hangzhou, Zhejiang, Peoples R China

[3] Zhejiang Gongshang Univ, Coll Comp Sci & Informat Engn, Hangzhou, Zhejiang, Peoples R China

[4] Nanjing Univ Aeronaut & Astronaut, Coll Mech & Elect Engn, Nanjing 210016, Peoples R China

[5] Aerospa Sci & Technol Corp, Beijing, Peoples R China

Source:

FIRST IITA INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, PROCEEDINGS | 2009

Funding:

National Science Foundation (USA);

Keywords:

web page de-duplication; reshipment statement; data cleaning; feature codes;

DOI:

10.1109/JCAI.2009.181

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Duplicated web pages returned by search engines not only waste valuable storage but also add to users' browsing burden. Web page de-duplication can effectively improve information retrieval. Following the principles of data cleaning, this paper proposes preprocessing web pages to improve the effectiveness and efficiency of web page de-duplication based on feature codes. Its distinguishing feature is that the feature codes are ranked, which reduces the number of comparisons the system performs and lowers its space and time complexity. Experiments show that the method is promising for eliminating duplicated web pages at large scale.
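The abstract does not specify how the feature code is built or how pages are cleaned, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical cleaning step (`clean_page`, which strips markup and normalizes whitespace and case) and a hypothetical feature code (`feature_code`, a SHA-1 digest of the cleaned text). Sorting the codes places equal codes next to each other, so duplicates can be found in one linear pass rather than by comparing every pair of pages.

```python
import hashlib
import re


def clean_page(html: str) -> str:
    """Hypothetical cleaning step: drop script/style blocks and tags,
    then normalize whitespace and case."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def feature_code(text: str) -> str:
    """Assumed feature code: SHA-1 hex digest of the cleaned text.
    The paper's actual feature code is not given in the abstract."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose cleaned pages share a feature code.

    Sorting the (code, url) pairs makes equal codes adjacent, so one
    linear scan finds all groups after an O(n log n) sort.
    """
    coded = sorted((feature_code(clean_page(html)), url)
                   for url, html in pages.items())
    groups: list[list[str]] = []
    previous = None
    for code, url in coded:
        if code == previous:
            groups[-1].append(url)   # same code as the previous page
        else:
            groups.append([url])     # first page with this code
            previous = code
    return groups
```

With an exact hash as the feature code, this sketch only catches pages whose cleaned text is identical; detecting near-duplicates would need a different feature code, which the abstract does not describe.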

Citation

Pages: 544 / +

Page count: 3
