Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering

Cited by: 3
Authors
O'Shaughnessy, Pauline [1]
Lin, Yan-Xia [1]
Affiliations
[1] Univ Wollongong, Sch Math & Appl Stat, Wollongong, NSW 2522, Australia
Keywords
data masking; multiplicative noise; data mining; sample size calculation
DOI
10.3390/math10244744
Chinese Library Classification
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
In the age of data, data mining provides feasible tools for handling large datasets that combine data from multiple sources. However, there is limited research on retrieving statistical information when the data are confidential and cannot be shared directly. In this paper, we address this problem and propose a framework for analysing data from multiple sources without revealing the true values. The proposed framework has three steps. First, data custodians individually mask their data before publishing; then, the collection of masked data is used to reconstruct the density function of the original dataset, from which resampled values are generated; last, existing data mining techniques are applied directly to the resampled data. The framework relies on reconstructing the original density function from noise-masked data using the moment-based density estimation method. Simulation studies show that the proposed framework performs well: analysis results from the resampled data are comparable to those from the original data when the density of the original data is estimated well. The framework is demonstrated in a clustering analysis of a real-life Australian soybean dataset. Results from the k-means algorithm with two and three fitted clusters show that cluster analysis on the resampled data closely replicates that on the original data.
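The three steps described in the abstract can be illustrated with a small univariate sketch. This is not the authors' implementation: the uniform multiplicative noise on `[LOW, HIGH]`, the publicly known support `[A, B]`, the Legendre-series form of the moment-based density estimate, the synthetic two-component Beta data, and every function name below are assumptions made purely for illustration. The sketch relies on the identity E[X^k] = E[Y^k] / E[C^k], which holds when the masked value is Y = X·C with C independent of X.

```python
import math
import numpy as np
from numpy.polynomial import legendre

LOW, HIGH = 0.8, 1.2   # assumed (public) range of the multiplicative noise C
A, B = 0.0, 10.0       # assumed (public) support of the original variable X
ORDER = 6              # highest moment used in the density estimate

def mask(x, rng):
    """Step 1: each custodian publishes y = x * c, with c ~ Uniform(LOW, HIGH)."""
    return x * rng.uniform(LOW, HIGH, size=x.shape)

def noise_moment(k):
    """Exact k-th moment of Uniform(LOW, HIGH)."""
    return (HIGH ** (k + 1) - LOW ** (k + 1)) / ((k + 1) * (HIGH - LOW))

def recover_moments(y):
    """Step 2a: estimate E[X^k] from masked data via E[Y^k] / E[C^k]."""
    return np.array([np.mean(y ** k) / noise_moment(k) for k in range(ORDER + 1)])

def legendre_coefficients(m):
    """Step 2b: Legendre-series density coefficients on t = alpha + beta * x in [-1, 1]."""
    beta = 2.0 / (B - A)
    alpha = -(A + B) / (B - A)
    # Moments of t from moments of x (binomial expansion).
    mt = np.array([
        sum(math.comb(k, j) * alpha ** (k - j) * beta ** j * m[j] for j in range(k + 1))
        for k in range(ORDER + 1)
    ])
    lam = np.empty(ORDER + 1)
    for k in range(ORDER + 1):
        power_coefs = legendre.leg2poly([0.0] * k + [1.0])  # P_k in the power basis
        lam[k] = (2 * k + 1) / 2.0 * np.dot(power_coefs, mt[: k + 1])
    return lam

def resample(lam, n, rng):
    """Step 2c: draw values from the estimated density, evaluated on a grid."""
    grid = np.linspace(-1.0, 1.0, 2001)
    dens = np.clip(legendre.legval(grid, lam), 0.0, None)  # drop negative lobes
    t = rng.choice(grid, size=n, p=dens / dens.sum())
    return A + (t + 1.0) * (B - A) / 2.0

def kmeans_1d(x, k=2, iters=50):
    """Step 3: plain Lloyd's k-means in one dimension; returns sorted centres."""
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() for j in range(k)])
    return np.sort(centers)

rng = np.random.default_rng(0)
n = 50_000
pick = rng.random(n) < 0.5
x = 10.0 * np.where(pick, rng.beta(3, 8, n), rng.beta(8, 3, n))  # synthetic bimodal data

y = mask(x, rng)                      # what the analyst actually receives
moments = recover_moments(y)
x_new = resample(legendre_coefficients(moments), n, rng)

centers_orig = kmeans_1d(x)           # reference result, normally unavailable
centers_new = kmeans_1d(x_new)
print("centres from original data: ", centers_orig)
print("centres from resampled data:", centers_new)
```

The noise range is kept narrow here because higher-order moment estimates become unstable as the noise variance grows, so a wider range would call for a larger sample or a lower `ORDER`. Multivariate data, such as the soybean example, would need a joint density estimate rather than this one-dimensional version.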
Pages: 13