Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering

Cited by: 2
Authors
O'Shaughnessy, Pauline [1 ]
Lin, Yan-Xia [1 ]
Affiliations
[1] Univ Wollongong, Sch Math & Appl Stat, Wollongong, NSW 2522, Australia
Keywords
data masking; multiplicative noise; data mining; sample size calculation
DOI
10.3390/math10244744
Chinese Library Classification
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
In the age of data, data mining provides feasible tools with which to handle large datasets consisting of data from multiple sources. However, there is limited research on retrieving statistical information from data when data are confidential and cannot be shared directly. In this paper, we address this problem and propose a framework for performing data analysis using data from multiple sources without revealing true values for privacy purposes. The proposed framework includes three steps. First, data custodians individually mask data before publishing; then, the masked data collection is used to reconstruct the density function of the original dataset, from which resampled values are generated; last, existing data mining techniques are applied directly to the resampled data. This framework utilises the technique of reconstructing an original density function from noise-masked data using the moment-based density estimation method, which plays an essential role. Simulation studies show that the proposed framework performs well; analysis results from the resampled data are comparable to those of the original data when the density of the original data is estimated well. The proposed framework is demonstrated in data clustering analysis using the example of a real-life Australian soybean dataset. Results from the k-means algorithms with two and three fitted clusters are presented to show that cluster analysis using resampled data can well replicate that of the original data.
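The masking step and the moment relationship that underlies moment-based reconstruction can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform noise distribution, its bounds, the simulated data, and all function names are assumptions chosen for illustration. The sketch relies only on the fact that, for masked values Y = X * C with noise C independent of X, E[X^k] = E[Y^k] / E[C^k]. It stops at recovering moments and does not reconstruct the density, resample, or run k-means.

```python
import random


def mask(values, low=0.5, high=1.5, seed=0):
    """Mask each value with independent multiplicative Uniform(low, high) noise.

    The noise distribution is an illustrative assumption, not the paper's choice.
    """
    rng = random.Random(seed)
    return [x * rng.uniform(low, high) for x in values]


def uniform_raw_moment(k, low, high):
    """k-th raw moment E[C^k] of C ~ Uniform(low, high)."""
    return (high ** (k + 1) - low ** (k + 1)) / ((k + 1) * (high - low))


def estimate_original_moment(masked, k, low=0.5, high=1.5):
    """Estimate E[X^k] from masked data using E[X^k] = E[Y^k] / E[C^k]."""
    sample_moment = sum(y ** k for y in masked) / len(masked)
    return sample_moment / uniform_raw_moment(k, low, high)


# Simulated confidential data (hypothetical; stands in for a custodian's values).
rng = random.Random(42)
original = [rng.gauss(10.0, 2.0) for _ in range(50_000)]

# The custodian publishes only the masked values and the noise distribution.
masked = mask(original, seed=1)

# The analyst recovers moments of the original data without seeing it.
est_mean = estimate_original_moment(masked, 1)
est_second = estimate_original_moment(masked, 2)

# Ground truth, available here only because the data are simulated.
true_mean = sum(original) / len(original)
true_second = sum(x * x for x in original) / len(original)

print(f"mean: true={true_mean:.3f} estimated={est_mean:.3f}")
print(f"second moment: true={true_second:.3f} estimated={est_second:.3f}")
```

In the framework described in the abstract, moments recovered in this way would feed the density reconstruction, and the resampled values drawn from that density would then be passed to a standard clustering routine such as k-means.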
Pages: 13
Related papers
50 in total
  • [31] Privacy-Preserving Data Mining in Homogeneous Collaborative Clustering
    Ouda, Mohamed
    Salem, Sameh
    Ali, Ihab
    Saad, El-Sayed
    INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2015, 12 (06) : 604 - 612
  • [32] MINING HIDDEN TREASURES IN MULTIPLE DATA SOURCES
    [Author unknown]
    GERONTOLOGIST, 2009, 49 : 172 - 172
  • [33] Clustering-assisted privacy perseveration model for data mining
    Mohana, S.
    Nithya, T. M.
    Bushra, Sardar Khan Nikkath
    Vasanthi, Ramakrishnan
    Guruprakash, K. S.
    Rajesh, Sudha
    INTERNATIONAL JOURNAL OF AD HOC AND UBIQUITOUS COMPUTING, 2024, 47 (02) : 108 - 125
  • [34] Data mining with clustering
    Klimek, Petr
    E & M EKONOMIE A MANAGEMENT, 2008, 11 (02) : 120 - 126
  • [35] Privacy protection data publishing method for data privacy differences
    Yu Y.
    Zhou D.
    Li H.
    Wu X.
    Huazhong Keji Daxue Xuebao (Ziran Kexue Ban)/Journal of Huazhong University of Science and Technology (Natural Science Edition), 2020, 48 (09) : 57 - 63
  • [36] Open Data, Data Protection, and Group Privacy
    Luciano Floridi
    Philosophy & Technology, 2014, 27 (1) : 1 - 3
  • [37] Architecture-centric data mining middleware supporting multiple data sources and mining techniques
    Lee, Sai Peck
    Hen, Lai Ee
    ICSOFT 2007: PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES, VOL ISDM/WSEHST/DC, 2007, : 224 - 227
  • [38] Data Privacy Protection Using Multiple Cloud Storages
    Zhang Wei
    Sun Xinwei
    Xu Tao
    PROCEEDINGS 2013 INTERNATIONAL CONFERENCE ON MECHATRONIC SCIENCES, ELECTRIC ENGINEERING AND COMPUTER (MEC), 2013, : 1768 - 1772
  • [39] Clustering for data mining: A data recovery approach
    Leslie Rutkowski
    Psychometrika, 2007, 72 : 109 - 110
  • [40] The Effect of Clustering on Data Privacy
    Canbay, Pelin
    Sever, Hayri
    2015 IEEE 14TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2015, : 277 - 282