Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering

Times cited: 2
Authors
O'Shaughnessy, Pauline [1]
Lin, Yan-Xia [1]
Affiliations
[1] Univ Wollongong, Sch Math & Appl Stat, Wollongong, NSW 2522, Australia
Keywords
data masking; multiplicative noise; data mining; sample size calculation
DOI
10.3390/math10244744
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
In the age of data, data mining provides feasible tools with which to handle large datasets consisting of data from multiple sources. However, there is limited research on retrieving statistical information from data when data are confidential and cannot be shared directly. In this paper, we address this problem and propose a framework for performing data analysis using data from multiple sources without revealing true values for privacy purposes. The proposed framework includes three steps. First, data custodians individually mask data before publishing; then, the masked data collection is used to reconstruct the density function of the original dataset, from which resampled values are generated; last, existing data mining techniques are applied directly to the resampled data. This framework utilises the technique of reconstructing an original density function from noise-masked data using the moment-based density estimation method, which plays an essential role. Simulation studies show that the proposed framework performs well; analysis results from the resampled data are comparable to those of the original data when the density of the original data is estimated well. The proposed framework is demonstrated in data clustering analysis using the example of a real-life Australian soybean dataset. Results from the k-means algorithms with two and three fitted clusters are presented to show that cluster analysis using resampled data can well replicate that of the original data.
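The first two steps of the framework can be illustrated with a minimal sketch. It is not the authors' implementation: the noise distribution (uniform on [0.5, 1.5]), the function names, and the simulated data are all assumptions made for illustration, and a normal fit to the recovered mean and variance stands in for the paper's moment-based density estimator. The sketch relies only on the fact that, for masked values Y = X·C with noise C independent of X, E[Y^k] = E[X^k]·E[C^k].

```python
import random

# Assumed noise model, for illustration only: C ~ Uniform(LOW, HIGH),
# with the distribution (not the individual draws) known to the analyst.
LOW, HIGH = 0.5, 1.5

def mask(values, rng):
    """Step 1: a custodian multiplies each value by independent noise."""
    return [x * rng.uniform(LOW, HIGH) for x in values]

def noise_moment(k):
    """k-th raw moment E[C^k] of the uniform noise."""
    return (HIGH ** (k + 1) - LOW ** (k + 1)) / ((k + 1) * (HIGH - LOW))

def recover_mean_var(masked):
    """Step 2 (simplified): estimate the mean and variance of the original
    data from masked values using E[X^k] = E[Y^k] / E[C^k]."""
    n = len(masked)
    m1 = sum(masked) / n / noise_moment(1)
    m2 = sum(y * y for y in masked) / n / noise_moment(2)
    return m1, max(m2 - m1 * m1, 0.0)  # clamp guards against sampling error

def resample(mean, var, n, rng):
    """Stand-in for density reconstruction: draw from a normal distribution
    with the recovered moments (NOT the paper's density estimator)."""
    sd = var ** 0.5
    return [rng.gauss(mean, sd) for _ in range(n)]

rng = random.Random(0)
# Two hypothetical custodians holding simulated confidential values.
source_a = [rng.gauss(50.0, 5.0) for _ in range(5000)]
source_b = [rng.gauss(50.0, 5.0) for _ in range(5000)]

# Each custodian masks locally; only masked values are pooled.
pooled_masked = mask(source_a, rng) + mask(source_b, rng)
mean_hat, var_hat = recover_mean_var(pooled_masked)
synthetic = resample(mean_hat, var_hat, len(pooled_masked), rng)
```

In the third step, a standard clustering routine such as k-means would be run on `synthetic` rather than on the confidential values. A single normal fit cannot reproduce multimodal structure, which is why the quality of the density estimate matters for cluster analysis.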
Pages: 13
Related papers
50 in total
  • [21] Privacy in data mining
    Domingo-Ferrer, J
    Torra, V
    DATA MINING AND KNOWLEDGE DISCOVERY, 2005, 11 (02) : 117 - 119
  • [22] Mining multiple clustering data for knowledge discovery
    Quan, Thanh Tho
    Hui, Siu Cheung
    Fong, Alvis
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2003, 2843 : 452 - 459
  • [23] Mining multiple clustering data for knowledge discovery
    Quan, TT
    Hui, SC
    Fong, A
    DISCOVERY SCIENCE, PROCEEDINGS, 2003, 2843 : 452 - 459
  • [24] Mining Credit Interest Rate Data from Multiple Data Sources
    Hryhorkiv, Vasyl
    Buiak, Lesia
    Verstiak, Andrii
    Hryhorkiv, Mariia
    Verstiak, Oksana
    Berdnuk, Andrii
    2019 9TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER INFORMATION TECHNOLOGIES (ACIT'2019), 2019, : 265 - 268
  • [25] Privacy-preserving data mining for open government data from heterogeneous sources
    Lee, Jae-Seong
    Jun, Seung-Pyo
    GOVERNMENT INFORMATION QUARTERLY, 2021, 38 (01)
  • [26] Application Research of Data Mining Technology in Personal Privacy Protection and Material Data Analysis
    Liu, Jianguo
    Zhou, Sheng
    INTEGRATED FERROELECTRICS, 2021, 216 (01) : 29 - 42
  • [27] Identity disclosure protection: A data reconstruction approach for privacy-preserving data mining
    Zhu, Dan
    Li, Xiao-Bai
    Wu, Shuning
    DECISION SUPPORT SYSTEMS, 2009, 48 (01) : 133 - 140
  • [28] Data protection, privacy
    La Monaca, G.
    Schiralli, I.
    CLINICA TERAPEUTICA, 2010, 161 (02) : 189 - 191
  • [29] Special issue on data mining and data privacy
    Torra, Vicenç
    Narukawa, Yasuo
    Journal of Ambient Intelligence and Humanized Computing, 2023, 14 (11) : 14977 - 14978
  • [30] On data distortion for privacy preserving data mining
    Kabir, Saif M. A.
    Youssef, Amr M.
    Elhakeem, Ahmed K.
    2007 CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING, VOLS 1-3, 2007, : 308 - 311