Machine Learning Model Generation With Copula-Based Synthetic Dataset for Local Differentially Private Numerical Data

Cited by: 6
Authors
Sei, Yuichi [1, 2]
Onesimu, J. Andrew [3 ]
Ohsuga, Akihiko [1 ]
Affiliations
[1] Univ Electrocommun, Grad Sch Informat & Engn, Dept Informat, Chofu, Tokyo 1828585, Japan
[2] JST, PRESTO, Kawaguchi, Saitama 1020076, Japan
[3] Manipal Acad Higher Educ, Manipal Inst Technol, Dept Comp Sci & Engn, Manipal 576104, India
Funding
Japan Society for the Promotion of Science; Japan Science and Technology Agency
Keywords
Data models; Machine learning; Differential privacy; Decision trees; Numerical models; Machine learning algorithms; Generators; Data mining; Privacy; Data collection; Copula; Local differential privacy; Privacy-preserving data collection; Security
DOI
10.1109/ACCESS.2022.3208715
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the development of IoT technology, personal data are being collected in many places. These data can be used to create new services, but consideration must be given to the individual's privacy. We can safely collect personal data while adding noise by applying differential privacy. However, because such data are very noisy, the accuracy of machine learning models trained on them decreases greatly. In this study, our objective is to build a highly accurate machine learning model using these data. We focus on the decision tree machine learning algorithm, and, instead of applying it as is, we use a preprocessing technique wherein pseudodata are generated using a copula while removing the effect of the noise added by differential privacy. In detail, the proposed novel protocol consists of three steps: generating a covariance matrix from the differentially private numerical data, generating a discrete cumulative distribution function from the differentially private numerical data, and generating copula-based numerical samples. Simulation results using synthetic and real datasets verify the utility of the proposed method not only for the decision tree algorithm but also for other machine learning algorithms such as deep neural networks. This method will help create machine learning models, such as recommendation systems, using differentially private data.
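The three steps named in the abstract can be illustrated with a minimal sketch for two attributes. This is not the authors' implementation. The abstract does not say which noise mechanism or copula family is used, so the Laplace mechanism and the Gaussian copula below are assumptions made for illustration, and every function name is hypothetical. The sketch also takes the marginal histogram as a given input, because estimating it from noisy reports needs a deconvolution step that is not shown here.

```python
import math
import random
from bisect import bisect_left
from itertools import accumulate


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def corrected_correlation(noisy_x, noisy_y, noise_var):
    # Step 1 (assumed form): independent zero-mean noise leaves the
    # covariance unbiased but inflates each variance by noise_var,
    # so the known noise variance is subtracted from the diagonal.
    n = len(noisy_x)
    mx = sum(noisy_x) / n
    my = sum(noisy_y) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(noisy_x, noisy_y)) / (n - 1)
    var_x = sum((x - mx) ** 2 for x in noisy_x) / (n - 1) - noise_var
    var_y = sum((y - my) ** 2 for y in noisy_y) / (n - 1) - noise_var
    # Floor the corrected variances and clip the result to a valid range.
    rho = cov / math.sqrt(max(var_x, 1e-6) * max(var_y, 1e-6))
    return max(-0.99, min(0.99, rho))


def discrete_cdf(probs):
    # Step 2 (simplified): turn an already-estimated histogram into a
    # discrete cumulative distribution function.
    total = sum(probs)
    return list(accumulate(p / total for p in probs))


def gaussian_copula_samples(rho, cdf_x, cdf_y, bins_x, bins_y, n, rng):
    # Step 3: draw correlated normals, map them to uniforms with the
    # normal CDF, then invert each discrete marginal CDF.
    samples = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u1 = 0.5 * (1.0 + math.erf(z1 / math.sqrt(2.0)))
        u2 = 0.5 * (1.0 + math.erf(z2 / math.sqrt(2.0)))
        i = min(bisect_left(cdf_x, u1), len(bins_x) - 1)
        j = min(bisect_left(cdf_y, u2), len(bins_y) - 1)
        samples.append((bins_x[i], bins_y[j]))
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    scale = 0.5  # Laplace scale, assumed; its variance is 2 * scale ** 2
    xs = [rng.uniform(-1, 1) for _ in range(20000)]
    ys = [0.8 * x + 0.2 * rng.uniform(-1, 1) for x in xs]
    noisy_x = [x + laplace_noise(scale, rng) for x in xs]
    noisy_y = [y + laplace_noise(scale, rng) for y in ys]
    rho = corrected_correlation(noisy_x, noisy_y, 2 * scale ** 2)
    cdf = discrete_cdf([1, 1, 2])  # hypothetical marginal histogram
    print(rho, gaussian_copula_samples(rho, cdf, cdf, [-1, 0, 1], [-1, 0, 1], 5, rng))
```

The synthetic pairs returned by `gaussian_copula_samples` could then be passed to any standard learner, such as a decision tree, in place of the raw noisy reports.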
Pages: 101656-101671
Number of pages: 16