Machine Learning Model Generation With Copula-Based Synthetic Dataset for Local Differentially Private Numerical Data

Cited by: 6
Authors
Sei, Yuichi [1 ,2 ]
Onesimu, J. Andrew [3 ]
Ohsuga, Akihiko [1 ]
Affiliations
[1] Univ Electrocommun, Grad Sch Informat & Engn, Dept Informat, Chofu, Tokyo 1828585, Japan
[2] JST, PRESTO, Kawaguchi, Saitama 1020076, Japan
[3] Manipal Acad Higher Educ, Manipal Inst Technol, Dept Comp Sci & Engn, Manipal 576104, India
Funding
Japan Society for the Promotion of Science; Japan Science and Technology Agency
Keywords
Data models; Machine learning; Differential privacy; Decision trees; Numerical models; Machine learning algorithms; Generators; Data mining; Privacy; Data collection; Copula; Local differential privacy; Privacy-preserving data collection; Security
DOI
10.1109/ACCESS.2022.3208715
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
With the development of IoT technology, personal data are being collected in many places. These data can be used to create new services, but the privacy of individuals must be taken into account. Differential privacy allows personal data to be collected safely by adding noise to them. However, because such data are very noisy, the accuracy of machine learning models trained on them decreases greatly. In this study, our objective is to build a highly accurate machine learning model using these data. We focus on the decision tree algorithm and, instead of applying it as is, use a preprocessing technique in which pseudodata are generated using a copula while the effect of the noise added by differential privacy is removed. In detail, the proposed protocol consists of three steps: generating a covariance matrix from the differentially private numerical data, generating a discrete cumulative distribution function from the differentially private numerical data, and generating copula-based numerical samples. Simulation results on synthetic and real datasets verify the utility of the proposed method not only for the decision tree algorithm but also for other machine learning algorithms such as deep neural networks. This method will help create machine learning models, such as recommendation systems, from differentially private data.
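The three steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the noise mechanism, the denoising procedure, or the copula family, so everything concrete here is an assumption made for illustration: values are taken to lie in [-1, 1], each attribute is perturbed with the Laplace mechanism, the covariance is denoised by subtracting the known noise variance, the discrete CDF is estimated with a simple EM deconvolution over fixed bins, and a Gaussian copula is used for sampling. All function names are hypothetical and this is not the authors' implementation.

```python
import math
import numpy as np

def laplace_perturb(x, eps, rng):
    """Client side (assumed): Laplace noise for values in [-1, 1], sensitivity 2."""
    return x + rng.laplace(scale=2.0 / eps, size=x.shape)

def denoised_correlation(y, eps):
    """Step 1 (sketch): covariance of noisy data minus the known noise variance."""
    cov = np.cov(y, rowvar=False)
    noise_var = 2.0 * (2.0 / eps) ** 2           # Var of Laplace(b) is 2 * b**2
    cov[np.diag_indices_from(cov)] -= noise_var  # noise is independent per attribute
    var = np.clip(np.diag(cov), 1e-6, None)      # guard against negative estimates
    corr = np.clip(cov / np.sqrt(np.outer(var, var)), -0.99, 0.99)
    np.fill_diagonal(corr, 1.0)
    # Clip eigenvalues so the matrix is positive semidefinite, then rescale
    # so the diagonal is 1 again.
    w, v = np.linalg.eigh(corr)
    corr = (v * np.clip(w, 1e-6, None)) @ v.T
    d = np.sqrt(np.diag(corr))
    return corr / np.outer(d, d)

def discrete_cdf(y_col, eps, centers, iters=200):
    """Step 2 (sketch): EM estimate of bin probabilities, returned as a CDF."""
    b = 2.0 / eps
    # Laplace likelihood of each noisy value given each bin center
    # (the constant 1 / (2 * b) cancels in the normalization below).
    lik = np.exp(-np.abs(y_col[:, None] - centers[None, :]) / b)
    theta = np.full(len(centers), 1.0 / len(centers))
    for _ in range(iters):
        post = lik * theta
        post /= post.sum(axis=1, keepdims=True)
        theta = post.mean(axis=0)
    return np.cumsum(theta)

def copula_samples(corr, cdfs, centers, n, rng):
    """Step 3 (sketch): Gaussian copula draws mapped through inverse discrete CDFs."""
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n)
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))  # standard normal CDF
    cols = []
    for j, cdf in enumerate(cdfs):
        idx = np.searchsorted(cdf, u[:, j]).clip(0, len(centers) - 1)
        cols.append(centers[idx])
    return np.column_stack(cols)

# Usage on toy data: two positively correlated attributes in [-1, 1].
rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, size=4000)
x2 = np.clip(0.8 * x1 + 0.2 * rng.normal(size=4000), -1.0, 1.0)
true_data = np.column_stack([x1, x2])

eps = 4.0                                  # assumed per-attribute privacy budget
noisy = laplace_perturb(true_data, eps, rng)
centers = np.linspace(-0.95, 0.95, 20)     # assumed bin centers

corr = denoised_correlation(noisy, eps)
cdfs = [discrete_cdf(noisy[:, j], eps, centers) for j in range(noisy.shape[1])]
synthetic = copula_samples(corr, cdfs, centers, 1000, rng)
```

The resulting `synthetic` array could then be passed to any standard learner, such as a decision tree, in place of the noisy records. The per-attribute budget `eps`, the bin layout, and the number of EM iterations are free parameters of this sketch rather than values taken from the paper.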
Pages: 101656-101671
Number of pages: 16