Machine Learning Model Generation With Copula-Based Synthetic Dataset for Local Differentially Private Numerical Data

Cited by: 6
Authors
Sei, Yuichi [1, 2]
Onesimu, J. Andrew [3 ]
Ohsuga, Akihiko [1 ]
Affiliations
[1] Univ Electrocommun, Grad Sch Informat & Engn, Dept Informat, Chofu, Tokyo 1828585, Japan
[2] JST, PRESTO, Kawaguchi, Saitama 1020076, Japan
[3] Manipal Acad Higher Educ, Manipal Inst Technol, Dept Comp Sci & Engn, Manipal 576104, India
Funding
Japan Society for the Promotion of Science; Japan Science and Technology Agency
Keywords
Data models; Machine learning; Differential privacy; Decision trees; Numerical models; Machine learning algorithms; Generators; Data mining; Privacy; Data collection; Copula; data mining; decision trees; local differential privacy; machine learning; privacy-preserving data collection; DECISION TREE; SECURITY;
DOI
10.1109/ACCESS.2022.3208715
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
With the development of IoT technology, personal data are being collected in many places. These data can be used to create new services, but the privacy of individuals must be protected. Applying differential privacy allows personal data to be collected safely by adding noise to them. However, because the resulting data are very noisy, the accuracy of machine learning models trained on them decreases greatly. In this study, our objective is to build a highly accurate machine learning model using these data. We focus on the decision tree algorithm and, instead of applying it as is, use a preprocessing technique in which pseudodata are generated using a copula while the effect of the noise added by differential privacy is removed. In detail, the proposed protocol consists of three steps: generating a covariance matrix from the differentially private numerical data, generating a discrete cumulative distribution function from the differentially private numerical data, and generating copula-based numerical samples. Simulation results using synthetic and real datasets verify the utility of the proposed method not only for the decision tree algorithm but also for other machine learning algorithms such as deep neural networks. This method will help create machine learning models, such as recommendation systems, from differentially private data.
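The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the noise mechanism, the copula family, or the estimators, so everything below is an assumption made for illustration. The sketch assumes each attribute lies in [-1, 1] and is perturbed with the Laplace mechanism, removes the known noise variance from the diagonal of the covariance matrix, recovers each marginal with a generic EM deconvolution on a fixed grid, and samples from a Gaussian copula using the Pearson correlation as an approximation of the copula parameter. All function names (`perturb`, `estimate_correlation`, `estimate_cdf`, `sample_copula`) are hypothetical.

```python
import numpy as np
from scipy.stats import laplace, norm


def perturb(x, eps, rng):
    """Client side (assumed): Laplace mechanism for values in [-1, 1], sensitivity 2."""
    b = 2.0 / eps
    return x + rng.laplace(0.0, b, size=x.shape), b


def estimate_correlation(noisy, b):
    """Step 1: covariance of the noisy data with the noise variance removed."""
    cov = np.cov(noisy, rowvar=False)
    # Independent Laplace(b) noise inflates only the diagonal, by 2 * b**2.
    var = np.clip(np.diag(cov) - 2.0 * b**2, 1e-6, None)
    corr = np.clip(cov / np.sqrt(np.outer(var, var)), -0.99, 0.99)
    np.fill_diagonal(corr, 1.0)
    # Clip negative eigenvalues so the matrix can be used for sampling.
    w, v = np.linalg.eigh(corr)
    psd = (v * np.clip(w, 1e-6, None)) @ v.T
    d = np.sqrt(np.diag(psd))
    return psd / np.outer(d, d)


def estimate_cdf(noisy_col, b, centers, n_iter=200):
    """Step 2: discrete CDF on `centers`, estimated by EM deconvolution."""
    edges = np.linspace(noisy_col.min(), noisy_col.max(), 201)
    counts, _ = np.histogram(noisy_col, bins=edges)
    # A[j, i] = P(noisy value lands in bin j | true value = centers[i]).
    A = np.diff(laplace.cdf(edges[:, None], loc=centers[None, :], scale=b), axis=0)
    p = np.full(len(centers), 1.0 / len(centers))
    for _ in range(n_iter):
        mix = np.maximum(A @ p, 1e-12)
        p = p * (A.T @ (counts / mix))
        p /= p.sum()
    return np.cumsum(p)


def sample_copula(corr, cdfs, centers, n, rng):
    """Step 3: Gaussian-copula samples mapped through each inverse discrete CDF."""
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n)
    u = norm.cdf(z)  # uniform marginals that keep the dependence structure
    cols = []
    for j, cdf in enumerate(cdfs):
        idx = np.minimum(np.searchsorted(cdf, u[:, j]), len(centers) - 1)
        cols.append(centers[idx])
    return np.column_stack(cols)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1 = rng.uniform(-1, 1, 20000)
    x2 = np.clip(0.8 * x1 + 0.2 * rng.uniform(-1, 1, 20000), -1, 1)
    noisy, b = perturb(np.column_stack([x1, x2]), eps=2.0, rng=rng)
    centers = np.linspace(-1, 1, 21)
    corr = estimate_correlation(noisy, b)
    cdfs = [estimate_cdf(noisy[:, j], b, centers) for j in range(2)]
    synth = sample_copula(corr, cdfs, centers, 5000, rng)
    print(np.corrcoef(synth, rowvar=False))
```

The resulting `synth` array could then be passed to any standard learner, such as a decision tree, in place of the noisy records. How well this works depends heavily on the privacy budget and the sample size; the paper itself should be consulted for the actual estimators and their guarantees.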
Pages: 101656-101671
Page count: 16