A distributed multiple sample testing for massive data

Cited by: 3
Authors
Xie Xiaoyue [1, 2]
Shi Jian [1, 2]
Song Kai [3]
Affiliations
[1] Chinese Acad Sci, Acad Math & Syst Sci, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Math Sci, Beijing, Peoples R China
[3] Beijing Inst Technol, Sch Management & Econ, Beijing, Peoples R China
Keywords
Distributed scheme; hypothesis testing; fraud detection; classification
DOI
10.1080/02664763.2021.1911967
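Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract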
When data are stored in a distributed manner, direct application of traditional hypothesis testing procedures is often prohibitive because of communication costs and privacy concerns. This paper develops and investigates a distributed two-node Kolmogorov-Smirnov hypothesis testing scheme, implemented with a divide-and-conquer strategy. Building on the proposed testing scheme, it also provides a distributed fraud detection method and a distribution-based classification method for multi-node machines. Distributed fraud detection identifies which node in a multi-node system stores fraudulent data; distribution-based classification determines whether the distributions across nodes differ and classifies the distinct distributions. These methods can improve the accuracy of statistical inference in a distributed storage architecture. The feasibility of the proposed methods is verified through simulation and real-example studies.
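The abstract does not spell out the testing procedure, so the following is only a minimal sketch of one way a two-node Kolmogorov-Smirnov comparison could avoid moving raw data; it is not the authors' scheme. It assumes both nodes agree on a shared evaluation grid in advance, each node sends only its empirical CDF values on that grid plus its sample size, and a coordinator compares the two summaries. The function names (`local_ecdf`, `kolmogorov_sf`, `two_node_ks`) and the grid-based summary are hypothetical. Because the maximum is taken over grid points only, the statistic is a lower bound on the exact two-sample KS statistic, and the p-value uses the asymptotic Kolmogorov distribution.

```python
import numpy as np


def local_ecdf(sample, grid):
    """Run on one node: summarize the local sample as ECDF values on a shared grid.

    Only this summary (not the raw observations) would leave the node.
    """
    sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts the observations that are <= each grid point
    counts = np.searchsorted(sample, grid, side="right")
    return counts / sample.size, sample.size


def kolmogorov_sf(x, terms=100):
    """Asymptotic survival function Q(x) = 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 x^2)."""
    if x < 0.2:
        # Q(x) is numerically 1 here, and the truncated series is unreliable near 0
        return 1.0
    k = np.arange(1, terms + 1)
    series = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * x**2))
    return float(np.clip(series, 0.0, 1.0))


def two_node_ks(summary_a, summary_b):
    """Run on the coordinator: grid-based KS statistic and asymptotic p-value."""
    (ecdf_a, n), (ecdf_b, m) = summary_a, summary_b
    # Largest ECDF gap over the shared grid (lower bound on the exact KS statistic)
    d = float(np.max(np.abs(ecdf_a - ecdf_b)))
    # Standard two-sample scaling sqrt(n*m/(n+m)) before the Kolmogorov tail
    p_value = kolmogorov_sf(np.sqrt(n * m / (n + m)) * d)
    return d, p_value
```

In this sketch each node would call `local_ecdf(local_data, grid)` and transmit the result, and the coordinator would call `two_node_ks` on the two summaries. A finer grid brings the statistic closer to the exact KS value at the cost of a larger message.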
Pages: 555-573
Page count: 19