A distributed multiple sample testing for massive data

Cited by: 3
Authors
Xie Xiaoyue [1, 2]
Shi Jian [1, 2]
Song Kai [3]
Affiliations
[1] Chinese Acad Sci, Acad Math & Syst Sci, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Math Sci, Beijing, Peoples R China
[3] Beijing Inst Technol, Sch Management & Econ, Beijing, Peoples R China
Keywords
Distributed scheme; hypothesis testing; fraud detection; classification
DOI
10.1080/02664763.2021.1911967
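Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract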
When data are stored in a distributed manner, direct application of traditional hypothesis testing procedures is often prohibitive because of communication costs and privacy concerns. This paper develops and investigates a distributed two-node Kolmogorov-Smirnov hypothesis testing scheme, implemented with a divide-and-conquer strategy. Building on the proposed testing scheme, the paper also provides a distributed fraud detection method and a distribution-based classification method for multi-node machines. Distributed fraud detection identifies which node in a multi-node system stores fraudulent data, and distribution-based classification determines whether the distributions across nodes differ and classifies the distinct distributions. These methods can improve the accuracy of statistical inference in a distributed storage architecture. The feasibility of the proposed methods is verified through simulation studies and a real-data example.
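As a rough illustration of the idea of a two-node Kolmogorov-Smirnov test with limited communication, the following is a minimal sketch and not the scheme proposed in the paper. It assumes that both nodes agree on a shared evaluation grid and send only cumulative counts on that grid to a coordinator; the function names `local_summary` and `ks_from_summaries`, the grid, and the simulated data are all hypothetical. Because the maximum is taken over grid points only, the statistic is a lower bound on the exact KS statistic, so the test is somewhat conservative.

```python
import math
import numpy as np

def local_summary(x, grid):
    """Run on one node: count observations <= each grid point, plus sample size.

    Only these counts (not the raw data) would be sent to the coordinator.
    """
    x = np.sort(np.asarray(x, dtype=float))
    counts = np.searchsorted(x, grid, side="right")
    return counts, x.size

def ks_from_summaries(summary_a, summary_b, alpha=0.05):
    """Run on the coordinator: grid-based two-sample KS statistic and decision.

    Uses the standard large-sample critical value
    sqrt(-ln(alpha/2)/2) * sqrt((n + m) / (n * m)).
    """
    counts_a, n = summary_a
    counts_b, m = summary_b
    # Largest gap between the two empirical CDFs, evaluated on the grid only.
    d = float(np.max(np.abs(counts_a / n - counts_b / m)))
    crit = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return d, crit, d > crit

# Hypothetical example: two nodes holding samples with different means.
rng = np.random.default_rng(0)
grid = np.linspace(-5, 5, 201)          # shared grid, an assumption of this sketch
node_a = rng.normal(0.0, 1.0, 5000)
node_b = rng.normal(0.5, 1.0, 5000)

d, crit, reject = ks_from_summaries(local_summary(node_a, grid),
                                    local_summary(node_b, grid))
print(f"D = {d:.4f}, critical value = {crit:.4f}, reject H0: {reject}")
```

In this sketch each node transmits one integer per grid point, so the communication cost depends on the grid size rather than on the sample size; a finer grid brings the statistic closer to the exact KS value.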
Pages: 555-573
Page count: 19