A distributed multiple sample testing for massive data

Cited by: 3

Authors:

Xie Xiaoyue [1,2]

Shi Jian [1,2]

Song Kai [3]

Affiliations:

[1] Chinese Acad Sci, Acad Math & Syst Sci, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Math Sci, Beijing, Peoples R China

[3] Beijing Inst Technol, Sch Management & Econ, Beijing, Peoples R China

Source:

JOURNAL OF APPLIED STATISTICS | 2023, Vol. 50, Issue 03

Keywords:

Distributed scheme; hypothesis testing; fraud detection; classification

DOI:

10.1080/02664763.2021.1911967

Chinese Library Classification:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]

Discipline classification codes:

020208 ; 070103 ; 0714 ;

Abstract:

When data are stored in a distributed manner, directly applying traditional hypothesis testing procedures is often prohibitive because of communication costs and privacy concerns. This paper develops and investigates a distributed two-node Kolmogorov-Smirnov hypothesis testing scheme implemented with a divide-and-conquer strategy. Building on this scheme, it also provides a distributed fraud detection method and a distribution-based classification method for multi-node machines. Distributed fraud detection identifies which node in a multi-node system stores fraudulent data; distribution-based classification determines whether the distributions across nodes differ and classifies the differing distributions. These methods can improve the accuracy of statistical inference in a distributed storage architecture. The feasibility of the proposed methods is verified through simulation studies and real-data examples.
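To make the idea of a two-node Kolmogorov-Smirnov comparison concrete, the following is a minimal sketch, not the scheme proposed in the paper. It assumes each node shares only its sample size and its empirical CDF evaluated on a common grid of points, so raw observations never leave the node. The function names `local_ecdf_summary` and `ks_from_summaries`, the grid, and the toy samples are all hypothetical, and the p-value uses the standard asymptotic Kolmogorov series. Because the maximum is taken over grid points only, the statistic can understate the exact KS distance.

```python
import bisect
import math


def local_ecdf_summary(sample, grid):
    """Run on one node: return (sample size, ECDF values at the shared grid)."""
    ordered = sorted(sample)
    n = len(ordered)
    # bisect_right counts the observations that are <= each grid point.
    return n, [bisect.bisect_right(ordered, g) / n for g in grid]


def ks_from_summaries(summary_a, summary_b):
    """Run on the coordinator: grid-based KS statistic and asymptotic p-value."""
    (n_a, ecdf_a), (n_b, ecdf_b) = summary_a, summary_b
    d = max(abs(fa - fb) for fa, fb in zip(ecdf_a, ecdf_b))
    lam = math.sqrt(n_a * n_b / (n_a + n_b)) * d
    if lam < 0.2:
        # The series below converges slowly near zero; the p-value is ~1 there.
        return d, 1.0
    # Kolmogorov series: Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)
    p = 2 * sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam) for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Hypothetical toy data: node B holds the same values as node A, node C is shifted.
grid = [j / 10 for j in range(16)]            # shared grid on [0, 1.5]
node_a = [i / 1000 for i in range(1000)]
node_b = list(node_a)
node_c = [x + 0.5 for x in node_a]

same = ks_from_summaries(local_ecdf_summary(node_a, grid),
                         local_ecdf_summary(node_b, grid))
shifted = ks_from_summaries(local_ecdf_summary(node_a, grid),
                            local_ecdf_summary(node_c, grid))
print("A vs B:", same)
print("A vs C:", shifted)
```

In this sketch the communication cost per node is one number per grid point, independent of the local sample size. A finer grid tightens the approximation at the price of more communication.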


Pages: 555-573

Page count: 19
