Distributed Bayesian posterior voting strategy for massive data

Cited by: 1
Authors
Li, Xuerui [1]
Kang, Lican [2]
Liu, Yanyan [1]
Wu, Yuanshan [3]
Affiliations
[1] Wuhan Univ, Sch Math & Stat, Wuhan, Peoples R China
[2] NUS Med Sch, Ctr Quantitat Med Duke, Singapore, Singapore
[3] Zhongnan Univ Econ, Sch Stat & Math, Wuhan, Peoples R China
Source
ELECTRONIC RESEARCH ARCHIVE | 2022, Vol. 30, No. 5
Keywords
Hierarchical Bayes formulation; massive data; majority voting; split-and-conquer; shrinkage prior; variable selection; regression
DOI
10.3934/era.2022098
Chinese Library Classification
O1 [Mathematics]
Discipline codes
0701; 070101
Abstract
The emergence of massive data has driven recent interest in developing statistical learning methods and large-scale algorithms for analysis on distributed platforms. One widely used statistical approach is split-and-conquer (SaC), which was originally carried out by aggregating all local solutions through a simple average in order to reduce communication costs and the associated computational burden. Aiming at low computation cost with acceptable accuracy, this paper extends SaC to Bayesian variable selection for ultrahigh-dimensional linear regression and builds the aggregation method BVSaC. Suppose ultrahigh-dimensional data are stored in a distributed manner across multiple computing nodes, with each node holding a disjoint subset of the data. On each node, variable selection and coefficient estimation are performed through a hierarchical Bayes formulation. The local results are then combined by the weighted majority-voting method BVSaC to retain good performance. The proposed approach requires only a small amount of computation on each local dataset, which eases the burden of Bayesian computation at a small cost in accuracy and in turn makes the analysis of extraordinarily large datasets feasible. Simulations and a real-world example show that the proposed approach performs as well as the whole-sample hierarchical Bayes method in terms of variable selection and estimation accuracy.
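The abstract outlines a two-stage pipeline: each node runs variable selection on its own data block, and the node-level selections are then combined by weighted majority voting. Below is a minimal sketch of that aggregation step in Python, under stated assumptions: the marginal-correlation screen standing in for the per-node selector, the size-proportional weights, the 0.5 voting cutoff, and the function names (`local_select`, `bvsac_vote`) are all illustrative choices, not the paper's hierarchical Bayes formulation or its actual weighting rule.

```python
import numpy as np

def local_select(X, y, threshold=0.2):
    # Illustrative per-node selector: flag variables whose absolute Pearson
    # correlation with y exceeds a threshold. The paper instead fits a
    # hierarchical Bayes model with a shrinkage prior on each node; that
    # sampler is not reproduced here.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return (r > threshold).astype(float)

def bvsac_vote(subsets, weights=None, cutoff=0.5):
    # Weighted majority voting over per-node selection indicators.
    # `subsets` is a list of (X_k, y_k) pairs, one per computing node.
    votes = np.vstack([local_select(Xk, yk) for Xk, yk in subsets])        # shape (K, p)
    if weights is None:
        weights = np.array([len(yk) for _, yk in subsets], dtype=float)    # weight nodes by subset size
    weights = weights / weights.sum()
    score = weights @ votes                  # weighted fraction of nodes voting for each variable
    return np.flatnonzero(score > cutoff)    # indices kept by the aggregated vote

# Toy run: p = 200 predictors, only the first 3 active, data split over K = 5 nodes.
rng = np.random.default_rng(0)
n, p, K = 1000, 200, 5
beta = np.zeros(p)
beta[:3] = 2.0
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)
subsets = [(X[k::K], y[k::K]) for k in range(K)]
print(bvsac_vote(subsets))   # should recover the active indices [0 1 2]
```

Weighting nodes by subset size is one plausible choice for this sketch; the paper's BVSaC derives its weights from the per-node Bayesian output rather than from sample sizes alone.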
Pages: 1936-1953
Number of pages: 18
Related papers
50 records in total
  • [1] Distributed Bayesian Inference in Massive Spatial Data
    Guhaniyogi, Rajarshi
    Li, Cheng
    Savitsky, Terrance
    Srivastava, Sanvesh
    STATISTICAL SCIENCE, 2023, 38 (02) : 262 - 284
  • [2] Voting massive collections of Bayesian network classifiers for data streams
    Bouckaert, Remco R.
    AI 2006: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 4304 : 243 - 252
  • [3] On the organization of cluster voting with massive distributed streams
    Alhudhaif, Adi
    Yan, Tong
    Berkovich, Simon
    2014 FIFTH INTERNATIONAL CONFERENCE ON COMPUTING FOR GEOSPATIAL RESEARCH AND APPLICATION (COM.GEO), 2014, : 55 - 62
  • [4] Distributed computing and storage strategy for massive high resolution image data
    Li, Guozhang
    Alfred, Rayner
    Wang, Yetong
    Xing, Kongduo
    INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS, 2024, 18 (04): : 2901 - 2913
  • [5] Bayesian Bootstraps for Massive Data
    Barrientos, Andres F.
    Pena, Victor
    BAYESIAN ANALYSIS, 2020, 15 (02): : 363 - 388
  • [6] Distributed Real-time Organization and Scheduling Strategy of Massive Grid Data
    Huang, Ying
    Xie, Zhong
    Wu, Liang
    Guo, Mingqiang
    Luo, Xiangang
    INTERNATIONAL JOINT CONFERENCE ON COMPUTATIONAL SCIENCES AND OPTIMIZATION, VOL 2, PROCEEDINGS, 2009, : 184 - 186
  • [7] Nonparametric Bayesian Aggregation for Massive Data
    Shang, Zuofeng
    Hao, Botao
    Cheng, Guang
    JOURNAL OF MACHINE LEARNING RESEARCH, 2019, 20
  • [8] Distributed Bayesian Posterior Sampling via Moment Sharing
    Xu, Minjie
    Lakshminarayanan, Balaji
    Teh, Yee Whye
    Zhu, Jun
    Zhang, Bo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 27 (NIPS 2014), 2014, 27