Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model

Cited by: 0
Authors
Zhang, Sheng [1]
Tan, Fei [1]
Peng, Hanxiang [1]
Affiliations
[1] Indiana Univ Indianapolis, Dept Math Sci, 402 N Blackford St LD 270, Indianapolis, IN 46202 USA
Keywords
Asymptotic normality; A-optimality; big data; least squares estimate; sample size determination; APPROXIMATION
DOI
10.1080/00949655.2024.2434669
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
To efficiently approximate the least squares estimator (LSE) in a Big Data linear regression model using a subsampling approach, optimal sampling distributions were derived by minimizing the trace norm of the covariance matrix of a smooth function of the subsampling LSE. An algorithm was developed that significantly reduces the computation time for the subsampling LSE compared to the full-sample LSE. Additionally, the subsampling LSE was shown to be asymptotically normal almost surely for an arbitrary sampling distribution under suitable conditions. Motivated by the need for subsampling in Big Data analysis and data splitting in machine learning, we investigated sample size determination (SSD) for multidimensional parameters and derived analytical formulas for calculating sample sizes. Through extensive simulations and real-world data applications, we assessed the numerical properties of both the subsampling approach and SSD methodology. Our findings revealed that the A-optimal subsampling method significantly outperformed uniform and leverage-score subsampling techniques. Furthermore, the algorithm considerably reduced the computational time required for implementing the full sample LSE. Additionally, the SSD provided a theoretical basis for selecting sample sizes.
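The subsampling idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names are hypothetical, and the residual-times-norm sampling probabilities below are an assumed form commonly associated with A-optimal subsampling in linear regression, not a formula taken from this record. For simplicity the sketch also forms X'X on the full data, so it does not reproduce the computational savings the paper reports.

```python
import numpy as np

def subsample_lse(X, y, probs, r, rng):
    """Inverse-probability-weighted least squares on r rows drawn with replacement."""
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=probs)
    sw = np.sqrt(1.0 / (r * probs[idx]))  # square roots of the sampling weights
    beta, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
    return beta

def uniform_probs(n):
    """Uniform sampling distribution over n rows."""
    return np.full(n, 1.0 / n)

def leverage_probs(X):
    """Sampling distribution proportional to the statistical leverage scores."""
    Q, _ = np.linalg.qr(X)            # reduced QR: Q has shape (n, p)
    h = np.sum(Q**2, axis=1)          # diagonal of the hat matrix
    return h / h.sum()

def residual_norm_probs(X, y, pilot_beta):
    """ASSUMED A-optimal-style form: pi_i proportional to |e_i| * ||(X'X)^{-1} x_i||."""
    resid = np.abs(y - X @ pilot_beta)
    M_inv = np.linalg.inv(X.T @ X)    # full-data Gram matrix, used for simplicity only
    norms = np.linalg.norm(X @ M_inv, axis=1)
    score = resid * norms + 1e-12     # small offset keeps every probability positive
    return score / score.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 20000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

    full = np.linalg.lstsq(X, y, rcond=None)[0]
    pilot = subsample_lse(X, y, uniform_probs(n), 500, rng)  # cheap pilot estimate
    for name, probs in [("uniform", uniform_probs(n)),
                        ("leverage", leverage_probs(X)),
                        ("residual-norm", residual_norm_probs(X, y, pilot))]:
        est = subsample_lse(X, y, probs, 2000, rng)
        print(name, np.sum((est - full) ** 2))  # squared distance to the full-sample LSE
```

The demo prints, for each sampling distribution, the squared distance between one subsampling LSE and the full-sample LSE; a single draw is noisy, so any comparison between schemes would need averaging over many repetitions.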
Pages: 628-653
Number of pages: 26
Related papers
50 in total
  • [21] Information-Based Optimal Subdata Selection for Big Data Linear Regression
    Wang, HaiYing
    Yang, Min
    Stufken, John
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2019, 114 (525) : 393 - 405
  • [22] Sample Size Requirements for Estimation of Item Parameters in the Multidimensional Graded Response Model
    Jiang, Shengyu
    Wang, Chun
    Weiss, David J.
    FRONTIERS IN PSYCHOLOGY, 2016, 7
  • [23] A unified approach to sample size and power determination for testing parameters in generalized linear and time-to-event regression models
    Martens, Michael J.
    Logan, Brent R.
    STATISTICS IN MEDICINE, 2021, 40 (05) : 1121 - 1132
  • [25] Predictive Big Data Analytics Using Multiple Linear Regression Model
    Khine, Kyi Lai Lai
    Nyunt, Thi Thi Soe
    BIG DATA ANALYSIS AND DEEP LEARNING APPLICATIONS, 2019, 744 : 9 - 19
  • [26] Determination of Sample Size on Logistic Regression for Sakernas Data in Jayapura Regency in 2015
    Fitri, Fadhilah
    Tantular, Bertho
    PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON MATHEMATICS AND MATHEMATICS EDUCATION 2018 (ICM2E 2018), 2018, 235 : 17 - 19
  • [27] CONTRIBUTION TO PLANNING OF SAMPLE SIZE .3. PLANNING OF SAMPLE SIZE FOR COMPARISON OF REGRESSION COEFFICIENTS IN CASE OF SINGLE LINEAR REGRESSION MODEL I
    HERRENDO.G
    BOCK, J
    BIOMETRISCHE ZEITSCHRIFT, 1973, 15 (05) : 319 - 323
  • [28] Prediction of Oil Production through Linear Regression Model and Big Data Tools
    Alharbi, Rehab
    Alageel, Nojood
    Alsayil, Maryam
    Alharbi, Rahaf
    Alhakamy, A'aeshah
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (12) : 380 - 387
  • [29] THE OPTIMAL SIZE OF A PRELIMINARY TEST OF LINEAR RESTRICTIONS IN A MISSPECIFIED REGRESSION-MODEL
    GILES, DEA
    LIEBERMAN, O
    GILES, JA
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1992, 87 (420) : 1153 - 1157
  • [30] Sample size for clustered count data based on discrete Weibull regression model
    Yoo, Hanna
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2023, 52 (12) : 5850 - 5856