Why Are Big Data Matrices Approximately Low Rank?

Cited by: 120
Authors
Udell, Madeleine [1]
Townsend, Alex [2]
Affiliations
[1] Cornell Univ, Dept Operat Res & Informat Engn, Ithaca, NY 14853 USA
[2] Cornell Univ, Dept Math, Ithaca, NY 14853 USA
Source
Funding
National Science Foundation (USA)
Keywords
big data; low rank matrices; Johnson-Lindenstrauss lemma; SINGULAR-VALUES; MICROARRAY DATA; DECOMPOSITION; DISCOVERY;
DOI
10.1137/18M1183480
Chinese Library Classification (CLC) number
O29 [Applied Mathematics]
Discipline classification code
070104
Abstract
Matrices of (approximate) low rank are pervasive in data science, appearing in movie preferences, text documents, survey data, medical records, and genomics. While there is a vast literature on how to exploit low rank structure in these datasets, less attention has been paid to explaining why the low rank structure appears in the first place. Here, we explain the effectiveness of low rank models in data science by considering a simple generative model for these matrices: we suppose that each row or column is associated with a (possibly high dimensional) bounded latent variable, and entries of the matrix are generated by applying a piecewise analytic function to these latent variables. These matrices are in general full rank. However, we show that we can approximate every entry of an m × n matrix drawn from this model to within a fixed absolute error by a low rank matrix whose rank grows as O(log(m + n)). Hence any sufficiently large matrix from such a latent variable model can be approximated, up to a small entrywise error, by a low rank matrix.
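The claim in the abstract can be explored numerically. The sketch below is only an illustration under stated assumptions, not the authors' construction: the latent vectors are drawn uniformly from a scaled cube, the analytic function is taken to be exp of an inner product, and a truncated SVD stands in for the low rank approximant. The names latent_variable_matrix, truncated_svd, and entrywise_rank are hypothetical helpers introduced here. Because the truncated SVD is optimal in the Frobenius and spectral norms rather than entrywise, the rank it reports is only an upper bound on the rank needed for a given entrywise error.

```python
import numpy as np


def latent_variable_matrix(m, n, dim=5, seed=0):
    """Sample an m x n matrix X[i, j] = exp(<a_i, b_j>) from bounded latent vectors.

    The choice of exp and of uniform latent vectors is an assumption made
    for illustration; any analytic function of bounded latents would do.
    """
    rng = np.random.default_rng(seed)
    # Dividing by sqrt(dim) keeps every latent vector inside the unit ball.
    a = rng.uniform(-1.0, 1.0, size=(m, dim)) / np.sqrt(dim)
    b = rng.uniform(-1.0, 1.0, size=(n, dim)) / np.sqrt(dim)
    return np.exp(a @ b.T)


def truncated_svd(X, r):
    """Return the rank-r truncated SVD approximation of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


def entrywise_rank(X, eps):
    """Smallest r for which the rank-r truncated SVD is within eps of X entrywise."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    approx = np.zeros_like(X)
    for r in range(1, len(s) + 1):
        # Add the r-th singular triplet to the running approximation.
        approx += s[r - 1] * np.outer(U[:, r - 1], Vt[r - 1])
        if np.max(np.abs(X - approx)) <= eps:
            return r
    return len(s)


if __name__ == "__main__":
    # Print size, numerical rank, and the rank needed for entrywise error 0.05.
    for size in (50, 100, 200, 400):
        X = latent_variable_matrix(size, size)
        print(size, np.linalg.matrix_rank(X), entrywise_rank(X, eps=0.05))
```

Running the script prints, for each matrix size, the numerical rank of the sampled matrix next to the rank at which the truncated SVD reaches the entrywise tolerance, which allows the two to be compared as the matrix grows.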
Pages: 144-160
Page count: 17
Related papers
50 in total
  • [1] INFERENCE ON LOW-RANK DATA MATRICES WITH APPLICATIONS TO MICROARRAY DATA
    Feng, Xingdong
    He, Xuming
    ANNALS OF APPLIED STATISTICS, 2009, 3 (04): 1634-1654
  • [2] Limitations on low rank approximations for covariance matrices of spatial data
    Stein, Michael L.
    SPATIAL STATISTICS, 2014, 8: 1-19
  • [3] On the compression of low rank matrices
    Cheng, H
    Gimbutas, Z
    Martinsson, PG
    Rokhlin, V
    SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2005, 26 (04): 1389-1404
  • [4] Robust Low-Rank Approximation of Data Matrices With Elementwise Contamination
    Maronna, Ricardo A.
    Yohai, Victor J.
    TECHNOMETRICS, 2008, 50 (03): 295-304
  • [5] WHY BIG DATA = BIG DEAL
    Dhar, Vasant
    BIG DATA, 2014, 2 (02): 55+
  • [8] Low Rank Approximation of a Set of Matrices
    Hasan, Mohammed A.
    2010 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, 2010: 3517-3520
  • [9] Improved Testing of Low Rank Matrices
    Li, Yi
    Wang, Zhengyu
    Woodruff, David P.
    PROCEEDINGS OF THE 20TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'14), 2014: 691-700
  • [10] Powers of low rank sparse matrices
    Cohen, Keren
    THEORETICAL COMPUTER SCIENCE, 2025, 1032