Why Are Big Data Matrices Approximately Low Rank?

Cited by: 120

Authors:

Udell, Madeleine [1]

Townsend, Alex [2]

Affiliations:

[1] Cornell Univ, Dept Operat Res & Informat Engn, Ithaca, NY 14853 USA

[2] Cornell Univ, Dept Math, Ithaca, NY 14853 USA

Source:

SIAM JOURNAL ON MATHEMATICS OF DATA SCIENCE | 2019 / Vol. 1 / No. 1

Funding:

U.S. National Science Foundation;

Keywords:

big data; low rank matrices; Johnson-Lindenstrauss lemma; SINGULAR-VALUES; MICROARRAY DATA; DECOMPOSITION; DISCOVERY;

DOI:

10.1137/18M1183480

Chinese Library Classification:

O29 [Applied Mathematics];

Subject classification code:

070104;

Abstract:

Matrices of (approximate) low rank are pervasive in data science, appearing in movie preferences, text documents, survey data, medical records, and genomics. While there is a vast literature on how to exploit low rank structure in these datasets, there is less attention paid to explaining why the low rank structure appears in the first place. Here, we explain the effectiveness of low rank models in data science by considering a simple generative model for these matrices: we suppose that each row or column is associated to a (possibly high dimensional) bounded latent variable, and entries of the matrix are generated by applying a piecewise analytic function to these latent variables. These matrices are in general full rank. However, we show that we can approximate every entry of an m × n matrix drawn from this model to within a fixed absolute error by a low rank matrix whose rank grows as O(log(m + n)). Hence any sufficiently large matrix from such a latent variable model can be approximated, up to a small entrywise error, by a low rank matrix.
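The generative model in the abstract can be explored numerically. The following is a minimal sketch, not the authors' construction: it assumes latent variables drawn uniformly from a bounded cube, an illustrative analytic function exp(a·b / d), and a truncated SVD as a stand-in low rank approximation. The function names, the latent dimension, and the error tolerance are all hypothetical choices made for illustration.

```python
import numpy as np


def latent_variable_matrix(m, n, latent_dim=3, seed=0):
    """Draw an m x n matrix from a simple latent variable model (illustrative)."""
    rng = np.random.default_rng(seed)
    # Bounded latent variables: one per row and one per column,
    # sampled uniformly from the cube [-1, 1]^latent_dim (an assumption).
    alpha = rng.uniform(-1.0, 1.0, size=(m, latent_dim))
    beta = rng.uniform(-1.0, 1.0, size=(n, latent_dim))
    # Entry (i, j) applies an analytic function to the pair of latent variables.
    return np.exp(alpha @ beta.T / latent_dim)


def entrywise_rank(X, eps):
    """Smallest r whose rank-r truncated SVD is within eps of X in every entry."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    approx = np.zeros_like(X)
    for r in range(1, len(s) + 1):
        # Add the r-th singular triplet to the running approximation.
        approx += s[r - 1] * np.outer(U[:, r - 1], Vt[r - 1])
        if np.max(np.abs(X - approx)) <= eps:
            return r
    return len(s)


if __name__ == "__main__":
    eps = 0.01  # fixed absolute entrywise error (hypothetical tolerance)
    for size in (50, 100, 200, 400):
        X = latent_variable_matrix(size, size)
        print(size, entrywise_rank(X, eps))
```

Running the script prints, for each matrix size, the rank needed to reach the fixed entrywise tolerance, which can be compared against the matrix dimension. The truncated SVD is optimal in the spectral and Frobenius norms but not necessarily in the entrywise maximum norm, so the ranks it reports are only an upper bound on what an entrywise-optimal approximation would need.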


Pages: 144-160

Page count: 17
