Fast and accurate out-of-core PCA framework for large scale biobank data

Cited by: 4
Authors
Li, Zilong [1]
Meisner, Jonas [2, 3]
Albrechtsen, Anders [1]
Affiliations
[1] Univ Copenhagen, Dept Biol, Sect Computat & RNA Biol, DK-2200 Copenhagen, Denmark
[2] Copenhagen Univ Hosp, Mental Hlth Ctr Copenhagen, Biol & Precis Psychiat, DK-2100 Copenhagen, Denmark
[3] Univ Copenhagen, Novo Nord Fdn Ctr Prot Res, DK-2200 Copenhagen, Denmark
Keywords
ALGORITHM; GENOME;
DOI
10.1101/gr.277525.122
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification code
071010 ; 081704 ;
Abstract
Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm implemented in PCAone, featuring a window-based optimization scheme that accelerates convergence while improving accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations of the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets in different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly faster computation time while maintaining accuracy comparable to the slower IRAM method. Notably, our analyses of UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, utilizing <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.
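For readers unfamiliar with RSVD, the following is a minimal sketch of the standard randomized SVD with power iterations, written with NumPy. It is not PCAone's window-based algorithm and it is not out-of-core: it assumes the whole matrix fits in memory. The function name `randomized_svd` and the parameters `oversample` and `n_power_iter` are illustrative choices, not taken from the paper.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, n_power_iter=4, seed=0):
    """Approximate the top-k singular triplets of A (illustrative sketch).

    Generic randomized SVD with power iterations; NOT the window-based
    scheme described in the abstract. All parameter names are assumptions.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Sketch size: target rank plus a few extra columns for accuracy.
    l = min(k + oversample, n_rows, n_cols)

    # Random test matrix; A @ Omega samples the column space of A.
    Omega = rng.standard_normal((n_cols, l))
    Q, _ = np.linalg.qr(A @ Omega)

    # Power iterations sharpen the basis when singular values decay slowly.
    # Re-orthonormalizing at every step keeps the computation stable.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # Project A onto the small basis and take an exact SVD there.
    B = Q.T @ A
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], S[:k], Vt[:k]
```

Each power iteration makes two full passes over `A`, which is the costly step when the data live on disk; reducing the number of such passes is the kind of cost that a faster-converging scheme aims to cut.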
Pages: 1599-1608
Number of pages: 10