Cooler: scalable storage for Hi-C data and other genomically labeled arrays

Cited by: 410
Authors
Abdennur, Nezar [1]
Mirny, Leonid A. [1,2]
Affiliations
[1] MIT, Inst Med Engn & Sci, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[2] MIT, Dept Phys, Cambridge, MA 02139 USA
Funding
National Institutes of Health (USA)
DOI
10.1093/bioinformatics/btz540
Chinese Library Classification (CLC)
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis.

Results: We developed a file format called cooler, based on a sparse data model, that can support genomically labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium.

Availability and implementation: Cooler is cross-platform, BSD-licensed and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.

Supplementary information: Supplementary data are available at Bioinformatics online.
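
To make the phrase "sparse data model" in the abstract concrete, the following is a minimal, illustrative Python sketch of one way a genomically labeled matrix can be stored sparsely: a table of genomic bins plus a list of nonzero (bin1, bin2, count) records for the upper triangle of a symmetric contact matrix. It is not the cooler library's API or its on-disk HDF5 layout; the function names (`make_bins`, `to_sparse_pixels`, `query`), the toy chromosome sizes and the example contacts are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def make_bins(chromsizes, binsize):
    """Split each chromosome into fixed-width bins: a list of (chrom, start, end)."""
    bins = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


def to_sparse_pixels(contacts, bins, binsize):
    """Aggregate pairwise contacts into sorted (bin1_id, bin2_id, count) records.

    Only nonzero entries of the upper triangle (bin1_id <= bin2_id) are kept,
    so storage grows with the number of observed bin pairs rather than with
    the square of the number of bins.
    """
    bin_id = {(chrom, start): i for i, (chrom, start, _end) in enumerate(bins)}
    counts = defaultdict(int)
    for (chrom1, pos1), (chrom2, pos2) in contacts:
        i = bin_id[(chrom1, pos1 // binsize * binsize)]
        j = bin_id[(chrom2, pos2 // binsize * binsize)]
        if i > j:  # the matrix is symmetric, so store each pair once
            i, j = j, i
        counts[(i, j)] += 1
    return sorted((i, j, n) for (i, j), n in counts.items())


def query(pixels, lo, hi):
    """Rebuild the dense symmetric submatrix for bin ids in [lo, hi)."""
    size = hi - lo
    dense = [[0] * size for _ in range(size)]
    for i, j, n in pixels:
        if lo <= i < hi and lo <= j < hi:
            dense[i - lo][j - lo] = n
            dense[j - lo][i - lo] = n
    return dense


# Toy genome and contacts (hypothetical values, for illustration only).
chromsizes = {"chr1": 250, "chr2": 120}
binsize = 100
bins = make_bins(chromsizes, binsize)
contacts = [
    (("chr1", 10), ("chr1", 150)),
    (("chr1", 160), ("chr1", 20)),
    (("chr1", 210), ("chr2", 5)),
    (("chr2", 110), ("chr2", 115)),
]
pixels = to_sparse_pixels(contacts, bins, binsize)
chr1_block = query(pixels, 0, 3)  # bins 0..2 cover chr1 in this toy genome
```

In this sketch the dense 5 x 5 matrix would hold 25 cells, while the sparse form holds only the three bin pairs that were actually observed. A production format would additionally need indexes for fast range queries and chunked compression, which the abstract attributes to the HDF5-based design.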
Pages: 311-316
Page count: 6