These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

Cited by: 45
Authors
Zhang, Qingpeng [1]
Pell, Jason [1]
Canino-Koning, Rosangela [1]
Howe, Adina Chuang [2, 3]
Brown, C. Titus [1, 2]
Affiliations
[1] Michigan State Univ, Dept Comp Sci & Engn, E Lansing, MI 48824 USA
[2] Michigan State Univ, Dept Microbiol & Mol Genet, E Lansing, MI 48824 USA
[3] Michigan State Univ, Dept Plant Soil & Microbial Sci, E Lansing, MI 48824 USA
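Source
PLOS ONE | 2014 / Volume 9 / Issue 07
Funding
United States Department of Agriculture; National Science Foundation (USA); National Institutes of Health (USA);
Keywords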
GENERATION; SIZE;
DOI
10.1371/journal.pone.0101271
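Chinese Library Classification number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code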
07 ; 0710 ; 09 ;
Abstract
K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
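The trade-off described in the abstract (counts can be updated and queried online, collisions can only inflate a count, and the k-mers themselves are never stored) can be illustrated with a minimal Count-Min Sketch. This is a sketch under stated assumptions, not khmer's implementation or API: the class name `CountMinSketch`, the helper `kmers`, the salted BLAKE2b hashing, and the `width` and `depth` parameters are all hypothetical choices made for illustration, and strand-aware (canonical) k-mer handling is left out.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch (hypothetical, not khmer's API).

    It keeps `depth` rows of `width` counters. Only counters are stored,
    never the k-mers themselves.
    """

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, kmer):
        # One hash per row, made independent by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), salt=str(row).encode()
            ).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, kmer):
        # Online update: increment one counter in every row.
        for row, col in self._indexes(kmer):
            self.rows[row][col] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum over rows
        # never undercounts; it may overcount.
        return min(self.rows[row][col] for row, col in self._indexes(kmer))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# Usage: counts are available at any point while reads stream in.
sketch = CountMinSketch(width=1000, depth=4)
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)

# "ACGT" occurs twice in the sequence, so the estimate is at least 2.
print(sketch.count("ACGT"))
```

Because every query takes the minimum over rows, the estimate can exceed the true count but cannot fall below it, which is the systematic overcount the abstract mentions. Making the table smaller relative to the number of distinct k-mers saves memory at the cost of a higher miscount rate.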
Pages: 13