These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

Cited by: 45

Authors:

Zhang, Qingpeng [1]

Pell, Jason [1]

Canino-Koning, Rosangela [1]

Howe, Adina Chuang [2,3]

Brown, C. Titus [1,2]

Affiliations:

[1] Michigan State Univ, Dept Comp Sci & Engn, E Lansing, MI 48824 USA

[2] Michigan State Univ, Dept Microbiol & Mol Genet, E Lansing, MI 48824 USA

[3] Michigan State Univ, Dept Plant Soil & Microbial Sci, E Lansing, MI 48824 USA

Source:

PLOS ONE | 2014 / Vol. 9 / Issue 07

Funding:

U.S. Department of Agriculture; U.S. National Science Foundation; U.S. National Institutes of Health

Keywords:

GENERATION; SIZE;

DOI:

10.1371/journal.pone.0101271

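Chinese Library Classification (CLC):

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]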

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
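The trade-off the abstract describes (online updates, stored counts only, and a possible overcount but never an undercount) can be illustrated with a minimal Count-Min Sketch. This is a toy sketch under stated assumptions, not khmer's implementation or API: the class `CountMinSketch`, the helper `kmers`, the salted BLAKE2b hashing, and the `width`/`depth` parameters are all hypothetical choices made for illustration, and reverse-complement handling is omitted.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: `depth` tables of `width` counters each.

    Illustrative only; names and hashing scheme are assumptions,
    not taken from the khmer package.
    """

    def __init__(self, width, depth):
        if not 0 < depth < 256:
            raise ValueError("depth must be between 1 and 255")
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One counter index per table, from a BLAKE2b hash salted with the table number.
        for i in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield i, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        # Online update: increment one counter in every table.
        for i, j in self._slots(kmer):
            self.tables[i][j] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across
        # tables is the tightest estimate and is never below the true count.
        return min(self.tables[i][j] for i, j in self._slots(kmer))


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


sketch = CountMinSketch(width=1000, depth=4)
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)
```

Only the counter tables are kept in memory, so the k-mers themselves cannot be listed back from the sketch; a query such as `sketch.count("ACGT")` returns an estimate that is at least the true count and may be higher when the tables are small relative to the number of distinct k-mers.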


Pages: 13

