A Framework for Space-Efficient String Kernels

Authors
Djamal Belazzougui
Fabio Cunial
Affiliations
[1] Centre de Recherche sur l’Information Scientifique et Technique (DTISI-CERIST)
[2] Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG)
Source
Algorithmica | 2017 / Vol. 79
Keywords
Substring kernel; Substring complexity; Burrows–Wheeler transform; Maximal repeat; Minimal absent word; Suffix-link tree; Probabilistic suffix tree; Variable-length Markov chain; Matching statistics;
DOI
Not available
Abstract
String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a rangeDistinct data structure on the Burrows–Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data.
All such algorithms become O(n) using a suitable implementation of the rangeDistinct data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in O(n log σ) bits of space in addition to the input, where σ is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, that takes just 3n log σ + o(n log σ) bits of space, and that can be learnt in randomized O(n) time using O(n log σ) bits of space in addition to the input.
Such a model can then be used to assign a probability to a query string S of length m in O(m) time and in 2m + o(m) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.
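
To make two of the quantities named in the abstract concrete, the following is a minimal Python sketch of a cosine-normalized k-mer kernel and of the k-th order empirical entropy. It is an illustration only: it uses plain hash tables rather than the Burrows–Wheeler transform and rangeDistinct machinery of the paper, so it does not achieve the stated space bounds. The function names and the choice of cosine normalization are assumptions made for this example, not taken from the paper.

```python
from collections import Counter, defaultdict
from math import log2, sqrt


def kmer_counts(s: str, k: int) -> Counter:
    """Count every substring of length k in s (naive, hash-table based)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> float:
    """Cosine of the angle between the k-mer count vectors of s and t.

    Assumption: cosine normalization is one common way to define the
    kernel; other normalizations are possible.
    """
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    dot = sum(c * ct[w] for w, c in cs.items())
    norm = sqrt(sum(c * c for c in cs.values())) * sqrt(sum(c * c for c in ct.values()))
    return dot / norm if norm else 0.0


def empirical_entropy(s: str, k: int) -> float:
    """k-th order empirical entropy of s, in bits per character.

    For each length-k context w, collect the characters that follow w in s,
    sum the zeroth-order entropy of those characters weighted by their
    number, and divide by the length of s.
    """
    if not s:
        return 0.0
    followers = defaultdict(Counter)
    for i in range(len(s) - k):
        followers[s[i:i + k]][s[i + k]] += 1
    total = 0.0
    for counts in followers.values():
        n_w = sum(counts.values())
        total -= sum(c * log2(c / n_w) for c in counts.values())
    return total / len(s)
```

For example, `kmer_kernel(S, T, 3)` compares two strings by their 3-mer composition without aligning them, and `empirical_entropy(T, k)` can be evaluated for several values of k to see how quickly longer contexts make T predictable.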
Pages: 857-883
Page count: 26