A Bayesian nonparametric approach to count-min sketch under power-law data streams

Cited by: 0
Authors
Dolera, Emanuele [1]
Favaro, Stefano [2, 3]
Peluchetti, Stefano [4]
Affiliations
[1] Univ Pavia, Pavia, Italy
[2] Univ Torino, Turin, Italy
[3] Coll Carlo Alberto, Turin, Italy
[4] Cogent Labs, Tokyo, Japan
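Funding
European Research Council
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract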
The count-min sketch (CMS) is a randomized data structure that estimates the frequencies of tokens in a large data stream from a compressed representation of the data obtained by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token's frequency in the stream, given the hashed data, and in turn the corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in natural language processing, where power-law behaviour of the data is common.
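For readers unfamiliar with the underlying data structure, the following is a minimal sketch of the classical count-min sketch that the abstract builds on. It is not the BNP estimator proposed in the paper: the class name `CountMinSketch`, the parameters `width`, `depth`, and `seed`, and the use of salted calls to Python's built-in `hash` in place of a pairwise-independent hash family are all illustrative assumptions.

```python
import random


class CountMinSketch:
    """Classical count-min sketch: `depth` rows of `width` counters (illustrative)."""

    def __init__(self, width, depth, seed=0):
        self.width = width
        self.depth = depth
        rng = random.Random(seed)
        # Assumption: one random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, token):
        # Map the token to one counter in the given row.
        return hash((self.salts[row], token)) % self.width

    def update(self, token, count=1):
        # Each occurrence increments exactly one counter per row.
        for row in range(self.depth):
            self.table[row][self._bucket(row, token)] += count

    def estimate(self, token):
        # Collisions only add to counters, so the minimum over rows
        # is an upper bound on the true frequency.
        return min(
            self.table[row][self._bucket(row, token)]
            for row in range(self.depth)
        )


stream = ["the", "the", "cat", "sat", "the", "mat"]
cms = CountMinSketch(width=16, depth=4)
for token in stream:
    cms.update(token)

print(cms.estimate("the"))  # never below the true count of 3
```

Because the classical estimate can only overcount, the relative error tends to be largest for rare tokens, which is the low-frequency regime the abstract targets.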
Pages: 226 / +
Number of pages: 11
Related papers
11 items in total
  • [1] A Bayesian Nonparametric View on Count-Min Sketch
    Cai, Diana
    Mitzenmacher, Michael
    Adams, Ryan P.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [2] Dynamic Count-Min Sketch for Analytical Queries over Continuous Data Streams
    Zhu, Xiaobo
    Wu, Guangjun
    Zhang, Hong
    Wang, Shupeng
    Ma, Bingnan
    2018 IEEE 25TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), 2018, : 225 - 234
  • [3] An improved data stream summary: the count-min sketch and its applications
    Cormode, G
    Muthukrishnan, S
    JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC, 2005, 55 (01): : 58 - 75
  • [4] An improved data stream summary: The count-min sketch and its applications
    Cormode, G
    Muthukrishnan, S
    LATIN 2004: THEORETICAL INFORMATICS, 2004, 2976 : 29 - 38
  • [5] A Bayesian nonparametric approach to correct for underreporting in count data
    Arima, Serena
    Polettini, Silvia
    Pasculli, Giuseppe
    Gesualdo, Loreto
    Pesce, Francesco
    Procaccini, Deni-Aldo
    BIOSTATISTICS, 2023, 25 (03) : 904 - 918
  • [6] Set-Min Sketch: A Probabilistic Map for Power-Law Distributions with Application to k-Mer Annotation
    Shibuya, Yoshihiro
    Belazzougui, Djamal
    Kucherov, Gregory
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2022, 29 (02) : 140 - 154
  • [8] Data-Driven Adaptive Robust Unit Commitment Under Wind Power Uncertainty: A Bayesian Nonparametric Approach
    Ning, Chao
    You, Fengqi
    IEEE TRANSACTIONS ON POWER SYSTEMS, 2019, 34 (03) : 2409 - 2418
  • [9] Bayesian analysis on a natural conjugate prior for the nonhomogeneous Poisson process with a power-law intensity under time-truncated sampling
    Huang, Po-Yao
    Huang, Yeu-Shiang
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2024,
  • [10] Data-Driven Chance-Constrained Optimal Gas-Power Flow Calculation: A Bayesian Nonparametric Approach
    Wang, Jingyao
    Wang, Cheng
    Liang, Yile
    Bi, Tianshu
    Shafie-khah, Miadreza
    Catalao, Joao P. S.
    IEEE TRANSACTIONS ON POWER SYSTEMS, 2021, 36 (05) : 4683 - 4698