Adaptive Estimation of Weight Coefficient in a Time-weighted Incremental EM-algorithm for Data Streams

Cited by: 0

Authors:

Nissenbaum, Olga V. [1]

Kharchenko, Anastasia M. [1]

Affiliations:

[1] Tyumen State Univ, Tyumen, Russia

Source:

VESTNIK TOMSKOGO GOSUDARSTVENNOGO UNIVERSITETA-UPRAVLENIE VYCHISLITELNAJA TEHNIKA I INFORMATIKA-TOMSK STATE UNIVERSITY JOURNAL OF CONTROL AND COMPUTER SCIENCE | 2016 / Vol. 37 / Issue 04

Keywords:

clustering; time-weight; data stream; Gaussian mixture model; damped window model;

DOI:

10.17223/19988605/37/7

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Data stream clustering has a wide range of applications, including mining data generated by sensor networks, meteorological analysis, stock market analysis, computer network traffic monitoring, and long-term population-based, cognitive, and real-time sociological studies. These applications involve datasets that are far too large to fit in main memory and are typically stored on a secondary storage device. Real-time processing of large volumes of data requires efficient, fast algorithms and data compression. Moreover, recently obtained objects are often more important than old ones, so several clustering algorithms implement a damped window model with a user-defined decay coefficient. The appropriate value of this coefficient is not obvious. An important task in the development of machine learning algorithms is to reduce the number of user-defined parameters through adaptive mechanisms. Such a mechanism for determining the decay coefficient is proposed in this paper. A Gaussian mixture data stream algorithm is constructed with a damped window model. In this model the weight of an object is $w = w(\Delta t)$, where $\Delta t$ is the time elapsed since the object arrived from the stream into the cluster, and the weight of a cluster $W(t)$ is the sum of the weights of all its objects. Recent objects receive higher weight than older ones, and the weights of objects decrease with time.
It is usual to assume $w(\Delta t) = e^{-a \Delta t}$ ($a \ge 0$), which has the following properties:
1) on arrival, the weight of an object equals one: $w(0) = 1$;
2) over time the weight of an object decreases monotonically to zero: $\lim_{\Delta t \to \infty} w(\Delta t) = 0$;
3) if the weight $w(\Delta t_1)$ of an object after $\Delta t_1$ units of time is known, then after a further $\Delta t_2$ it can easily be recalculated as $w(\Delta t_1 + \Delta t_2) = w(\Delta t_1) w(\Delta t_2)$;
4) if the weight $W(t)$ of a cluster at time $t$ is known, and no new objects arrived in this cluster during the next time period $\Delta t$, then the weight of the cluster can be recalculated as $W(t + \Delta t) = W(t) w(\Delta t)$.
Suppose that at the initial instant we have a set of $K$ Gaussian clusters $C_k$ described by their means $\mu_k$ and correlation matrices $\Sigma_k$. The weight $W_k$ at the initial time is defined as the number of objects in cluster $C_k$. Let $x_1, x_2, \ldots \in \mathbb{R}^d$ be a data stream, i.e. a set of objects with timestamps $t_1, t_2, \ldots$. When a new object $x$ with timestamp $t$ arrives, the following algorithm is executed.
Input: new object $x$; cluster parameters: mean $\mu_k$, correlation matrix $\Sigma_k$, weight $W_k$ ($k = 1, 2, \ldots, K$); time period $\Delta t$ since the last object arrived.
1. Recalculate the weights: $W_k \leftarrow e^{-a \Delta t} W_k$ ($k = 1, 2, \ldots, K$).
2. Calculate the probability that $x$ belongs to the $k$-th cluster: $\pi_k = \frac{W_k \varphi_k(x \mid \mu_k, \Sigma_k)}{\sum_{i=1}^{K} W_i \varphi_i(x \mid \mu_i, \Sigma_i)}$. Put $x$ into the most probable cluster.
3. For this cluster (the one that now contains $x$; the cluster index is dropped) recalculate the mean $\mu$, the correlation matrix $\Sigma$ and the weight $W$: $\mu^{+} = \frac{W \mu^{-} + x}{W + 1}$; $\Sigma^{+} = \frac{W}{W + 1}\left(\Sigma^{-} + \frac{(\mu^{-} - x)^2}{W + 1}\right)$; $W \leftarrow W + 1$, where the indices $-$ and $+$ correspond to the values before and after recalculation.
Note that the weight function depends on $a \ge 0$, which may be defined individually for each $C_k$.
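The three steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `gaussian_pdf` and `process_object` are hypothetical, $\varphi_k$ is taken to be the multivariate normal density, and the squared term $(\mu^{-} - x)^2$ is read as the outer product of the difference vector with itself.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density at x (assumed form of phi_k)."""
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)


def process_object(x, dt, means, covs, weights, a):
    """Process one stream object; returns updated copies of the parameters.

    a may be a scalar or an array holding one decay coefficient per cluster.
    """
    x = np.asarray(x, dtype=float)
    means = np.array(means, dtype=float)   # np.array copies its input
    covs = np.array(covs, dtype=float)
    # Step 1: decay every cluster weight by exp(-a * dt).
    weights = np.array(weights, dtype=float) * np.exp(-np.asarray(a, dtype=float) * dt)
    # Step 2: membership probabilities; pick the most probable cluster.
    scores = np.array([w * gaussian_pdf(x, m, c)
                       for w, m, c in zip(weights, means, covs)])
    probs = scores / scores.sum()
    k = int(np.argmax(probs))
    # Step 3: incremental update of the chosen cluster.
    w = weights[k]
    diff = means[k] - x                    # uses the mean before the update
    # Assumption: (mu - x)^2 is interpreted as an outer product.
    covs[k] = w / (w + 1.0) * (covs[k] + np.outer(diff, diff) / (w + 1.0))
    means[k] = (w * means[k] + x) / (w + 1.0)
    weights[k] = w + 1.0
    return k, probs, means, covs, weights
```

Each call costs one density evaluation per cluster and stores only the cluster parameters, which is consistent with the low resource requirements claimed for the method.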
We propose the following algorithm to recalculate $a$ separately for each cluster at the time an object arrives in the cluster.
Input: $x_1, x_2, \ldots, x_N$, the last $N$ objects put into the cluster; $t_1, t_2, \ldots, t_N$, their timestamps; $\mu_0$, the cluster mean; $\Sigma_0$, the cluster correlation matrix.
1. At the initial instant $t_0$ set $a = 0$, i.e. assume that the cluster is not moving.
2. When object $x_i$ with timestamp $t_i$ ($i \le N$) arrives in the cluster, accumulate the quantities $\mu = (x_1 + x_2 + \ldots + x_N)/N$ and $\sum_{i=1}^{N} (t_i - t_0)$.
3. If $i = N$, calculate $v = \frac{N(\mu - \mu_0)}{\sum_{i=1}^{N} (t_i - t_0)}$ from the accumulated quantities. Calculate $a = -\frac{|v| \ln \varepsilon}{r}$, where $r$ is the Euclidean distance between $\mu_0$ and the boundary of the confidence ellipse (confidence probability $p = 1 - \varepsilon$) in the direction of the vector $v$. Set $t_0 = t_N$, assign to $\mu_0$ and $\Sigma_0$ the corresponding values of the cluster at time $t_N$, and return to step 1.
We performed an experiment using a simulation model of Gaussian mixture clusters with moving centroids. Some results are shown in this article. The proposed algorithm has low time and memory requirements and is therefore suitable for real-time monitoring in large dynamic systems, such as computer systems and networks. The experimental data show that the quality of the decay coefficient adaptation is acceptable.
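The decay coefficient estimate can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `estimate_decay` and its parameters are hypothetical, and the sums are computed in batch rather than accumulated object by object. The abstract does not say how the confidence ellipse is constructed, so the sketch assumes the usual Gaussian ellipse $(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) \le c$, where $c$ is the chi-square quantile for probability $1 - \varepsilon$. For two dimensions this quantile is $c = -2 \ln \varepsilon$; for other dimensions the caller must supply it.

```python
import numpy as np


def estimate_decay(xs, ts, t0, mu0, sigma0, eps=0.05, chi2_quantile=None):
    """Estimate the decay coefficient a from the last N objects of a cluster.

    Assumes at least one timestamp lies after t0, so the time sum is positive.
    """
    xs = np.asarray(xs, dtype=float)
    ts = np.asarray(ts, dtype=float)
    mu0 = np.asarray(mu0, dtype=float)
    n = len(xs)
    mu = xs.mean(axis=0)                   # mean of the last N objects
    elapsed = np.sum(ts - t0)              # sum of (t_i - t0)
    v = n * (mu - mu0) / elapsed           # estimated drift velocity
    speed = np.linalg.norm(v)
    if speed == 0.0:
        return 0.0                         # cluster is not moving
    if chi2_quantile is None:
        if mu0.shape[0] != 2:
            raise ValueError("pass chi2_quantile explicitly unless d == 2")
        # Chi-square with 2 degrees of freedom: P(X <= c) = 1 - exp(-c / 2).
        chi2_quantile = -2.0 * np.log(eps)
    u = v / speed                          # unit vector along v
    # Distance from mu0 to the ellipse boundary along u (assumed ellipse form).
    r = np.sqrt(chi2_quantile / (u @ np.linalg.solve(sigma0, u)))
    return float(-speed * np.log(eps) / r)
```

With this value of $a$, the weight $e^{-a \Delta t}$ of an object falls to $\varepsilon$ after $\Delta t = r / |v|$, the time a centroid moving at speed $|v|$ needs to reach the ellipse boundary.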

Citation

Pages: 65-72

Page count: 8

23 items in total

[1] Time-weighted counting for recently frequent pattern mining in data streams
Lim, Yongsub
Kang, U.
KNOWLEDGE AND INFORMATION SYSTEMS, 2017, 53 (02) : 391 - 422
[2] Time-weighted counting for recently frequent pattern mining in data streams
Yongsub Lim
U. Kang
Knowledge and Information Systems, 2017, 53 : 391 - 422
[3] High Dimension Finite Mixture Gaussian Model Estimation for Short Time Fourier Decomposition by EM-Algorithm
Chen, Mei
Liu, Yan
Zhuang, Mingguang
2008 INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION, VOLS 1-4, 2008, : 686 - +
[4] Maximum likelihood estimation for tied survival data under Cox regression model via EM-algorithm
Thomas H. Scheike
Yanqing Sun
Lifetime Data Analysis, 2007, 13 : 399 - 420
[5] Maximum likelihood estimation for tied survival data under Cox regression model via EM-algorithm
Scheike, Thomas H.
Sun, Yanqing
LIFETIME DATA ANALYSIS, 2007, 13 (03) : 399 - 420
[6] CUDA-based parallelization of time-weighted dynamic time warping algorithm for time series analysis of remote sensing data
Guo, Hengliang
Xu, Bowen
Yang, Hong
Li, Bingyang
Yue, Yuanyuan
Zhao, Shan
COMPUTERS & GEOSCIENCES, 2022, 164
[7] On data and parameter estimation using the variational Bayesian EM-algorithm for block-fading frequency-selective MIMO channels
Christensen, Lars P. B.
Larsen, Jan
2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-13, 2006, : 4135 - 4138
[8] A modified em-algorithm for estimating the parameters of inverse Gaussian distribution based on time-censored Wiener degradation data
Lee, Ming-Yung
Tang, Jen
STATISTICA SINICA, 2007, 17 (03) : 873 - 893
[9] A partial imputation EM-algorithm to adjust the overestimated shape parameter of the Weibull distribution fitted to the clinical time-to-event data
Choi, Kyungmee
Park, Sung Min
Han, Seunghoon
Yim, Dong-Seok
COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2020, 197
[10] Data Fusion Algorithm Based on Classification Adaptive Estimation Weighted Fusion in WSN
Yan, Dong
Liu, Peixue
Yue, Xiujie
Wang, Penghao
Liu, Minghua
Li, Baoshun
WIRELESS PERSONAL COMMUNICATIONS, 2022, 127 (04) : 2859 - 2871
