Efficient Online Evaluation of Big Data Stream Classifiers

Cited by: 110
Authors
Bifet, Albert [1]
De Francisci Morales, Gianmarco [2]
Read, Jesse [3]
Holmes, Geoff [4]
Pfahringer, Bernhard [4]
Affiliations
[1] Huawei, Noah's Ark Lab, Hong Kong, People's Republic of China
[2] Aalto University, Helsinki, Finland
[3] Aalto University, HIIT, Helsinki, Finland
[4] University of Waikato, Hamilton, New Zealand
Keywords
Data Streams; Evaluation; Online Learning; Classification; Agreement
DOI
10.1145/2783258.2783372
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
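The contrast the abstract draws between a single prequential model and evaluation over multiple models can be illustrated with a small sketch. This is not the paper's implementation: the `MajorityClass` learner, the `streaming_k_fold` function, the random fold assignment, and the synthetic `make_stream` generator are all hypothetical names and choices made for illustration. The idea shown is test-then-train with k models, where each arriving instance is scored by one held-out model and used to train the others, so that k accuracy estimates (and their spread) are available at any point in the stream.

```python
import random
from collections import Counter
from statistics import mean, stdev


class MajorityClass:
    """Toy incremental classifier (illustrative): predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training there is nothing to predict.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def streaming_k_fold(stream, k=5, seed=0):
    """Test-then-train over k models: each instance tests one model and trains the rest."""
    rng = random.Random(seed)
    models = [MajorityClass() for _ in range(k)]
    correct = [0] * k
    tested = [0] * k
    for x, y in stream:
        held_out = rng.randrange(k)  # assumption: uniform random fold assignment
        # Score first, so the instance is unseen by the model evaluated on it.
        tested[held_out] += 1
        if models[held_out].predict(x) == y:
            correct[held_out] += 1
        # Every other model learns from the instance.
        for i, model in enumerate(models):
            if i != held_out:
                model.learn(x, y)
    return [c / t if t else 0.0 for c, t in zip(correct, tested)]


def make_stream(n=2000, seed=1):
    """Synthetic imbalanced stream (assumption): about 80% of labels are class 1."""
    rng = random.Random(seed)
    return [((rng.random(),), 1 if rng.random() < 0.8 else 0) for _ in range(n)]


accuracies = streaming_k_fold(make_stream(), k=5)
print("per-model accuracy:", accuracies)
print("mean:", mean(accuracies), "std dev:", stdev(accuracies))
```

With a single prequential model only one accuracy figure is available; with k models the spread across estimates gives a basis for significance testing. On an imbalanced stream like this one, plain accuracy flatters a majority-class predictor, which is the kind of issue the abstract raises about unbalanced data.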
Pages: 59-68
Page count: 10